Rules Engine Documentation

The Rules Engine is Stage 1 of the classification pipeline - a fast, free classification system based on pattern matching. It classifies URLs by analyzing domains, TLDs, URL paths, and known patterns before falling back to more expensive stages (Vectorize, LLM).


Overview

How Pattern Matching Works

The rules engine processes URLs through a series of cascading rules, each with associated confidence scores. Rules are evaluated in priority order, and the first high-confidence match typically wins.

URL Input
    |
    v
+------------------+
| 1. Owned Domain  | --> 100% confidence if URL matches target_domain
+------------------+
    |
    v
+------------------+
| 2. Domain DB     | --> 95% confidence (4,400+ curated domains)
+------------------+
    |
    v
+------------------+
| 3. TLD Rules     | --> 95% confidence (.gov, .edu, etc.)
+------------------+
    |
    v
+------------------+
| 4. Platform      | --> 85% confidence (YouTube, Reddit, etc.)
+------------------+
    |
    v
+------------------+
| 5. URL Patterns  | --> 70-90% confidence (path-based detection)
+------------------+
    |
    v
+------------------+
| 6. Spam Signals  | --> 70-85% confidence (risk detection)
+------------------+
    |
    v
Classification Result (or pass to Vectorize/LLM)
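
The cascade in the diagram can be sketched as a loop over ordered stages that stops at the first stage returning a result. This is only an illustration under stated assumptions, not the engine's actual code: `runCascade`, the two stage functions, the `"owned"` value, and the source labels are hypothetical names chosen for this example.

```javascript
// Hypothetical sketch of a first-match cascade; all names are illustrative.
function runCascade(stages, input) {
  for (const stage of stages) {
    const match = stage(input); // each stage returns a result object or null
    if (match) return match;
  }
  // No rule matched: signal that a later pipeline stage is needed.
  return { value: null, confidence: 0, source: null, needs_vectorize: true };
}

// Stage 1: owned domain (100% confidence per the diagram).
const ownedDomain = ({ hostname, targetDomain }) =>
  hostname === targetDomain
    ? { value: "owned", confidence: 100, source: "owned_domain" }
    : null;

// Stage 3: TLD rules (95% confidence per the diagram); only .gov is shown.
const tldRule = ({ hostname }) =>
  hostname.endsWith(".gov")
    ? { value: "government_site", confidence: 95, source: "tld_rule" }
    : null;

const stages = [ownedDomain, tldRule];
const govResult = runCascade(stages, { hostname: "irs.gov", targetDomain: "example.com" });
const missResult = runCascade(stages, { hostname: "unknown.io", targetDomain: "example.com" });
console.log(govResult.value, missResult.needs_vectorize);
```

Keeping each stage as a plain function makes the evaluation order explicit in one array, which is one reasonable way to express "first match wins".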

Classification Output Structure

Each dimension is classified independently with its own confidence:

{
  domain_type: { value: "saas_product", confidence: 95, source: "domain_database_v2" },
  tier1_type: { value: "platform", confidence: 95, source: "domain_database_v2" },
  structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" },
  channel_bucket: { value: "pr_earned_media", confidence: 80, source: "derived_from_domain_type" },
  page_type: { value: "blog_post", confidence: 75, source: "url_path_pattern" },
  quality_tier: { value: "tier_1", confidence: 100, source: "domain_rank" },
  modifiers: ["contextual_link"],

  classification_source: "rule",
  final_confidence: 95, // Max of all dimensions
  rules_applied: ["domain_database_v2", "url_path_pattern"],
  needs_vectorize: false,
  needs_llm: false
}

Domain Database

The domain database contains 4,400+ curated domain classifications, providing high-confidence (95%) instant classification for known domains.

Database Structure

Location: packages/api/src/data/domain-database.js

Generated from: classification-data/domains/_master.csv

export const DOMAIN_DATABASE = {
  "salesforce.com": {
    domain_type: "saas_product",
    tier1_type: "platform",
    industry: "crm",
    subcategory: "enterprise_crm",
    status: "covered",
    source: "v2_master",
  },
  "nytimes.com": {
    domain_type: "news_publisher",
    tier1_type: "information",
    industry: "news",
    subcategory: "general",
    status: "covered",
    source: "v2_master",
  },
  // ... 4,400+ more domains
};
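
A lookup against this structure might look like the sketch below. It is an assumption-laden illustration: `lookupDomain` and `SAMPLE_DOMAIN_DATABASE` are hypothetical names, and stripping a leading `www.` is a guess at how hostnames are normalized before lookup. The 95 confidence and the `domain_database_v2` source label come from the output example earlier on this page.

```javascript
// Tiny stand-in for DOMAIN_DATABASE, using the two entries shown above.
const SAMPLE_DOMAIN_DATABASE = {
  "salesforce.com": { domain_type: "saas_product", tier1_type: "platform" },
  "nytimes.com": { domain_type: "news_publisher", tier1_type: "information" },
};

// Hypothetical helper: normalize the hostname, then do a plain key lookup.
// Lowercasing and removing "www." are assumptions about key normalization.
function lookupDomain(db, hostname) {
  const key = hostname.toLowerCase().replace(/^www\./, "");
  // hasOwnProperty avoids false hits on inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(db, key)) return null;
  const entry = db[key];
  return {
    domain_type: { value: entry.domain_type, confidence: 95, source: "domain_database_v2" },
    tier1_type: { value: entry.tier1_type, confidence: 95, source: "domain_database_v2" },
  };
}

const hit = lookupDomain(SAMPLE_DOMAIN_DATABASE, "WWW.Salesforce.com");
const miss = lookupDomain(SAMPLE_DOMAIN_DATABASE, "unknown.example");
console.log(hit.domain_type.value, miss);
```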

Domain Type Distribution

| Domain Type | Count | Examples |
|---|---|---|
| saas_product | 1,639 | Slack, Notion, Salesforce |
| ecommerce_store | 536 | Nike.com, Warby Parker |
| government_site | 257 | IRS.gov, CDC.gov |
| travel_booking | 256 | Delta, Marriott |
| financial_institution | 245 | Chase, PayPal |
| content_publisher | 200+ | BuzzFeed, HuffPost |
| healthcare_institution | 150+ | Mayo Clinic, Cleveland Clinic |

Lazy Loading

The domain database is lazy-loaded so that the 858KB module does not add to cold-start time:

let domainDbModule = null;

async function getDomainDb() {
  if (!domainDbModule) {
    domainDbModule = await import("../../data/domain-database.js");
  }
  return domainDbModule;
}
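
A common variation on this pattern caches the promise instead of the resolved module, so that concurrent first calls share a single load. The sketch below illustrates that idea with a stand-in loader; `createLazyLoader` is a hypothetical helper, not part of the codebase, and the fake loader replaces the real `import()` call so the example runs on its own.

```javascript
// Hypothetical generic lazy loader: `load` is invoked at most once.
function createLazyLoader(load) {
  let cached = null; // holds the promise returned by the first load() call
  return function get() {
    if (cached === null) {
      cached = load();
    }
    return cached;
  };
}

// Stand-in for `() => import("../../data/domain-database.js")`.
let loadCount = 0;
const getDomainDbCached = createLazyLoader(() => {
  loadCount += 1;
  return Promise.resolve({
    DOMAIN_DATABASE: { "salesforce.com": { domain_type: "saas_product" } },
  });
});

const first = getDomainDbCached();
const second = getDomainDbCached();
console.log(first === second, loadCount);
```

One trade-off of this variant is that a rejected promise is cached too, so a failed load would need explicit reset logic before a retry.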

Adding Domains to Database

  1. Edit classification-data/domains/_master.csv
  2. Run npm run build:domains
  3. The build step regenerates domain-database.js automatically

URL Patterns

Pattern Syntax

URL patterns use JavaScript regular expressions with specific conventions:

{
  name: "blog_post", // Human-readable identifier
  pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, // Regex pattern
  page_type: "blog_post", // Classification result
  page_type_category: "editorial", // Category grouping
  confidence: 75, // 0-100 score
  priority: 21, // Lower = checked first
}

Pattern Components

| Component | Description | Example |
|---|---|---|
| ^ | Start of string | ^https?:\/\/ |
| $ | End of string | \/?$/ |
| \/ | Literal slash | \/blog\/ |
| [^\/]+ | Any chars except slash | /blog/[^\/]+/ |
| \d+ | One or more digits | /post/\d+ |
| [a-z]{2} | Exactly 2 letters | /en/, /fr/ |
| (?:...) | Non-capturing group | (?:blog... |
| (...) | Capturing group | Avoid in patterns |
| i | Case insensitive flag | /BLOG/i matches /blog/ |

Common Pattern Examples

// Homepage detection
/^https?:\/\/[^\/]+\/?$/i
// Matches: https://example.com, https://example.com/

// Blog post with slug
/\/(blog|blogs)\/[^\/]+\/?$/i
// Matches: /blog/my-post/, /blogs/article-title

// Date-based article (news sites)
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// Matches: /2024/01/15/breaking-news

// Profile page with @ symbol
/\/@[^\/]+\/?$/i
// Matches: /@username, /@john-doe/

// Product page with ID
/\/(products?)\/[^\/]+\/?$/i
// Matches: /product/widget-123, /products/blue-shoes/

// Listicle pattern
/\/\d+-best[-\/]/i
// Matches: /10-best-tools/, /5-best-practices-for-...

// State-based location page
/\/[a-z-]+-(?:ca|tx|ny|fl)\/?$/i
// Matches: /houston-tx/, /los-angeles-ca/
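
To see how these examples behave on concrete URLs, a small harness like the one below can be used. The regexes are copied from the examples above; the harness itself (`EXAMPLES`, `matchingNames`) and the sample URLs are illustrative assumptions, not part of the engine.

```javascript
// Regexes copied from the examples above; the sample URLs are illustrative.
const EXAMPLES = [
  { name: "homepage", pattern: /^https?:\/\/[^\/]+\/?$/i, url: "https://example.com/" },
  { name: "blog_post", pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, url: "https://example.com/blog/my-post/" },
  { name: "dated_article", pattern: /\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i, url: "https://example.com/2024/01/15/breaking-news" },
  { name: "profile", pattern: /\/@[^\/]+\/?$/i, url: "https://example.com/@john-doe/" },
  { name: "listicle", pattern: /\/\d+-best[-\/]/i, url: "https://example.com/10-best-tools/" },
  { name: "state_location", pattern: /\/[a-z-]+-(?:ca|tx|ny|fl)\/?$/i, url: "https://example.com/houston-tx/" },
];

// Returns the names of every example pattern that matches the URL.
function matchingNames(url) {
  return EXAMPLES.filter((e) => e.pattern.test(url)).map((e) => e.name);
}

console.log(matchingNames("https://example.com/@john-doe/"));
console.log(matchingNames("https://example.com/blog/houston-tx"));
```

The second call returns two names, because most of these patterns are anchored only at the end of the URL. Overlaps like this are the reason patterns carry a priority.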

Pattern Categories

Generic Path Patterns

Location: packages/api/src/data/url-patterns/generic-patterns.js

Platform-agnostic patterns that work across any website:

Editorial Patterns (priority 20-29)

// Blog/Article index
/\/(blog|blogs|articles?|news|posts?|stories)\/?$/i
// -> category_index_page (confidence: 80)

// Blog post with slug
/\/(blog|blogs)\/[^\/]+\/?$/i
// -> blog_post (confidence: 75)

// Date-based articles
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// -> news_article (confidence: 85)

// Press releases
/\/(press-releases?|news-releases?|newsroom)\/[^\/]+/i
// -> press_release (confidence: 85)

// Case studies
/\/(case-stud(y|ies)|success-stor(y|ies))\/[^\/]+/i
// -> case_study_page (confidence: 80)

Listicle Patterns (priority 29)

// Number-prefixed listicles
/\/top-\d+[-\/]/i // -> listicle (90)
/\/best-\d+[-\/]/i // -> listicle (90)
/\/\d+-best[-\/]/i // -> listicle (90)
/\/\d+-ways[-\/]/i // -> listicle (85)
/\/\d+-tips[-\/]/i // -> listicle (85)
/\/\d+-reasons[-\/]/i // -> listicle (85)
/\/\d+-tools[-\/]/i // -> listicle (80)

// Comparison patterns (the "-vs-" separator may appear anywhere in a path segment)
/\/[^\/]*-vs-/i // -> comparison_page (85)
/\/[^\/]*-versus-/i // -> comparison_page (85)

Commercial Patterns (priority 31-50)

// Product pages
/\/(products?)\/[^\/]+\/?$/i
// -> product_page (confidence: 80)

// Category pages
/\/(categor(y|ies)|collections?)\/[^\/]+/i
// -> category_page (confidence: 75-80)

// Pricing
/\/(pricing|plans|packages)\/?$/i
// -> pricing_page (confidence: 90)

// Checkout
/\/(cart|basket|checkout)\/?$/i
// -> checkout_page (confidence: 95)

// Landing pages
/\/(get-started|demo|request-demo|contact-sales)\/?$/i
// -> landing_page (confidence: 85)

Documentation Patterns (priority 51-65)

// Docs index
/\/(docs?|documentation)\/?$/i
// -> category_index_page (confidence: 85)

// Doc pages
/\/(docs?|documentation)\/[^\/]+/i
// -> documentation_page (confidence: 80)

// API reference
/\/(api-?(docs|reference)?)\/[^\/]+/i
// -> api_reference_page (confidence: 85)

// Guides/tutorials
/\/(guides?|tutorials?)\/[^\/]+/i
// -> howto_article (confidence: 80)

// FAQ
/\/(faq|frequently-asked)\/?/i
// -> faq_page (confidence: 85-90)

Utility Patterns (priority 70-95)

// About pages
/\/(about|about-us|company)\/?$/i
// -> about_page (confidence: 85)

// Contact
/\/(contact|contact-us)\/?$/i
// -> contact_page (confidence: 90)

// Careers
/\/(careers|jobs)\/?$/i
// -> jobs_listing (confidence: 90)

// Legal pages
/\/(privacy|privacy-policy)/i // -> legal_privacy_page (95)
/\/(terms|tos)/i // -> legal_terms_page (95)

// Auth pages
/\/(login|signin)\/?$/i // -> login_page (90)
/\/(signup|register)\/?$/i // -> signup_page (90)

UGC Patterns (priority 11-85)

// Profile pages
/\/@[^\/]+\/?$/i // -> profile_page (90)
/\/(user|users|member|members)\/[^\/]+/i // -> profile_page (85)
/\/(author|authors)\/[^\/]+/i // -> profile_page (85)

// Forums
/\/(forum|forums|community)\/?$/i // -> category_index_page (80)
/\/(thread|topic|discussion)\/[^\/]+/i // -> forum_thread (80)

Platform-Specific Patterns

Location: packages/api/src/data/url-patterns/platforms/

Higher-confidence patterns for known platforms:

Social Platforms (social.js)

// YouTube
youtube: {
  domains: /youtube\.com|youtu\.be/i,
  patterns: [
    { pattern: /\/watch\?v=[A-Za-z0-9_-]+/i, page_type: "video_page", confidence: 95 },
    { pattern: /\/@[^\/]+\/?$/i, page_type: "profile_page", confidence: 95 },
    { pattern: /\/channel\/[^\/]+/i, page_type: "profile_page", confidence: 95 },
    { pattern: /\/playlist\?list=/i, page_type: "playlist_page", confidence: 95 },
    { pattern: /\/shorts\/[^\/]+/i, page_type: "video_page", confidence: 95 },
  ]
}

// Reddit
reddit: {
  domains: /reddit\.com/i,
  patterns: [
    { pattern: /\/r\/[^\/]+\/?$/i, page_type: "category_index_page", confidence: 95 },
    { pattern: /\/r\/[^\/]+\/comments\//i, page_type: "forum_thread", confidence: 95 },
    { pattern: /\/user\/[^\/]+/i, page_type: "profile_page", confidence: 90 },
  ]
}

// Twitter/X
twitter: {
  domains: /twitter\.com|x\.com/i,
  patterns: [
    { pattern: /\/[^\/]+\/status\/\d+/i, page_type: "social_post", confidence: 95 },
    { pattern: /\/hashtag\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
    { pattern: /\/i\/spaces\/[A-Za-z0-9]+/i, page_type: "video_page", confidence: 90 },
  ]
}

Publishing Platforms (publishing.js)

// Medium
medium: {
  domains: /medium\.com/i,
  patterns: [
    { pattern: /\/@[^\/]+\/[^\/]+/i, page_type: "blog_post", confidence: 95 },
    { pattern: /\/tag\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
  ]
}

// Substack
substack: {
  domains: /[^.]+\.substack\.com/i,
  patterns: [
    { pattern: /\/p\/[^\/]+/i, page_type: "blog_post", confidence: 95 },
    { pattern: /\/archive/i, page_type: "category_index_page", confidence: 90 },
  ]
}

E-commerce Platforms (commerce.js)

// Shopify stores
shopify: {
  domains: /[^.]+\.myshopify\.com/i,
  patterns: [
    { pattern: /\/products\/[^\/]+/i, page_type: "product_page", confidence: 95 },
    { pattern: /\/collections\/[^\/]+/i, page_type: "category_page", confidence: 95 },
  ]
}

// Amazon
amazon: {
  domains: /amazon\.(com|co\.uk|de|fr|es|it|ca|jp)/i,
  patterns: [
    { pattern: /\/dp\/[A-Z0-9]+/i, page_type: "product_page", confidence: 95 },
    { pattern: /\/gp\/product\/[A-Z0-9]+/i, page_type: "product_page", confidence: 95 },
    { pattern: /\/s\?/i, page_type: "search_results_page", confidence: 90 },
  ]
}

Developer Tools (dev-tools.js)

// GitHub
github: {
  domains: /github\.com/i,
  patterns: [
    { pattern: /\/[^\/]+\/[^\/]+\/?$/i, page_type: "repository_page", confidence: 90 },
    { pattern: /\/[^\/]+\/[^\/]+\/issues\/\d+/i, page_type: "forum_thread", confidence: 95 },
    { pattern: /\/[^\/]+\/[^\/]+\/pull\/\d+/i, page_type: "forum_thread", confidence: 95 },
    { pattern: /\/[^\/]+\/[^\/]+\/blob\//i, page_type: "documentation_page", confidence: 85 },
  ]
}

// Stack Overflow
stackoverflow: {
  domains: /stackoverflow\.com|stackexchange\.com/i,
  patterns: [
    { pattern: /\/questions\/\d+/i, page_type: "qna_page", confidence: 95 },
    { pattern: /\/users\/\d+/i, page_type: "profile_page", confidence: 90 },
    { pattern: /\/tags\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
  ]
}
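
One plausible way to apply platform definitions like these is a two-step match: select the platform by hostname, then return the first of its patterns that matches the path. The sketch below assumes that flow. `matchPlatform` is a hypothetical function, and testing against path plus query string (rather than the full URL) is an assumption about what the engine passes to each regex.

```javascript
// Two platform entries copied from the listings above (pattern lists trimmed).
const PLATFORMS = {
  youtube: {
    domains: /youtube\.com|youtu\.be/i,
    patterns: [
      { pattern: /\/watch\?v=[A-Za-z0-9_-]+/i, page_type: "video_page", confidence: 95 },
      { pattern: /\/shorts\/[^\/]+/i, page_type: "video_page", confidence: 95 },
    ],
  },
  reddit: {
    domains: /reddit\.com/i,
    patterns: [
      { pattern: /\/r\/[^\/]+\/?$/i, page_type: "category_index_page", confidence: 95 },
      { pattern: /\/r\/[^\/]+\/comments\//i, page_type: "forum_thread", confidence: 95 },
    ],
  },
};

// Hypothetical matcher: pick the platform by hostname, then the first pattern hit.
function matchPlatform(url) {
  const { hostname, pathname, search } = new URL(url);
  for (const [platform, entry] of Object.entries(PLATFORMS)) {
    if (!entry.domains.test(hostname)) continue;
    const target = pathname + search; // the query string is needed for /watch?v=...
    const hit = entry.patterns.find((p) => p.pattern.test(target));
    return hit ? { platform, page_type: hit.page_type, confidence: hit.confidence } : null;
  }
  return null; // not a known platform
}

console.log(matchPlatform("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
console.log(matchPlatform("https://www.reddit.com/r/programming/comments/abc123/title"));
```

Because the `domains` regexes are unanchored, a hostname such as `notreddit.com` would also pass the domain check; anchoring the regex or comparing against a parsed registrable domain is a stricter option.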

Priority Rules

Rule Evaluation Order

Rules are evaluated in strict priority order. Lower priority numbers are checked first:

| Priority | Rule Type | Confidence |
|---|---|---|
| N/A | Owned domain match | 100% |
| N/A | Domain database lookup | 95% |
| N/A | TLD rules (.gov, .edu) | 95% |
| N/A | Platform detection | 85% |
| 1-10 | Homepage patterns | 95-99% |
| 11-19 | Profile patterns | 80-90% |
| 20-29 | Editorial patterns | 75-85% |
| 30-50 | Commercial patterns | 75-95% |
| 51-70 | Documentation patterns | 75-85% |
| 71-100 | Utility patterns | 70-95% |
| 100+ | Fallback patterns | 70-80% |
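
As a rough sketch of priority-ordered evaluation, the patterns can be sorted once by ascending priority and scanned for the first match. The regexes and the blog_post priority of 21 come from this page; the priority of 35 for pricing is an assumed value inside the commercial range, and `firstMatch` is a hypothetical helper.

```javascript
// Patterns from this page. Priority 21 is documented for blog_post;
// priority 35 for pricing is an assumption within the 30-50 commercial range.
const PATTERNS = [
  { name: "pricing", pattern: /\/(pricing|plans|packages)\/?$/i, page_type: "pricing_page", confidence: 90, priority: 35 },
  { name: "blog_post", pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, page_type: "blog_post", confidence: 75, priority: 21 },
];

// Sort a copy once (ascending priority), so lower numbers are checked first.
const BY_PRIORITY = [...PATTERNS].sort((a, b) => a.priority - b.priority);

// Returns the first pattern that matches the path, or null if none match.
function firstMatch(path) {
  return BY_PRIORITY.find((p) => p.pattern.test(path)) || null;
}

console.log(firstMatch("/blog/pricing").name);
console.log(firstMatch("/pricing").name);
```

The path `/blog/pricing` matches both regexes, and the lower priority number decides the outcome in this sketch.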

Conflict Resolution

When multiple patterns could match:

  1. Platform-specific patterns win over generic patterns for known domains
  2. Higher confidence patterns take precedence within the same category
  3. Lower priority number wins when confidence is equal
  4. More specific patterns (longer regex) generally have lower priority numbers

// Example: URL "/blog/my-article" on medium.com

// Platform pattern (checked first for medium.com domain)
medium: { pattern: /\/[^\/]+$/i, page_type: "blog_post", confidence: 90 }

// Generic pattern (would be checked if not Medium)
{ pattern: /\/(blog)\/[^\/]+\/?$/i, page_type: "blog_post", confidence: 75 }

// Result: Platform pattern wins with confidence 90
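
The first three resolution rules can be expressed as a comparator over candidate matches. This is a simplified sketch under assumptions: the candidate shape (`platform`, `confidence`, `priority`) and the `resolve` helper are hypothetical, candidates without a priority are treated as lowest priority, and the "within the same category" qualifier of rule 2 is ignored for brevity.

```javascript
// Candidates without a priority (e.g. platform patterns) sort last on rule 3.
const priorityOf = (c) => (c.priority === undefined ? Number.MAX_SAFE_INTEGER : c.priority);

// Hypothetical candidate shape: { page_type, confidence, priority, platform }.
function compareCandidates(a, b) {
  if (a.platform !== b.platform) return a.platform ? -1 : 1; // rule 1: platform first
  if (a.confidence !== b.confidence) return b.confidence - a.confidence; // rule 2: higher confidence
  return priorityOf(a) - priorityOf(b); // rule 3: lower priority number
}

// Returns the winning candidate, or null for an empty list.
function resolve(candidates) {
  return candidates.length ? [...candidates].sort(compareCandidates)[0] : null;
}

const winner = resolve([
  { page_type: "blog_post", confidence: 75, priority: 21, platform: false },
  { page_type: "blog_post", confidence: 90, platform: true },
]);
console.log(winner.confidence);
```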

Structural Type Validation

The rules engine enforces parent-child relationships between structural_type and page_type:

// Valid combinations:
// structural_type: "article" -> page_type: "blog_post", "news_article", "press_release"
// structural_type: "listing" -> page_type: "category_page", "search_results_page"
// structural_type: "detail" -> page_type: "product_page", "profile_page"
// structural_type: "thread" -> page_type: "forum_thread", "qna_page"

// If page_type doesn't match structural_type, it's cleared:
if (!isPageTypeValidForStructural(pageType, structuralType)) {
  result.page_type = { value: null, confidence: 0, source: null };
}
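
A minimal version of this check could be a lookup table keyed by structural type. The mapping below is copied from the valid combinations listed above; the function body is a guess at how `isPageTypeValidForStructural` might work, `enforceStructuralType` is a hypothetical wrapper, and treating unknown structural types as invalid is an assumption.

```javascript
// Mapping copied from the valid combinations listed above.
const VALID_PAGE_TYPES = {
  article: ["blog_post", "news_article", "press_release"],
  listing: ["category_page", "search_results_page"],
  detail: ["product_page", "profile_page"],
  thread: ["forum_thread", "qna_page"],
};

// Hypothetical implementation; unknown structural types count as invalid here.
function isPageTypeValidForStructural(pageType, structuralType) {
  const allowed = VALID_PAGE_TYPES[structuralType];
  return Array.isArray(allowed) && allowed.includes(pageType);
}

// Returns a copy of the result with page_type cleared when the pair is invalid.
function enforceStructuralType(result) {
  const pageType = result.page_type.value;
  const structuralType = result.structural_type.value;
  if (!isPageTypeValidForStructural(pageType, structuralType)) {
    return { ...result, page_type: { value: null, confidence: 0, source: null } };
  }
  return result;
}

const cleared = enforceStructuralType({
  structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" },
  page_type: { value: "product_page", confidence: 80, source: "url_path_pattern" },
});
console.log(cleared.page_type.value);
```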

Adding New Patterns

Step 1: Identify Pattern Need

Analyze URLs that are being misclassified or sent to LLM unnecessarily.

Step 2: Choose Location

| Pattern Type | File Location |
|---|---|
| Generic path pattern | url-patterns/generic-patterns.js |
| Platform-specific | url-patterns/platforms/{category}.js |
| New platform | Create in url-patterns/platforms/ |

Step 3: Write the Pattern

{
  name: "podcast_episode", // Descriptive name
  pattern: /\/podcasts?\/[^\/]+/i, // Test thoroughly!
  page_type: "podcast_episode", // Valid page_type from constants
  page_type_category: "editorial", // Category grouping
  confidence: 88, // Be conservative
  priority: 119, // Higher = checked later
}

Step 4: Add Pattern

For generic patterns, add to the array in generic-patterns.js:

export const GENERIC_PATH_PATTERNS = [
  // ... existing patterns ...

  // New podcast patterns
  {
    name: "podcast_episode",
    pattern: /\/podcasts?\/[^\/]+/i,
    page_type: "podcast_episode",
    page_type_category: "editorial",
    confidence: 88,
    priority: 119,
  },
];

For platform-specific, add to the platform's patterns array:

spotify: {
  domains: /spotify\.com|open\.spotify\.com/i,
  patterns: [
    // ... existing ...
    {
      pattern: /\/episode\/[A-Za-z0-9]+/i,
      page_type: "podcast_episode",
      page_type_category: "editorial",
      confidence: 95,
    },
  ]
}

Step 5: Validate

Patterns are validated on module load against PAGE_TYPES constants:

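// validation.js
// Import path assumed from the Source Files table; adjust to the actual layout.
import { PAGE_TYPES } from "../../lib/classification/classification-constants.js";

export function validatePatterns(patterns, name) {
  for (const p of patterns) {
    // Guard against a missing page_type before calling toUpperCase()
    const key = typeof p.page_type === "string" ? p.page_type.toUpperCase() : "";
    if (!PAGE_TYPES[key]) {
      throw new Error(`Invalid page_type: ${p.page_type} in ${name}`);
    }
  }
}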

Testing Patterns

Manual Testing

Use the classification endpoint to test individual URLs:

curl -X POST https://api.example.com/admin/classify-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog/my-article"}'

Batch Testing

Test patterns against real URLs from your database:

import { classifyWithRules } from "./classifier-rules-engine.js";

const testUrls = [
  "https://medium.com/@user/article-title-abc123",
  "https://github.com/owner/repo",
  "https://example.com/blog/my-post",
];

for (const url of testUrls) {
  const result = await classifyWithRules({
    url,
    domain: new URL(url).hostname,
  });
  console.log(url);
  console.log(" page_type:", result.page_type);
  console.log(" rules_applied:", result.rules_applied);
}

Regex Testing Tips

// Test your regex patterns:
const pattern = /\/(blog|blogs)\/[^\/]+\/?$/i;

// Should match:
console.log(pattern.test("/blog/my-article")); // true
console.log(pattern.test("/blogs/my-article/")); // true
console.log(pattern.test("/BLOG/Article-Title")); // true (case insensitive)

// Should NOT match:
console.log(pattern.test("/blog/")); // false (no slug)
console.log(pattern.test("/blog/cat/article")); // false (nested path)
console.log(pattern.test("/myblog/article")); // false (wrong prefix)

Unit Test Example

import { getGenericPathMatch } from "./url-patterns/index.js";

describe("Generic Path Patterns", () => {
  it("should match blog posts", () => {
    const result = getGenericPathMatch("https://example.com/blog/my-article");
    expect(result.page_type).toBe("blog_post");
    expect(result.confidence).toBeGreaterThanOrEqual(70);
  });

  it("should match pricing pages", () => {
    const result = getGenericPathMatch("https://example.com/pricing");
    expect(result.page_type).toBe("pricing_page");
    expect(result.confidence).toBe(90);
  });
});

Performance

Lazy Loading

The domain database (858KB) is lazy-loaded to avoid a cold-start penalty:

// Not loaded on module initialization
let domainDbModule = null;

// Only loaded when first classification runs
async function getDomainDb() {
  if (!domainDbModule) {
    domainDbModule = await import("../../data/domain-database.js");
  }
  return domainDbModule;
}

Pattern Compilation

Regex patterns are compiled once at module load time, not per-request:

// Patterns are RegExp objects, not strings
export const GENERIC_PATH_PATTERNS = [
  {
    pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, // Pre-compiled
    // ...
  }
];

Early Exit

The classifier exits early when high confidence is reached:

// When the lowest core-dimension confidence is at least 80 (for example,
// after a 95% domain database match), skip the expensive stages
if (minCoreConfidence >= 80 && hasCoreClassifications) {
  result.needs_vectorize = false;
  result.needs_llm = false;
}
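
The snippet does not show how the two inputs to that condition are derived. One possible derivation is sketched below, with loud caveats: which dimensions count as "core" is an assumption (three are chosen here), and `coreConfidences` and `shouldExitEarly` are hypothetical helpers. Only the threshold of 80 comes from the snippet above.

```javascript
// Assumption: these three dimensions are the "core" ones; the real list may differ.
const CORE_DIMENSIONS = ["domain_type", "tier1_type", "structural_type"];
const EARLY_EXIT_THRESHOLD = 80; // threshold taken from the snippet above

// Confidence per core dimension; a missing or empty dimension counts as 0.
function coreConfidences(result) {
  return CORE_DIMENSIONS.map((name) => {
    const dim = result[name];
    return dim && dim.value != null ? dim.confidence : 0;
  });
}

// True when every core dimension is classified and the weakest one clears the threshold.
function shouldExitEarly(result) {
  const confidences = coreConfidences(result);
  const hasCoreClassifications = confidences.every((c) => c > 0);
  const minCoreConfidence = Math.min(...confidences);
  return hasCoreClassifications && minCoreConfidence >= EARLY_EXIT_THRESHOLD;
}

const sample = {
  domain_type: { value: "saas_product", confidence: 95 },
  tier1_type: { value: "platform", confidence: 95 },
  structural_type: { value: "article", confidence: 85 },
  page_type: { value: "blog_post", confidence: 75 },
};
console.log(shouldExitEarly(sample));
```

Using the minimum rather than the maximum means a single weak core dimension is enough to keep the later stages in play.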

Pattern Stats

Track pattern coverage:

import { PATTERN_STATS } from "./url-patterns/index.js";

console.log(PATTERN_STATS);
// {
//   generic: 180,
//   platforms: 45,
//   platformPatterns: 520,
//   platformBreakdown: {
//     social: 12,
//     publishing: 8,
//     commerce: 10,
//     devTools: 6,
//     media: 5,
//     jobs: 4,
//     local: 5,
//     other: 5
//   }
// }

Common Pattern Examples

Most Useful Generic Patterns

// Homepage (99% confidence)
/^https?:\/\/[^\/]+\/?$/i
// Matches: https://example.com, https://example.com/

// Blog post (75% confidence)
/\/(blog|blogs)\/[^\/]+\/?$/i
// Matches: /blog/my-article, /blogs/great-post/

// Product page (80% confidence)
/\/(products?)\/[^\/]+\/?$/i
// Matches: /product/widget-123, /products/blue-shoes

// Documentation (80% confidence)
/\/(docs?|documentation)\/[^\/]+/i
// Matches: /docs/getting-started, /documentation/api/auth

// Pricing page (90% confidence)
/\/(pricing|plans|packages)\/?$/i
// Matches: /pricing, /plans/, /packages

// Legal pages (95% confidence)
/\/(privacy|privacy-policy|terms|tos)/i
// Matches: /privacy, /privacy-policy, /terms-of-service

// Profile page (85-90% confidence)
/\/@[^\/]+\/?$/i
// Matches: /@username, /@john-doe/

/\/(user|users|member|members)\/[^\/]+/i
// Matches: /user/john, /members/12345

// Date-based news (85% confidence)
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// Matches: /2024/01/15/breaking-news

// Listicle (85-90% confidence)
/\/\d+-best[-\/]/i
// Matches: /10-best-tools/, /5-best-practices

// Location page (80-85% confidence)
/\/locations?\/[^\/]+/i
// Matches: /location/new-york, /locations/california

// State-based location (85% confidence)
/\/[a-z-]+-(?:ca|tx|ny|fl|wa)\/?$/i
// Matches: /houston-tx/, /seattle-wa/

Platform-Specific Pattern Examples

// YouTube video (95%)
/youtube\.com\/watch\?v=[A-Za-z0-9_-]+/i
// Matches: youtube.com/watch?v=dQw4w9WgXcQ

// GitHub repo (90%)
/github\.com\/[^\/]+\/[^\/]+\/?$/i
// Matches: github.com/owner/repo

// Reddit thread (95%)
/reddit\.com\/r\/[^\/]+\/comments\//i
// Matches: reddit.com/r/programming/comments/abc123/title

// Twitter/X post (95%)
/(twitter|x)\.com\/[^\/]+\/status\/\d+/i
// Matches: twitter.com/user/status/123456789

// Stack Overflow question (95%)
/stackoverflow\.com\/questions\/\d+/i
// Matches: stackoverflow.com/questions/12345/how-to-do-x

// Medium article (95%)
/medium\.com\/@[^\/]+\/[^\/]+-[a-f0-9]+$/i
// Matches: medium.com/@user/article-title-abc123def

// Amazon product (95%)
/amazon\.com\/dp\/[A-Z0-9]+/i
// Matches: amazon.com/dp/B08N5WRWNW

Source Files

| File | Purpose |
|---|---|
| src/lib/classification/classifier-rules-engine.js | Main rules engine |
| src/lib/classification/classification-constants.js | Enums, mappings, helpers |
| src/data/domain-database.js | Generated domain classifications |
| src/data/url-patterns/index.js | Pattern aggregator |
| src/data/url-patterns/generic-patterns.js | Platform-agnostic patterns |
| src/data/url-patterns/platforms/*.js | Platform-specific patterns |
| src/data/url-patterns/validation.js | Pattern validation |