Rules Engine Documentation
The Rules Engine is Stage 1 of the classification pipeline - a fast, free classification system based on pattern matching. It classifies URLs by analyzing domains, TLDs, URL paths, and known patterns before falling back to more expensive stages (Vectorize, LLM).
Overview
How Pattern Matching Works
The rules engine processes URLs through a series of cascading rules, each with associated confidence scores. Rules are evaluated in priority order, and the first high-confidence match typically wins.
URL Input
|
v
+------------------+
| 1. Owned Domain | --> 100% confidence if URL matches target_domain
+------------------+
|
v
+------------------+
| 2. Domain DB | --> 95% confidence (4,400+ curated domains)
+------------------+
|
v
+------------------+
| 3. TLD Rules | --> 95% confidence (.gov, .edu, etc.)
+------------------+
|
v
+------------------+
| 4. Platform | --> 85% confidence (YouTube, Reddit, etc.)
+------------------+
|
v
+------------------+
| 5. URL Patterns | --> 70-90% confidence (path-based detection)
+------------------+
|
v
+------------------+
| 6. Spam Signals | --> 70-85% confidence (risk detection)
+------------------+
|
v
Classification Result (or pass to Vectorize/LLM)
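The cascade above can be sketched as an ordered list of stage functions, where the first stage that returns a result ends the search. This is a minimal illustration rather than the engine's actual code: `runCascade` and the two toy stages are hypothetical, and only the stage order and confidence values come from the diagram.

```javascript
// Hypothetical sketch: run stages in order and stop at the first match.
function runCascade(stages, input) {
  for (const stage of stages) {
    const match = stage.run(input);
    if (match) {
      return { ...match, source: stage.name };
    }
  }
  return null; // nothing matched: the caller would hand the URL to Vectorize/LLM
}

// Two toy stages (assumed shapes), listed in the same order as the diagram.
const stages = [
  {
    name: "owned_domain",
    run: ({ hostname, targetDomain }) =>
      hostname === targetDomain ? { confidence: 100 } : null,
  },
  {
    name: "tld_rules",
    run: ({ hostname }) =>
      /\.(gov|edu)$/i.test(hostname) ? { confidence: 95 } : null,
  },
];

const owned = runCascade(stages, { hostname: "example.com", targetDomain: "example.com" });
const gov = runCascade(stages, { hostname: "irs.gov", targetDomain: "example.com" });
const none = runCascade(stages, { hostname: "unknown.io", targetDomain: "example.com" });
```

Keeping the stages in an array makes the evaluation order explicit and easy to reorder.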
Classification Output Structure
Each dimension is classified independently with its own confidence:
{
domain_type: { value: "saas_product", confidence: 95, source: "domain_database_v2" },
tier1_type: { value: "platform", confidence: 95, source: "domain_database_v2" },
structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" },
channel_bucket: { value: "pr_earned_media", confidence: 80, source: "derived_from_domain_type" },
page_type: { value: "blog_post", confidence: 75, source: "url_path_pattern" },
quality_tier: { value: "tier_1", confidence: 100, source: "domain_rank" },
modifiers: ["contextual_link"],
classification_source: "rule",
final_confidence: 95, // Max of all dimensions
rules_applied: ["domain_database_v2", "url_path_pattern"],
needs_vectorize: false,
needs_llm: false
}
Domain Database
The domain database contains 4,400+ curated domain classifications, providing high-confidence (95%) instant classification for known domains.
Database Structure
Location: packages/api/src/data/domain-database.js
Generated from: classification-data/domains/_master.csv
export const DOMAIN_DATABASE = {
"salesforce.com": {
domain_type: "saas_product",
tier1_type: "platform",
industry: "crm",
subcategory: "enterprise_crm",
status: "covered",
source: "v2_master",
},
"nytimes.com": {
domain_type: "news_publisher",
tier1_type: "information",
industry: "news",
subcategory: "general",
status: "covered",
source: "v2_master",
},
// ... 4,400+ more domains
};
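A lookup against this structure would probably normalize the hostname first. The sketch below assumes, without confirmation from the source, that keys are bare domains, so it strips a leading `www.` and walks up parent domains until it finds an entry. `lookupDomain` and the sample `db` are hypothetical.

```javascript
// Hypothetical helper: find the closest database entry for a hostname.
function lookupDomain(db, hostname) {
  let candidate = hostname.toLowerCase().replace(/^www\./, "");
  // Try "a.b.example.com", then "b.example.com", then "example.com".
  while (candidate.includes(".")) {
    if (Object.prototype.hasOwnProperty.call(db, candidate)) {
      return { domain: candidate, ...db[candidate], confidence: 95 };
    }
    candidate = candidate.slice(candidate.indexOf(".") + 1);
  }
  return null;
}

// Sample data shaped like DOMAIN_DATABASE entries.
const db = {
  "salesforce.com": { domain_type: "saas_product", tier1_type: "platform" },
};

const hit = lookupDomain(db, "www.Salesforce.com");
const subdomainHit = lookupDomain(db, "help.salesforce.com");
const miss = lookupDomain(db, "unknown-site.io");
```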
Domain Type Distribution
| Domain Type | Count | Examples |
|---|---|---|
| saas_product | 1,639 | Slack, Notion, Salesforce |
| ecommerce_store | 536 | Nike.com, Warby Parker |
| government_site | 257 | IRS.gov, CDC.gov |
| travel_booking | 256 | Delta, Marriott |
| financial_institution | 245 | Chase, PayPal |
| content_publisher | 200+ | BuzzFeed, HuffPost |
| healthcare_institution | 150+ | Mayo Clinic, Cleveland Clinic |
Lazy Loading
The domain database is lazy-loaded so that its 858KB module does not add to cold-start time:
let domainDbModule = null;
async function getDomainDb() {
if (!domainDbModule) {
domainDbModule = await import("../../data/domain-database.js");
}
return domainDbModule;
}
Adding Domains to Database
- Edit classification-data/domains/_master.csv
- Run npm run build:domains
- domain-database.js is regenerated automatically by the build
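The source does not show the build script. As a rough, hypothetical illustration of the CSV-to-module step, the sketch below assumes that `_master.csv` has a header row with a `domain` column plus the fields seen in the database entries, and that no cell contains a quoted comma. `csvToDomainDatabase` is an invented name.

```javascript
// Hypothetical sketch: turn CSV text into a DOMAIN_DATABASE-shaped object.
// Assumes a header row, a "domain" column, and no quoted or escaped commas.
function csvToDomainDatabase(csvText) {
  const [headerLine, ...rows] = csvText.trim().split(/\r?\n/);
  const headers = headerLine.split(",").map((h) => h.trim());
  const database = {};
  for (const row of rows) {
    if (!row.trim()) continue; // skip blank lines
    const cells = row.split(",").map((c) => c.trim());
    const record = Object.fromEntries(headers.map((h, i) => [h, cells[i] ?? ""]));
    const { domain, ...fields } = record;
    database[domain.toLowerCase()] = fields; // key by lower-cased domain
  }
  return database;
}

const sampleCsv = [
  "domain,domain_type,tier1_type,industry,subcategory",
  "salesforce.com,saas_product,platform,crm,enterprise_crm",
  "NYTimes.com,news_publisher,information,news,general",
].join("\n");

const generated = csvToDomainDatabase(sampleCsv);
```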
URL Patterns
Pattern Syntax
URL patterns use JavaScript regular expressions with specific conventions:
{
name: "blog_post", // Human-readable identifier
pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, // Regex pattern
page_type: "blog_post", // Classification result
page_type_category: "editorial", // Category grouping
confidence: 75, // 0-100 score
priority: 21, // Lower = checked first
}
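To show how the `priority` and `confidence` fields might work together, the sketch below sorts pattern objects by ascending priority and returns the first one whose regex matches the URL path. `matchPath` is hypothetical, the priority of 20 on the index pattern is assumed, and the real engine may resolve matches differently.

```javascript
// Sample pattern objects in the documented shape (priority 20 is an assumption).
const samplePatterns = [
  { name: "blog_post", pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, page_type: "blog_post", confidence: 75, priority: 21 },
  { name: "blog_index", pattern: /\/(blog|blogs)\/?$/i, page_type: "category_index_page", confidence: 80, priority: 20 },
];

// Hypothetical matcher: lowest priority number is tested first.
function matchPath(patterns, url) {
  const path = new URL(url).pathname;
  const ordered = [...patterns].sort((a, b) => a.priority - b.priority);
  for (const p of ordered) {
    if (p.pattern.test(path)) {
      return { page_type: p.page_type, confidence: p.confidence, source: p.name };
    }
  }
  return null; // no generic pattern applies
}

const post = matchPath(samplePatterns, "https://example.com/blog/my-post/");
const index = matchPath(samplePatterns, "https://example.com/blog/");
const nothing = matchPath(samplePatterns, "https://example.com/pricing");
```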
Pattern Components
| Component | Description | Example |
|---|---|---|
| ^ | Start of string | ^https?:\/\/ |
| $ | End of string | \/?$/ |
| \/ | Literal slash | \/blog\/ |
| [^\/]+ | Any chars except slash | /blog/[^\/]+/ |
| \d+ | One or more digits | /post/\d+ |
| [a-z]{2} | Exactly 2 letters | /en/, /fr/ |
| (?:...) | Non-capturing group | `(?:blog |
| (...) | Capturing group | Avoid in patterns |
| i | Case insensitive flag | /BLOG/i matches /blog/ |
Common Pattern Examples
// Homepage detection
/^https?:\/\/[^\/]+\/?$/i
// Matches: https://example.com, https://example.com/
// Blog post with slug
/\/(blog|blogs)\/[^\/]+\/?$/i
// Matches: /blog/my-post/, /blogs/article-title
// Date-based article (news sites)
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// Matches: /2024/01/15/breaking-news
// Profile page with @ symbol
/\/@[^\/]+\/?$/i
// Matches: /@username, /@john-doe/
// Product page with ID
/\/(products?)\/[^\/]+\/?$/i
// Matches: /product/widget-123, /products/blue-shoes/
// Listicle pattern
/\/\d+-best[-\/]/i
// Matches: /10-best-tools/, /5-best-practices-for-...
// State-based location page
/\/[a-z-]+-(?:ca|tx|ny|fl)\/?$/i
// Matches: /houston-tx/, /los-angeles-ca/
Pattern Categories
Generic Path Patterns
Location: packages/api/src/data/url-patterns/generic-patterns.js
Platform-agnostic patterns that work across any website:
Editorial Patterns (priority 20-29)
// Blog/Article index
/\/(blog|blogs|articles?|news|posts?|stories)\/?$/i
// -> category_index_page (confidence: 80)
// Blog post with slug
/\/(blog|blogs)\/[^\/]+\/?$/i
// -> blog_post (confidence: 75)
// Date-based articles
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// -> news_article (confidence: 85)
// Press releases
/\/(press-releases?|news-releases?|newsroom)\/[^\/]+/i
// -> press_release (confidence: 85)
// Case studies
/\/(case-stud(y|ies)|success-stor(y|ies))\/[^\/]+/i
// -> case_study_page (confidence: 80)
Listicle Patterns (priority 29)
// Number-prefixed listicles
/\/top-\d+[-\/]/i // -> listicle (90)
/\/best-\d+[-\/]/i // -> listicle (90)
/\/\d+-best[-\/]/i // -> listicle (90)
/\/\d+-ways[-\/]/i // -> listicle (85)
/\/\d+-tips[-\/]/i // -> listicle (85)
/\/\d+-reasons[-\/]/i // -> listicle (85)
/\/\d+-tools[-\/]/i // -> listicle (80)
// Comparison patterns
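// (a leading \/ directly before -vs- would only match paths containing "/-vs-")
/\/[^\/]+-vs-[^\/]+/i // -> comparison_page (85)
/\/[^\/]+-versus-[^\/]+/i // -> comparison_page (85)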
Commercial Patterns (priority 31-50)
// Product pages
/\/(products?)\/[^\/]+\/?$/i
// -> product_page (confidence: 80)
// Category pages
/\/(categor(y|ies)|collections?)\/[^\/]+/i
// -> category_page (confidence: 75-80)
// Pricing
/\/(pricing|plans|packages)\/?$/i
// -> pricing_page (confidence: 90)
// Checkout
/\/(cart|basket|checkout)\/?$/i
// -> checkout_page (confidence: 95)
// Landing pages
/\/(get-started|demo|request-demo|contact-sales)\/?$/i
// -> landing_page (confidence: 85)
Documentation Patterns (priority 51-65)
// Docs index
/\/(docs?|documentation)\/?$/i
// -> category_index_page (confidence: 85)
// Doc pages
/\/(docs?|documentation)\/[^\/]+/i
// -> documentation_page (confidence: 80)
// API reference
/\/(api-?(docs|reference)?)\/[^\/]+/i
// -> api_reference_page (confidence: 85)
// Guides/tutorials
/\/(guides?|tutorials?)\/[^\/]+/i
// -> howto_article (confidence: 80)
// FAQ
/\/(faq|frequently-asked)\/?/i
// -> faq_page (confidence: 85-90)
Utility Patterns (priority 70-95)
// About pages
/\/(about|about-us|company)\/?$/i
// -> about_page (confidence: 85)
// Contact
/\/(contact|contact-us)\/?$/i
// -> contact_page (confidence: 90)
// Careers
/\/(careers|jobs)\/?$/i
// -> jobs_listing (confidence: 90)
// Legal pages
/\/(privacy|privacy-policy)/i // -> legal_privacy_page (95)
/\/(terms|tos)/i // -> legal_terms_page (95)
// Auth pages
/\/(login|signin)\/?$/i // -> login_page (90)
/\/(signup|register)\/?$/i // -> signup_page (90)
UGC Patterns (priority 11-85)
// Profile pages
/\/@[^\/]+\/?$/i // -> profile_page (90)
/\/(user|users|member|members)\/[^\/]+/i // -> profile_page (85)
/\/(author|authors)\/[^\/]+/i // -> profile_page (85)
// Forums
/\/(forum|forums|community)\/?$/i // -> category_index_page (80)
/\/(thread|topic|discussion)\/[^\/]+/i // -> forum_thread (80)
Platform-Specific Patterns
Location: packages/api/src/data/url-patterns/platforms/
Higher-confidence patterns for known platforms:
Social Platforms (social.js)
// YouTube
youtube: {
domains: /youtube\.com|youtu\.be/i,
patterns: [
{ pattern: /\/watch\?v=[A-Za-z0-9_-]+/i, page_type: "video_page", confidence: 95 },
{ pattern: /\/@[^\/]+\/?$/i, page_type: "profile_page", confidence: 95 },
{ pattern: /\/channel\/[^\/]+/i, page_type: "profile_page", confidence: 95 },
{ pattern: /\/playlist\?list=/i, page_type: "playlist_page", confidence: 95 },
{ pattern: /\/shorts\/[^\/]+/i, page_type: "video_page", confidence: 95 },
]
}
// Reddit
reddit: {
domains: /reddit\.com/i,
patterns: [
{ pattern: /\/r\/[^\/]+\/?$/i, page_type: "category_index_page", confidence: 95 },
{ pattern: /\/r\/[^\/]+\/comments\//i, page_type: "forum_thread", confidence: 95 },
{ pattern: /\/user\/[^\/]+/i, page_type: "profile_page", confidence: 90 },
]
}
// Twitter/X
twitter: {
domains: /twitter\.com|x\.com/i,
patterns: [
{ pattern: /\/[^\/]+\/status\/\d+/i, page_type: "social_post", confidence: 95 },
{ pattern: /\/hashtag\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
{ pattern: /\/i\/spaces\/[A-Za-z0-9]+/i, page_type: "video_page", confidence: 90 },
]
}
Publishing Platforms (publishing.js)
// Medium
medium: {
domains: /medium\.com/i,
patterns: [
{ pattern: /\/@[^\/]+\/[^\/]+/i, page_type: "blog_post", confidence: 95 },
{ pattern: /\/tag\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
]
}
// Substack
substack: {
domains: /[^.]+\.substack\.com/i,
patterns: [
{ pattern: /\/p\/[^\/]+/i, page_type: "blog_post", confidence: 95 },
{ pattern: /\/archive/i, page_type: "category_index_page", confidence: 90 },
]
}
E-commerce Platforms (commerce.js)
// Shopify stores
shopify: {
domains: /[^.]+\.myshopify\.com/i,
patterns: [
{ pattern: /\/products\/[^\/]+/i, page_type: "product_page", confidence: 95 },
{ pattern: /\/collections\/[^\/]+/i, page_type: "category_page", confidence: 95 },
]
}
// Amazon
amazon: {
domains: /amazon\.(com|co\.uk|de|fr|es|it|ca|jp)/i,
patterns: [
{ pattern: /\/dp\/[A-Z0-9]+/i, page_type: "product_page", confidence: 95 },
{ pattern: /\/gp\/product\/[A-Z0-9]+/i, page_type: "product_page", confidence: 95 },
{ pattern: /\/s\?/i, page_type: "search_results_page", confidence: 90 },
]
}
Developer Tools (dev-tools.js)
// GitHub
github: {
domains: /github\.com/i,
patterns: [
{ pattern: /\/[^\/]+\/[^\/]+\/?$/i, page_type: "repository_page", confidence: 90 },
{ pattern: /\/[^\/]+\/[^\/]+\/issues\/\d+/i, page_type: "forum_thread", confidence: 95 },
{ pattern: /\/[^\/]+\/[^\/]+\/pull\/\d+/i, page_type: "forum_thread", confidence: 95 },
{ pattern: /\/[^\/]+\/[^\/]+\/blob\//i, page_type: "documentation_page", confidence: 85 },
]
}
// Stack Overflow
stackoverflow: {
domains: /stackoverflow\.com|stackexchange\.com/i,
patterns: [
{ pattern: /\/questions\/\d+/i, page_type: "qna_page", confidence: 95 },
{ pattern: /\/users\/\d+/i, page_type: "profile_page", confidence: 90 },
{ pattern: /\/tags\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
]
}
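Each platform entry pairs a `domains` regex with a list of patterns. A hypothetical dispatcher (`matchPlatform` is not from the source) could first find the platform whose `domains` regex matches the hostname, then test that platform's patterns. The sketch assumes patterns are tested against the path plus query string, which the source does not state.

```javascript
// Two entries copied from the platform definitions above.
const PLATFORMS = {
  youtube: {
    domains: /youtube\.com|youtu\.be/i,
    patterns: [
      { pattern: /\/watch\?v=[A-Za-z0-9_-]+/i, page_type: "video_page", confidence: 95 },
      { pattern: /\/@[^\/]+\/?$/i, page_type: "profile_page", confidence: 95 },
    ],
  },
  reddit: {
    domains: /reddit\.com/i,
    patterns: [
      { pattern: /\/r\/[^\/]+\/comments\//i, page_type: "forum_thread", confidence: 95 },
    ],
  },
};

// Hypothetical dispatcher: pick the platform by hostname, then test its patterns.
function matchPlatform(platforms, url) {
  const { hostname, pathname, search } = new URL(url);
  const target = pathname + search; // keep the query string for /watch?v=...
  for (const [platform, config] of Object.entries(platforms)) {
    if (!config.domains.test(hostname)) continue;
    const hit = config.patterns.find((p) => p.pattern.test(target));
    return hit
      ? { platform, page_type: hit.page_type, confidence: hit.confidence }
      : { platform, page_type: null, confidence: 0 };
  }
  return null; // not a known platform: fall through to generic patterns
}

const video = matchPlatform(PLATFORMS, "https://www.youtube.com/watch?v=dQw4w9WgXcQ");
const thread = matchPlatform(PLATFORMS, "https://www.reddit.com/r/programming/comments/abc123/title");
const other = matchPlatform(PLATFORMS, "https://example.com/watch?v=abc");
```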
Priority Rules
Rule Evaluation Order
Rules are evaluated in strict priority order. Lower priority numbers are checked first:
| Priority | Rule Type | Confidence |
|---|---|---|
| N/A | Owned domain match | 100% |
| N/A | Domain database lookup | 95% |
| N/A | TLD rules (.gov, .edu) | 95% |
| N/A | Platform detection | 85% |
| 1-10 | Homepage patterns | 95-99% |
| 11-19 | Profile patterns | 80-90% |
| 20-29 | Editorial patterns | 75-85% |
| 30-50 | Commercial patterns | 75-95% |
| 51-70 | Documentation patterns | 75-85% |
| 71-100 | Utility patterns | 70-95% |
| 100+ | Fallback patterns | 70-80% |
Conflict Resolution
When multiple patterns could match:
- Platform-specific patterns win over generic patterns for known domains
- Higher confidence patterns take precedence within the same category
- Lower priority number wins when confidence is equal
- More specific patterns (longer regex) generally have lower priority numbers
// Example: URL "/blog/my-article" on medium.com
// Platform pattern (checked first for medium.com domain)
medium: { pattern: /\/[^\/]+$/i, page_type: "blog_post", confidence: 90 }
// Generic pattern (would be checked if not Medium)
{ pattern: /\/(blog)\/[^\/]+\/?$/i, page_type: "blog_post", confidence: 75 }
// Result: Platform pattern wins with confidence 90
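One way to encode this ordering is a comparator over candidate matches. The sketch is an interpretation of the rules listed above, not the engine's code: `pickWinner` and the `platformSpecific` flag are hypothetical, and every candidate is given a numeric `priority` for simplicity.

```javascript
// Hypothetical resolver: sort candidates by the documented precedence rules.
function pickWinner(candidates) {
  if (candidates.length === 0) return null;
  const ranked = [...candidates].sort((a, b) => {
    // 1. Platform-specific matches beat generic ones.
    if (a.platformSpecific !== b.platformSpecific) {
      return a.platformSpecific ? -1 : 1;
    }
    // 2. Higher confidence wins.
    if (a.confidence !== b.confidence) {
      return b.confidence - a.confidence;
    }
    // 3. On a confidence tie, the lower priority number wins.
    return a.priority - b.priority;
  });
  return ranked[0];
}

const winner = pickWinner([
  { name: "generic_blog_post", platformSpecific: false, confidence: 75, priority: 21 },
  { name: "medium_article", platformSpecific: true, confidence: 90, priority: 1 },
]);

const tieBreak = pickWinner([
  { name: "later", platformSpecific: false, confidence: 85, priority: 40 },
  { name: "earlier", platformSpecific: false, confidence: 85, priority: 25 },
]);
```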
Structural Type Validation
The rules engine enforces parent-child relationships between structural_type and page_type:
// Valid combinations:
//   structural_type "article" -> page_type "blog_post", "news_article", "press_release"
//   structural_type "listing" -> page_type "category_page", "search_results_page"
//   structural_type "detail"  -> page_type "product_page", "profile_page"
//   structural_type "thread"  -> page_type "forum_thread", "qna_page"
// If page_type doesn't match structural_type, it's cleared:
if (!isPageTypeValidForStructural(pageType, structuralType)) {
result.page_type = { value: null, confidence: 0, source: null };
}
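The source names `isPageTypeValidForStructural` but does not show its body. A minimal sketch follows, assuming the mapping contains only the four combinations listed above (the real table is likely larger) and that an unknown structural type counts as invalid. `VALID_PAGE_TYPES` and `enforceStructure` are hypothetical names.

```javascript
// Assumed mapping, copied from the valid combinations listed above.
const VALID_PAGE_TYPES = {
  article: ["blog_post", "news_article", "press_release"],
  listing: ["category_page", "search_results_page"],
  detail: ["product_page", "profile_page"],
  thread: ["forum_thread", "qna_page"],
};

function isPageTypeValidForStructural(pageType, structuralType) {
  const allowed = VALID_PAGE_TYPES[structuralType];
  // Unknown structural types are treated as invalid in this sketch.
  return Array.isArray(allowed) && allowed.includes(pageType);
}

// Returns a copy of the result with page_type cleared when the pair is invalid.
function enforceStructure(result) {
  const valid = isPageTypeValidForStructural(
    result.page_type.value,
    result.structural_type.value
  );
  if (valid) return result;
  return { ...result, page_type: { value: null, confidence: 0, source: null } };
}

const kept = enforceStructure({
  structural_type: { value: "article", confidence: 85 },
  page_type: { value: "blog_post", confidence: 75, source: "url_path_pattern" },
});
const cleared = enforceStructure({
  structural_type: { value: "listing", confidence: 85 },
  page_type: { value: "blog_post", confidence: 75, source: "url_path_pattern" },
});
```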
Adding New Patterns
Step 1: Identify Pattern Need
Analyze URLs that are being misclassified or sent to LLM unnecessarily.
Step 2: Choose Location
| Pattern Type | File Location |
|---|---|
| Generic path pattern | url-patterns/generic-patterns.js |
| Platform-specific | url-patterns/platforms/{category}.js |
| New platform | Create in url-patterns/platforms/ |
Step 3: Write the Pattern
{
name: "podcast_episode", // Descriptive name
pattern: /\/podcasts?\/[^\/]+/i, // Test thoroughly!
page_type: "podcast_episode", // Valid page_type from constants
page_type_category: "editorial", // Category grouping
confidence: 88, // Be conservative
priority: 119, // Higher = checked later
}
Step 4: Add Pattern
For generic patterns, add to the array in generic-patterns.js:
export const GENERIC_PATH_PATTERNS = [
// ... existing patterns ...
// New podcast patterns
{
name: "podcast_episode",
pattern: /\/podcasts?\/[^\/]+/i,
page_type: "podcast_episode",
page_type_category: "editorial",
confidence: 88,
priority: 119,
},
];
For platform-specific, add to the platform's patterns array:
spotify: {
domains: /spotify\.com|open\.spotify\.com/i,
patterns: [
// ... existing ...
{
pattern: /\/episode\/[A-Za-z0-9]+/i,
page_type: "podcast_episode",
page_type_category: "editorial",
confidence: 95,
},
]
}
Step 5: Validate
Patterns are validated on module load against PAGE_TYPES constants:
// validation.js
export function validatePatterns(patterns, name) {
for (const p of patterns) {
if (!PAGE_TYPES[p.page_type.toUpperCase()]) {
throw new Error(`Invalid page_type: ${p.page_type} in ${name}`);
}
}
}
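To see what this check means in practice, the sketch below runs the same validator against a stand-in `PAGE_TYPES` object. The two-entry constant is an assumption (the lookup via `toUpperCase()` only implies upper-cased keys), and `tryValidate` is a hypothetical wrapper that captures the error message.

```javascript
// Assumed shape: keys are upper-cased page_type names.
const PAGE_TYPES = { BLOG_POST: "blog_post", PRICING_PAGE: "pricing_page" };

// Same logic as the validator shown above.
function validatePatterns(patterns, name) {
  for (const p of patterns) {
    if (!PAGE_TYPES[p.page_type.toUpperCase()]) {
      throw new Error(`Invalid page_type: ${p.page_type} in ${name}`);
    }
  }
}

// Hypothetical wrapper: report the outcome instead of throwing.
function tryValidate(patterns, name) {
  try {
    validatePatterns(patterns, name);
    return "ok";
  } catch (err) {
    return err.message;
  }
}

const good = tryValidate([{ page_type: "blog_post" }], "generic");
const bad = tryValidate([{ page_type: "podcast_episode" }], "generic");
```

If a new page type such as podcast_episode is not yet defined in the constants, a check like this would fail at module load, so the constant would need to be added alongside the pattern.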
Testing Patterns
Manual Testing
Use the classification endpoint to test individual URLs:
curl -X POST https://api.example.com/admin/classify-url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/blog/my-article"}'
Batch Testing
Test patterns against real URLs from your database:
import { classifyWithRules } from "./classifier-rules-engine.js";
const testUrls = [
"https://medium.com/@user/article-title-abc123",
"https://github.com/owner/repo",
"https://example.com/blog/my-post",
];
for (const url of testUrls) {
const result = await classifyWithRules({
url,
domain: new URL(url).hostname,
});
console.log(url);
console.log(" page_type:", result.page_type);
console.log(" rules_applied:", result.rules_applied);
}
Regex Testing Tips
// Test your regex patterns:
const pattern = /\/(blog|blogs)\/[^\/]+\/?$/i;
// Should match:
console.log(pattern.test("/blog/my-article")); // true
console.log(pattern.test("/blogs/my-article/")); // true
console.log(pattern.test("/BLOG/Article-Title")); // true (case insensitive)
// Should NOT match:
console.log(pattern.test("/blog/")); // false (no slug)
console.log(pattern.test("/blog/cat/article")); // false (nested path)
console.log(pattern.test("/myblog/article")); // false (wrong prefix)
Unit Test Example
import { getGenericPathMatch } from "./url-patterns/index.js";
describe("Generic Path Patterns", () => {
it("should match blog posts", () => {
const result = getGenericPathMatch("https://example.com/blog/my-article");
expect(result.page_type).toBe("blog_post");
expect(result.confidence).toBeGreaterThanOrEqual(70);
});
it("should match pricing pages", () => {
const result = getGenericPathMatch("https://example.com/pricing");
expect(result.page_type).toBe("pricing_page");
expect(result.confidence).toBe(90);
});
});
Performance
Lazy Loading
The domain database (858KB) is lazy-loaded to avoid a cold-start penalty:
// Not loaded on module initialization
let domainDbModule = null;
// Only loaded when first classification runs
async function getDomainDb() {
if (!domainDbModule) {
domainDbModule = await import("../../data/domain-database.js");
}
return domainDbModule;
}
Pattern Compilation
Regex patterns are compiled once at module load time, not per-request:
// Patterns are RegExp objects, not strings
export const GENERIC_PATH_PATTERNS = [
{
pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, // Pre-compiled
// ...
}
];
Early Exit
The classifier exits early when high confidence is reached:
// Skip the expensive stages once the core dimensions are classified and the
// lowest core confidence is at least 80 (a 95% domain database hit, for example)
if (minCoreConfidence >= 80 && hasCoreClassifications) {
result.needs_vectorize = false;
result.needs_llm = false;
}
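The snippet above uses `minCoreConfidence` without showing how it is derived. One hypothetical derivation follows, assuming the core dimensions are `domain_type`, `tier1_type`, and `structural_type` (the source does not list them) and reusing the threshold of 80 from the condition. `applyEarlyExit` is an invented name, and setting both flags together is a simplification.

```javascript
// Assumed list of core dimensions; the source does not define it.
const CORE_DIMENSIONS = ["domain_type", "tier1_type", "structural_type"];

// Hypothetical helper: decide whether later stages are still needed.
function applyEarlyExit(result, threshold = 80) {
  const core = CORE_DIMENSIONS.map((name) => result[name]);
  const hasCoreClassifications = core.every((d) => d && d.value != null);
  const minCoreConfidence = hasCoreClassifications
    ? Math.min(...core.map((d) => d.confidence))
    : 0;
  const confident = hasCoreClassifications && minCoreConfidence >= threshold;
  return { ...result, needs_vectorize: !confident, needs_llm: !confident };
}

const complete = applyEarlyExit({
  domain_type: { value: "saas_product", confidence: 95 },
  tier1_type: { value: "platform", confidence: 95 },
  structural_type: { value: "article", confidence: 85 },
});

const partial = applyEarlyExit({
  domain_type: { value: "saas_product", confidence: 95 },
  tier1_type: { value: "platform", confidence: 95 },
  structural_type: { value: null, confidence: 0 },
});
```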
Pattern Stats
Track pattern coverage:
import { PATTERN_STATS } from "./url-patterns/index.js";
console.log(PATTERN_STATS);
// {
// generic: 180,
// platforms: 45,
// platformPatterns: 520,
// platformBreakdown: {
// social: 12,
// publishing: 8,
// commerce: 10,
// devTools: 6,
// media: 5,
// jobs: 4,
// local: 5,
// other: 5
// }
// }
Common Pattern Examples
Most Useful Generic Patterns
// Homepage (99% confidence)
/^https?:\/\/[^\/]+\/?$/i
// Matches: https://example.com, https://example.com/
// Blog post (75% confidence)
/\/(blog|blogs)\/[^\/]+\/?$/i
// Matches: /blog/my-article, /blogs/great-post/
// Product page (80% confidence)
/\/(products?)\/[^\/]+\/?$/i
// Matches: /product/widget-123, /products/blue-shoes
// Documentation (80% confidence)
/\/(docs?|documentation)\/[^\/]+/i
// Matches: /docs/getting-started, /documentation/api/auth
// Pricing page (90% confidence)
/\/(pricing|plans|packages)\/?$/i
// Matches: /pricing, /plans/, /packages
// Legal pages (95% confidence)
/\/(privacy|privacy-policy|terms|tos)/i
// Matches: /privacy, /privacy-policy, /terms-of-service
// Profile page (85-90% confidence)
/\/@[^\/]+\/?$/i
// Matches: /@username, /@john-doe/
/\/(user|users|member|members)\/[^\/]+/i
// Matches: /user/john, /members/12345
// Date-based news (85% confidence)
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// Matches: /2024/01/15/breaking-news
// Listicle (85-90% confidence)
/\/\d+-best[-\/]/i
// Matches: /10-best-tools/, /5-best-practices
// Location page (80-85% confidence)
/\/locations?\/[^\/]+/i
// Matches: /location/new-york, /locations/california
// State-based location (85% confidence)
/\/[a-z-]+-(?:ca|tx|ny|fl|wa)\/?$/i
// Matches: /houston-tx/, /seattle-wa/
Platform-Specific Pattern Examples
// YouTube video (95%)
/youtube\.com\/watch\?v=[A-Za-z0-9_-]+/i
// Matches: youtube.com/watch?v=dQw4w9WgXcQ
// GitHub repo (90%)
/github\.com\/[^\/]+\/[^\/]+\/?$/i
// Matches: github.com/owner/repo
// Reddit thread (95%)
/reddit\.com\/r\/[^\/]+\/comments\//i
// Matches: reddit.com/r/programming/comments/abc123/title
// Twitter/X post (95%)
/(twitter|x)\.com\/[^\/]+\/status\/\d+/i
// Matches: twitter.com/user/status/123456789
// Stack Overflow question (95%)
/stackoverflow\.com\/questions\/\d+/i
// Matches: stackoverflow.com/questions/12345/how-to-do-x
// Medium article (95%)
/medium\.com\/@[^\/]+\/[^\/]+-[a-f0-9]+$/i
// Matches: medium.com/@user/article-title-abc123def
// Amazon product (95%)
/amazon\.com\/dp\/[A-Z0-9]+/i
// Matches: amazon.com/dp/B08N5WRWNW
Source Files
| File | Purpose |
|---|---|
| src/lib/classification/classifier-rules-engine.js | Main rules engine |
| src/lib/classification/classification-constants.js | Enums, mappings, helpers |
| src/data/domain-database.js | Generated domain classifications |
| src/data/url-patterns/index.js | Pattern aggregator |
| src/data/url-patterns/generic-patterns.js | Platform-agnostic patterns |
| src/data/url-patterns/platforms/*.js | Platform-specific patterns |
| src/data/url-patterns/validation.js | Pattern validation |
Related Documentation
- Classification Dimensions - All valid classification values
- Domain Classification - Domain classification pipeline
- URL Classification - Full classification pipeline
- Pipeline Plan - Multi-stage classification overview