Rules Engine Documentation

The Rules Engine is Stage 1 of the classification pipeline - a fast, free classification system based on pattern matching. It classifies URLs by analyzing domains, TLDs, URL paths, and known patterns before falling back to more expensive stages (Vectorize, LLM).

Overview

How Pattern Matching Works

The rules engine processes URLs through a series of cascading rules, each with associated confidence scores. Rules are evaluated in priority order, and the first high-confidence match typically wins.

URL Input
    |
    v
+------------------+
| 1. Owned Domain  | --> 100% confidence if URL matches target_domain
+------------------+
    |
    v
+------------------+
| 2. Domain DB     | --> 95% confidence (4,400+ curated domains)
+------------------+
    |
    v
+------------------+
| 3. TLD Rules     | --> 95% confidence (.gov, .edu, etc.)
+------------------+
    |
    v
+------------------+
| 4. Platform      | --> 85% confidence (YouTube, Reddit, etc.)
+------------------+
    |
    v
+------------------+
| 5. URL Patterns  | --> 70-90% confidence (path-based detection)
+------------------+
    |
    v
+------------------+
| 6. Spam Signals  | --> 70-85% confidence (risk detection)
+------------------+
    |
    v
Classification Result (or pass to Vectorize/LLM)
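
The cascade above can be pictured as an ordered list of stage functions in which the first stage that returns a result ends the search. This is a minimal sketch, not the engine's actual code: `runCascade`, the two stage functions, the input shape, and the `"owned"` value are hypothetical. Only the confidence values and the idea of matching against `target_domain` come from the diagram.

```javascript
// Hypothetical stages: each returns null (no match) or { value, confidence, source }.
function ownedDomainStage({ hostname, targetDomain }) {
  return hostname === targetDomain
    ? { value: "owned", confidence: 100, source: "owned_domain" }
    : null;
}

function tldStage({ hostname }) {
  if (hostname.endsWith(".gov")) {
    return { value: "government_site", confidence: 95, source: "tld_rule" };
  }
  return null;
}

// Walk the stages in order; the first hit wins. A null result means the URL
// would be handed to the later pipeline stages (Vectorize/LLM).
function runCascade(input, stages) {
  for (const stage of stages) {
    const hit = stage(input);
    if (hit) return hit;
  }
  return null;
}

const stages = [ownedDomainStage, tldStage];
const owned = runCascade({ hostname: "example.com", targetDomain: "example.com" }, stages);
const gov = runCascade({ hostname: "irs.gov", targetDomain: "example.com" }, stages);
const miss = runCascade({ hostname: "unknown.io", targetDomain: "example.com" }, stages);
console.log(owned.source, gov.source, miss);
```

Keeping the stages in an array makes the evaluation order explicit and easy to change in one place.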

Classification Output Structure

Each dimension is classified independently with its own confidence:

{
  domain_type: { value: "saas_product", confidence: 95, source: "domain_database_v2" },
  tier1_type: { value: "platform", confidence: 95, source: "domain_database_v2" },
  structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" },
  channel_bucket: { value: "pr_earned_media", confidence: 80, source: "derived_from_domain_type" },
  page_type: { value: "blog_post", confidence: 75, source: "url_path_pattern" },
  quality_tier: { value: "tier_1", confidence: 100, source: "domain_rank" },
  modifiers: ["contextual_link"],
  
  classification_source: "rule",
  final_confidence: 95,  // Max of all dimensions
  rules_applied: ["domain_database_v2", "url_path_pattern"],
  needs_vectorize: false,
  needs_llm: false
}

Domain Database

The domain database contains 4,400+ curated domain classifications, providing high-confidence (95%) instant classification for known domains.

Database Structure

Location: packages/api/src/data/domain-database.js

Generated from: classification-data/domains/_master.csv

export const DOMAIN_DATABASE = {
  "salesforce.com": {
    domain_type: "saas_product",
    tier1_type: "platform",
    industry: "crm",
    subcategory: "enterprise_crm",
    status: "covered",
    source: "v2_master",
  },
  "nytimes.com": {
    domain_type: "news_publisher",
    tier1_type: "information",
    industry: "news",
    subcategory: "general",
    status: "covered",
    source: "v2_master",
  },
  // ... 4,400+ more domains
};
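
A lookup against a map shaped like `DOMAIN_DATABASE` would probably normalize the hostname first. The sketch below is an assumption about how such a lookup could work, not the engine's implementation: `lookupDomain`, its `www.` stripping, and its parent-domain fallback are hypothetical, and the sample map reuses two entries from above.

```javascript
const SAMPLE_DB = {
  "salesforce.com": { domain_type: "saas_product", tier1_type: "platform" },
  "nytimes.com": { domain_type: "news_publisher", tier1_type: "information" },
};

function lookupDomain(hostname, db) {
  let host = hostname.toLowerCase().replace(/^www\./, "");
  // Try the full host, then each parent domain
  // (help.salesforce.com -> salesforce.com).
  while (host.includes(".")) {
    if (Object.prototype.hasOwnProperty.call(db, host)) {
      return { domain: host, ...db[host] };
    }
    host = host.slice(host.indexOf(".") + 1);
  }
  return null;
}

const direct = lookupDomain("WWW.NYTimes.com", SAMPLE_DB);
const subdomain = lookupDomain("help.salesforce.com", SAMPLE_DB);
const unknown = lookupDomain("unknown.io", SAMPLE_DB);
console.log(direct.domain_type, subdomain.domain, unknown);
```

Because the database is a plain object keyed by domain, each candidate host costs one property lookup.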

Domain Type Distribution

Domain Type	Count	Examples
`saas_product`	1,639	Slack, Notion, Salesforce
`ecommerce_store`	536	Nike.com, Warby Parker
`government_site`	257	IRS.gov, CDC.gov
`travel_booking`	256	Delta, Marriott
`financial_institution`	245	Chase, PayPal
`content_publisher`	200+	BuzzFeed, HuffPost
`healthcare_institution`	150+	Mayo Clinic, Cleveland Clinic

Lazy Loading

The domain database is lazy-loaded so that the 858KB module does not add to cold-start time:

let domainDbModule = null;

async function getDomainDb() {
  if (!domainDbModule) {
    domainDbModule = await import("../../data/domain-database.js");
  }
  return domainDbModule;
}
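
The same idea can be written generically by caching the promise rather than the resolved module, so that concurrent first callers share a single load. This is a sketch under assumptions: `lazy` is a hypothetical helper, and the loader here is a stand-in that simulates `import("../../data/domain-database.js")`.

```javascript
// Wrap any async loader so it runs at most once.
function lazy(loader) {
  let promise = null;
  return function get() {
    if (promise === null) promise = loader();
    return promise;
  };
}

let loadCount = 0;
const getDomainDb = lazy(async () => {
  loadCount += 1; // counts how many times the "module" is actually loaded
  return { DOMAIN_DATABASE: { "salesforce.com": { domain_type: "saas_product" } } };
});

// Two concurrent callers receive the same promise and trigger one load.
const first = getDomainDb();
const second = getDomainDb();
first.then((mod) => console.log(Object.keys(mod.DOMAIN_DATABASE).length, loadCount));
```

Note that in this sketch a rejected load stays cached. A production version might reset the cached promise on failure so a later call can retry.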

Adding Domains to Database

1. Edit classification-data/domains/_master.csv
2. Run npm run build:domains
3. The build regenerates domain-database.js. The file is generated output, so make changes in the CSV rather than in the file itself.
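
What the build step does can be pictured as a CSV-to-object transform. The actual `build:domains` script and the column layout of `_master.csv` are not shown in this document, so everything below is an assumption: the column names simply mirror the fields of `DOMAIN_DATABASE`, `buildDomainDatabase` is a hypothetical function, and the parser ignores quoted commas.

```javascript
// Assumed CSV layout: a header row, then one domain per line, no quoted commas.
const csv = [
  "domain,domain_type,tier1_type,industry,subcategory,status,source",
  "salesforce.com,saas_product,platform,crm,enterprise_crm,covered,v2_master",
  "nytimes.com,news_publisher,information,news,general,covered,v2_master",
].join("\n");

function buildDomainDatabase(text) {
  const [headerLine, ...rows] = text.trim().split(/\r?\n/);
  // The first column is the key; the remaining columns become the entry's fields.
  const fields = headerLine.split(",").slice(1);
  const db = {};
  for (const row of rows) {
    if (!row.trim()) continue; // skip blank lines
    const [domain, ...values] = row.split(",");
    db[domain.toLowerCase()] = Object.fromEntries(
      fields.map((field, i) => [field, values[i] ?? ""])
    );
  }
  return db;
}

const db = buildDomainDatabase(csv);
console.log(Object.keys(db).length, db["salesforce.com"].industry);
```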

URL Patterns

Pattern Syntax

URL patterns use JavaScript regular expressions with specific conventions:

{
  name: "blog_post",                    // Human-readable identifier
  pattern: /\/(blog|blogs)\/[^\/]+\/?$/i,  // Regex pattern
  page_type: "blog_post",               // Classification result
  page_type_category: "editorial",      // Category grouping
  confidence: 75,                       // 0-100 score
  priority: 21,                         // Lower = checked first
}
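
A pattern object of this shape could be consumed by a matcher that sorts by `priority` and returns the first hit. This is a minimal sketch under assumptions: `matchPath` and the `blog_index` name are hypothetical, priority 20 for the index pattern is a guess within the editorial range, and the sketch tests against the URL's pathname, whereas the real engine may test the full URL.

```javascript
const PATTERNS = [
  {
    name: "blog_post",
    pattern: /\/(blog|blogs)\/[^\/]+\/?$/i,
    page_type: "blog_post",
    confidence: 75,
    priority: 21,
  },
  {
    name: "blog_index", // hypothetical name and priority
    pattern: /\/(blog|blogs|articles?|news|posts?|stories)\/?$/i,
    page_type: "category_index_page",
    confidence: 80,
    priority: 20,
  },
];

// Sort once so that lower priority numbers are tested first.
const ORDERED = [...PATTERNS].sort((a, b) => a.priority - b.priority);

function matchPath(url, ordered = ORDERED) {
  const { pathname } = new URL(url);
  for (const rule of ordered) {
    if (rule.pattern.test(pathname)) {
      return { page_type: rule.page_type, confidence: rule.confidence, source: rule.name };
    }
  }
  return null;
}

const post = matchPath("https://example.com/blog/my-post");
const index = matchPath("https://example.com/blog/");
const none = matchPath("https://example.com/pricing");
console.log(post.page_type, index.page_type, none);
```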

Pattern Components

Component	Description	Example
`^`	Start of string	`^https?:\/\/`
`$`	End of string	`\/?$`
`\/`	Literal slash	`\/blog\/`
`[^\/]+`	One or more characters other than a slash	`/blog/[^\/]+/`
`\d+`	One or more digits	`/post/\d+`
`[a-z]{2}`	Exactly 2 letters	`/en/`, `/fr/`
`(?:...)`	Non-capturing group	`(?:blog
`(...)`	Capturing group	Avoid in patterns
`i`	Case insensitive flag	`/BLOG/i` matches `/blog/`

Common Pattern Examples

// Homepage detection
/^https?:\/\/[^\/]+\/?$/i
// Matches: https://example.com, https://example.com/

// Blog post with slug
/\/(blog|blogs)\/[^\/]+\/?$/i
// Matches: /blog/my-post/, /blogs/article-title

// Date-based article (news sites)
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// Matches: /2024/01/15/breaking-news

// Profile page with @ symbol
/\/@[^\/]+\/?$/i
// Matches: /@username, /@john-doe/

// Product page with ID
/\/(products?)\/[^\/]+\/?$/i
// Matches: /product/widget-123, /products/blue-shoes/

// Listicle pattern
/\/\d+-best[-\/]/i
// Matches: /10-best-tools/, /5-best-practices-for-...

// State-based location page
/\/[a-z-]+-(?:ca|tx|ny|fl)\/?$/i
// Matches: /houston-tx/, /los-angeles-ca/

Pattern Categories

Generic Path Patterns

Location: packages/api/src/data/url-patterns/generic-patterns.js

Platform-agnostic patterns that work across any website:

Editorial Patterns (priority 20-29)

// Blog/Article index
/\/(blog|blogs|articles?|news|posts?|stories)\/?$/i
// -> category_index_page (confidence: 80)

// Blog post with slug
/\/(blog|blogs)\/[^\/]+\/?$/i
// -> blog_post (confidence: 75)

// Date-based articles
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// -> news_article (confidence: 85)

// Press releases
/\/(press-releases?|news-releases?|newsroom)\/[^\/]+/i
// -> press_release (confidence: 85)

// Case studies
/\/(case-stud(y|ies)|success-stor(y|ies))\/[^\/]+/i
// -> case_study_page (confidence: 80)

Listicle Patterns (priority 29)

// Number-prefixed listicles
/\/top-\d+[-\/]/i      // -> listicle (90)
/\/best-\d+[-\/]/i     // -> listicle (90)
/\/\d+-best[-\/]/i     // -> listicle (90)
/\/\d+-ways[-\/]/i     // -> listicle (85)
/\/\d+-tips[-\/]/i     // -> listicle (85)
/\/\d+-reasons[-\/]/i  // -> listicle (85)
/\/\d+-tools[-\/]/i    // -> listicle (80)

// Comparison patterns ("-vs-" sits inside a path segment, e.g. /slack-vs-teams)
/-vs-/i                // -> comparison_page (85)
/-versus-/i            // -> comparison_page (85)

Commercial Patterns (priority 31-50)

// Product pages
/\/(products?)\/[^\/]+\/?$/i
// -> product_page (confidence: 80)

// Category pages
/\/(categor(y|ies)|collections?)\/[^\/]+/i
// -> category_page (confidence: 75-80)

// Pricing
/\/(pricing|plans|packages)\/?$/i
// -> pricing_page (confidence: 90)

// Checkout
/\/(cart|basket|checkout)\/?$/i
// -> checkout_page (confidence: 95)

// Landing pages
/\/(get-started|demo|request-demo|contact-sales)\/?$/i
// -> landing_page (confidence: 85)

Documentation Patterns (priority 51-65)

// Docs index
/\/(docs?|documentation)\/?$/i
// -> category_index_page (confidence: 85)

// Doc pages
/\/(docs?|documentation)\/[^\/]+/i
// -> documentation_page (confidence: 80)

// API reference
/\/(api-?(docs|reference)?)\/[^\/]+/i
// -> api_reference_page (confidence: 85)

// Guides/tutorials
/\/(guides?|tutorials?)\/[^\/]+/i
// -> howto_article (confidence: 80)

// FAQ
/\/(faq|frequently-asked)\/?/i
// -> faq_page (confidence: 85-90)

Utility Patterns (priority 70-95)

// About pages
/\/(about|about-us|company)\/?$/i
// -> about_page (confidence: 85)

// Contact
/\/(contact|contact-us)\/?$/i
// -> contact_page (confidence: 90)

// Careers
/\/(careers|jobs)\/?$/i
// -> jobs_listing (confidence: 90)

// Legal pages
/\/(privacy|privacy-policy)/i   // -> legal_privacy_page (95)
/\/(terms|tos)/i                // -> legal_terms_page (95)

// Auth pages
/\/(login|signin)\/?$/i         // -> login_page (90)
/\/(signup|register)\/?$/i      // -> signup_page (90)

UGC Patterns (priority 11-85)

// Profile pages
/\/@[^\/]+\/?$/i                       // -> profile_page (90)
/\/(user|users|member|members)\/[^\/]+/i  // -> profile_page (85)
/\/(author|authors)\/[^\/]+/i          // -> profile_page (85)

// Forums
/\/(forum|forums|community)\/?$/i      // -> category_index_page (80)
/\/(thread|topic|discussion)\/[^\/]+/i // -> forum_thread (80)

Platform-Specific Patterns

Location: packages/api/src/data/url-patterns/platforms/

Higher-confidence patterns for known platforms:

Social Platforms (`social.js`)

// YouTube
youtube: {
  domains: /youtube\.com|youtu\.be/i,
  patterns: [
    { pattern: /\/watch\?v=[A-Za-z0-9_-]+/i, page_type: "video_page", confidence: 95 },
    { pattern: /\/@[^\/]+\/?$/i, page_type: "profile_page", confidence: 95 },
    { pattern: /\/channel\/[^\/]+/i, page_type: "profile_page", confidence: 95 },
    { pattern: /\/playlist\?list=/i, page_type: "playlist_page", confidence: 95 },
    { pattern: /\/shorts\/[^\/]+/i, page_type: "video_page", confidence: 95 },
  ]
}

// Reddit
reddit: {
  domains: /reddit\.com/i,
  patterns: [
    { pattern: /\/r\/[^\/]+\/?$/i, page_type: "category_index_page", confidence: 95 },
    { pattern: /\/r\/[^\/]+\/comments\//i, page_type: "forum_thread", confidence: 95 },
    { pattern: /\/user\/[^\/]+/i, page_type: "profile_page", confidence: 90 },
  ]
}

// Twitter/X
twitter: {
  domains: /(?:^|[\/.])(?:twitter|x)\.com/i,  // boundary stops x.com from matching hosts like netflix.com
  patterns: [
    { pattern: /\/[^\/]+\/status\/\d+/i, page_type: "social_post", confidence: 95 },
    { pattern: /\/hashtag\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
    { pattern: /\/i\/spaces\/[A-Za-z0-9]+/i, page_type: "video_page", confidence: 90 },
  ]
}

Publishing Platforms (`publishing.js`)

// Medium
medium: {
  domains: /medium\.com/i,
  patterns: [
    { pattern: /\/@[^\/]+\/[^\/]+/i, page_type: "blog_post", confidence: 95 },
    { pattern: /\/tag\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
  ]
}

// Substack
substack: {
  domains: /[^.]+\.substack\.com/i,
  patterns: [
    { pattern: /\/p\/[^\/]+/i, page_type: "blog_post", confidence: 95 },
    { pattern: /\/archive/i, page_type: "category_index_page", confidence: 90 },
  ]
}

E-commerce Platforms (`commerce.js`)

// Shopify stores
shopify: {
  domains: /[^.]+\.myshopify\.com/i,
  patterns: [
    { pattern: /\/products\/[^\/]+/i, page_type: "product_page", confidence: 95 },
    { pattern: /\/collections\/[^\/]+/i, page_type: "category_page", confidence: 95 },
  ]
}

// Amazon
amazon: {
  domains: /amazon\.(com|co\.uk|de|fr|es|it|ca|jp)/i,
  patterns: [
    { pattern: /\/dp\/[A-Z0-9]+/i, page_type: "product_page", confidence: 95 },
    { pattern: /\/gp\/product\/[A-Z0-9]+/i, page_type: "product_page", confidence: 95 },
    { pattern: /\/s\?/i, page_type: "search_results_page", confidence: 90 },
  ]
}

Developer Tools (`dev-tools.js`)

// GitHub
github: {
  domains: /github\.com/i,
  patterns: [
    { pattern: /\/[^\/]+\/[^\/]+\/?$/i, page_type: "repository_page", confidence: 90 },
    { pattern: /\/[^\/]+\/[^\/]+\/issues\/\d+/i, page_type: "forum_thread", confidence: 95 },
    { pattern: /\/[^\/]+\/[^\/]+\/pull\/\d+/i, page_type: "forum_thread", confidence: 95 },
    { pattern: /\/[^\/]+\/[^\/]+\/blob\//i, page_type: "documentation_page", confidence: 85 },
  ]
}

// Stack Overflow
stackoverflow: {
  domains: /stackoverflow\.com|stackexchange\.com/i,
  patterns: [
    { pattern: /\/questions\/\d+/i, page_type: "qna_page", confidence: 95 },
    { pattern: /\/users\/\d+/i, page_type: "profile_page", confidence: 90 },
    { pattern: /\/tags\/[^\/]+/i, page_type: "archive_page", confidence: 90 },
  ]
}

Priority Rules

Rule Evaluation Order

Rules are evaluated in strict priority order. Lower priority numbers are checked first:

Priority	Rule Type	Confidence
N/A	Owned domain match	100%
N/A	Domain database lookup	95%
N/A	TLD rules (.gov, .edu)	95%
N/A	Platform detection	85%
1-10	Homepage patterns	95-99%
11-19	Profile patterns	80-90%
20-29	Editorial patterns	75-85%
30-50	Commercial patterns	75-95%
51-70	Documentation patterns	75-85%
71-100	Utility patterns	70-95%
100+	Fallback patterns	70-80%

Conflict Resolution

When multiple patterns could match:

1. Platform-specific patterns win over generic patterns for known domains.
2. Higher-confidence patterns take precedence within the same category.
3. The lower priority number wins when confidence is equal.
4. More specific patterns (longer regexes) generally have lower priority numbers.

// Example: URL "/blog/my-article" on medium.com

// Platform pattern (checked first for medium.com domain)
medium: { pattern: /\/[^\/]+$/i, page_type: "blog_post", confidence: 90 }

// Generic pattern (would be checked if not Medium)
{ pattern: /\/(blog)\/[^\/]+\/?$/i, page_type: "blog_post", confidence: 75 }

// Result: Platform pattern wins with confidence 90
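
The platform-before-generic rule can be sketched as a two-step lookup. This is an illustration, not the engine's code: `resolvePageType` and `firstMatch` are hypothetical names, the pattern tables are trimmed copies of entries shown earlier, and falling back to generic patterns when a known platform has no matching pattern is an assumption.

```javascript
const PLATFORMS = {
  medium: {
    domains: /medium\.com/i,
    patterns: [{ pattern: /\/@[^\/]+\/[^\/]+/i, page_type: "blog_post", confidence: 95 }],
  },
};

const GENERIC = [
  { pattern: /\/(blog|blogs)\/[^\/]+\/?$/i, page_type: "blog_post", confidence: 75 },
];

function firstMatch(patterns, pathname, source) {
  const rule = patterns.find((candidate) => candidate.pattern.test(pathname));
  return rule ? { page_type: rule.page_type, confidence: rule.confidence, source } : null;
}

function resolvePageType(url) {
  const { hostname, pathname } = new URL(url);
  // Step 1: platform patterns, only for hosts the platform claims.
  for (const [name, platform] of Object.entries(PLATFORMS)) {
    if (!platform.domains.test(hostname)) continue;
    const hit = firstMatch(platform.patterns, pathname, `platform:${name}`);
    if (hit) return hit;
  }
  // Step 2: generic patterns for everything else.
  return firstMatch(GENERIC, pathname, "generic");
}

const onMedium = resolvePageType("https://medium.com/@user/article-title-abc123");
const elsewhere = resolvePageType("https://example.com/blog/my-post");
console.log(onMedium.source, elsewhere.source);
```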

Structural Type Validation

The rules engine enforces parent-child relationships between structural_type and page_type:

// Valid combinations:
//   structural_type "article" -> page_type "blog_post", "news_article", "press_release"
//   structural_type "listing" -> page_type "category_page", "search_results_page"
//   structural_type "detail"  -> page_type "product_page", "profile_page"
//   structural_type "thread"  -> page_type "forum_thread", "qna_page"

// If page_type doesn't match structural_type, it's cleared:
if (!isPageTypeValidForStructural(pageType, structuralType)) {
  result.page_type = { value: null, confidence: 0, source: null };
}
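
One way `isPageTypeValidForStructural` could be implemented is a lookup table built from the combinations listed above. This is a sketch, not the engine's code: the table holds only the four pairings this section names, and treating an unknown `structural_type` as unconstrained is an assumption. `enforceStructure` is a hypothetical wrapper around the clearing step.

```javascript
const VALID_PAGE_TYPES = new Map([
  ["article", new Set(["blog_post", "news_article", "press_release"])],
  ["listing", new Set(["category_page", "search_results_page"])],
  ["detail", new Set(["product_page", "profile_page"])],
  ["thread", new Set(["forum_thread", "qna_page"])],
]);

function isPageTypeValidForStructural(pageType, structuralType) {
  const allowed = VALID_PAGE_TYPES.get(structuralType);
  // Assumption: structural types missing from the table impose no constraint.
  if (!allowed) return true;
  return allowed.has(pageType);
}

// Returns a copy of the result with page_type cleared when the pair is invalid.
function enforceStructure(result) {
  const valid = isPageTypeValidForStructural(
    result.page_type.value,
    result.structural_type.value
  );
  if (valid) return result;
  return { ...result, page_type: { value: null, confidence: 0, source: null } };
}

const kept = enforceStructure({
  structural_type: { value: "article", confidence: 85 },
  page_type: { value: "blog_post", confidence: 75 },
});
const cleared = enforceStructure({
  structural_type: { value: "article", confidence: 85 },
  page_type: { value: "product_page", confidence: 80 },
});
console.log(kept.page_type.value, cleared.page_type.value);
```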

Adding New Patterns

Step 1: Identify Pattern Need

Analyze URLs that are being misclassified or sent to the LLM stage unnecessarily.

Step 2: Choose Location

Pattern Type	File Location
Generic path pattern	`url-patterns/generic-patterns.js`
Platform-specific	`url-patterns/platforms/{category}.js`
New platform	Create in `url-patterns/platforms/`

Step 3: Write the Pattern

{
  name: "podcast_episode",              // Descriptive name
  pattern: /\/podcasts?\/[^\/]+/i,      // Test thoroughly!
  page_type: "podcast_episode",         // Valid page_type from constants
  page_type_category: "editorial",      // Category grouping
  confidence: 88,                       // Be conservative
  priority: 119,                        // Higher = checked later
}

Step 4: Add Pattern

For generic patterns, add to the array in generic-patterns.js:

export const GENERIC_PATH_PATTERNS = [
  // ... existing patterns ...
  
  // New podcast patterns
  {
    name: "podcast_episode",
    pattern: /\/podcasts?\/[^\/]+/i,
    page_type: "podcast_episode",
    page_type_category: "editorial",
    confidence: 88,
    priority: 119,
  },
];

For platform-specific patterns, add to the platform's patterns array:

spotify: {
  domains: /spotify\.com/i,  // also matches open.spotify.com
  patterns: [
    // ... existing ...
    {
      pattern: /\/episode\/[A-Za-z0-9]+/i,
      page_type: "podcast_episode",
      page_type_category: "editorial",
      confidence: 95,
    },
  ]
}

Step 5: Validate

Patterns are validated on module load against PAGE_TYPES constants:

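// validation.js
// Assumes PAGE_TYPES is exported by classification-constants.js (see Source Files).
import { PAGE_TYPES } from "../../lib/classification/classification-constants.js";

export function validatePatterns(patterns, name) {
  for (const p of patterns) {
    // PAGE_TYPES is keyed by the upper-cased page_type value.
    if (!PAGE_TYPES[p.page_type.toUpperCase()]) {
      throw new Error(`Invalid page_type: ${p.page_type} in ${name}`);
    }
  }
}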

Testing Patterns

Manual Testing

Use the classification endpoint to test individual URLs:

curl -X POST https://api.example.com/admin/classify-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog/my-article"}'

Batch Testing

Test patterns against real URLs from your database:

import { classifyWithRules } from "./classifier-rules-engine.js";

const testUrls = [
  "https://medium.com/@user/article-title-abc123",
  "https://github.com/owner/repo",
  "https://example.com/blog/my-post",
];

for (const url of testUrls) {
  const result = await classifyWithRules({
    url,
    domain: new URL(url).hostname,
  });
  console.log(url);
  console.log("  page_type:", result.page_type);
  console.log("  rules_applied:", result.rules_applied);
}

Regex Testing Tips

// Test your regex patterns:
const pattern = /\/(blog|blogs)\/[^\/]+\/?$/i;

// Should match:
console.log(pattern.test("/blog/my-article"));      // true
console.log(pattern.test("/blogs/my-article/"));    // true
console.log(pattern.test("/BLOG/Article-Title"));   // true (case insensitive)

// Should NOT match:
console.log(pattern.test("/blog/"));                // false (no slug)
console.log(pattern.test("/blog/cat/article"));     // false (nested path)
console.log(pattern.test("/myblog/article"));       // false (wrong prefix)
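
The same checks can be made table-driven so that each new pattern ships with its positive and negative cases. This is a small sketch, and `checkPattern` is a hypothetical helper rather than part of the codebase.

```javascript
// Returns a list of human-readable failures; an empty list means all cases passed.
function checkPattern(pattern, shouldMatch, shouldNotMatch) {
  const failures = [];
  for (const path of shouldMatch) {
    if (!pattern.test(path)) failures.push(`expected match: ${path}`);
  }
  for (const path of shouldNotMatch) {
    if (pattern.test(path)) failures.push(`unexpected match: ${path}`);
  }
  return failures;
}

const blogFailures = checkPattern(
  /\/(blog|blogs)\/[^\/]+\/?$/i,
  ["/blog/my-article", "/blogs/my-article/", "/BLOG/Article-Title"],
  ["/blog/", "/blog/cat/article", "/myblog/article"]
);

// A deliberately wrong expectation, to show what a failure looks like.
const pricingFailures = checkPattern(/\/pricing\/?$/i, ["/pricing/annual"], []);

console.log(blogFailures, pricingFailures);
```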

Unit Test Example

import { getGenericPathMatch } from "./url-patterns/index.js";

describe("Generic Path Patterns", () => {
  it("should match blog posts", () => {
    const result = getGenericPathMatch("https://example.com/blog/my-article");
    expect(result.page_type).toBe("blog_post");
    expect(result.confidence).toBeGreaterThanOrEqual(70);
  });
  
  it("should match pricing pages", () => {
    const result = getGenericPathMatch("https://example.com/pricing");
    expect(result.page_type).toBe("pricing_page");
    expect(result.confidence).toBe(90);
  });
});

Performance

Lazy Loading

The domain database (858KB) is lazy-loaded to avoid a cold-start penalty:

// Not loaded on module initialization
let domainDbModule = null;

// Only loaded when first classification runs
async function getDomainDb() {
  if (!domainDbModule) {
    domainDbModule = await import("../../data/domain-database.js");
  }
  return domainDbModule;
}

Pattern Compilation

Regex patterns are compiled once at module load time, not per-request:

// Patterns are RegExp objects, not strings
export const GENERIC_PATH_PATTERNS = [
  {
    pattern: /\/(blog|blogs)\/[^\/]+\/?$/i,  // Pre-compiled
    // ...
  }
];
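
One caveat when sharing pre-compiled `RegExp` objects across requests: a regex carrying the `g` (or `y`) flag keeps `lastIndex` state between `test()` calls, so a shared pattern can alternate between matching and not matching the same input. The patterns in this document use only the `i` flag, which avoids the problem. The snippet below demonstrates the general JavaScript behavior; the variable names are illustrative.

```javascript
const shared = /\/blog\//i;        // no g flag: stateless across calls
const sharedGlobal = /\/blog\//gi; // g flag: remembers lastIndex between calls

const path = "/blog/my-post";

// Calling test() twice on the same input:
const stable = [shared.test(path), shared.test(path)];
const flaky = [sharedGlobal.test(path), sharedGlobal.test(path)];

console.log(stable); // [ true, true ]
console.log(flaky);  // [ true, false ]
```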

Early Exit

The classifier exits early when high confidence is reached:

// If domain database returns 95% confidence, skip expensive stages
if (minCoreConfidence >= 80 && hasCoreClassifications) {
  result.needs_vectorize = false;
  result.needs_llm = false;
}
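
How the two inputs to that check might be derived is sketched below. The document does not say which dimensions count as "core", so the list here is an assumption chosen to agree with the sample output earlier (95, 95, and 85 yield no further stages). `applyEarlyExit` is a hypothetical name, and setting both flags to `true` when the check fails is also an assumption.

```javascript
// Assumption: these are the "core" dimensions; the real list is not documented here.
const CORE_DIMENSIONS = ["domain_type", "tier1_type", "structural_type"];

function applyEarlyExit(result, threshold = 80) {
  const core = CORE_DIMENSIONS.map((key) => result[key]);
  const hasCoreClassifications = core.every((dim) => dim && dim.value != null);
  const minCoreConfidence = hasCoreClassifications
    ? Math.min(...core.map((dim) => dim.confidence))
    : 0;
  const done = hasCoreClassifications && minCoreConfidence >= threshold;
  return { ...result, needs_vectorize: !done, needs_llm: !done };
}

const confident = applyEarlyExit({
  domain_type: { value: "saas_product", confidence: 95 },
  tier1_type: { value: "platform", confidence: 95 },
  structural_type: { value: "article", confidence: 85 },
});
const weak = applyEarlyExit({
  domain_type: { value: "saas_product", confidence: 95 },
  tier1_type: { value: "platform", confidence: 95 },
  structural_type: { value: "article", confidence: 60 },
});
const incomplete = applyEarlyExit({
  domain_type: { value: "saas_product", confidence: 95 },
});
console.log(confident.needs_llm, weak.needs_llm, incomplete.needs_llm);
```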

Pattern Stats

Track pattern coverage:

import { PATTERN_STATS } from "./url-patterns/index.js";

console.log(PATTERN_STATS);
// {
//   generic: 180,
//   platforms: 45,
//   platformPatterns: 520,
//   platformBreakdown: {
//     social: 12,
//     publishing: 8,
//     commerce: 10,
//     devTools: 6,
//     media: 5,
//     jobs: 4,
//     local: 5,
//     other: 5
//   }
// }

Common Pattern Examples

Most Useful Generic Patterns

// Homepage (99% confidence)
/^https?:\/\/[^\/]+\/?$/i
// Matches: https://example.com, https://example.com/

// Blog post (75% confidence)
/\/(blog|blogs)\/[^\/]+\/?$/i
// Matches: /blog/my-article, /blogs/great-post/

// Product page (80% confidence)
/\/(products?)\/[^\/]+\/?$/i
// Matches: /product/widget-123, /products/blue-shoes

// Documentation (80% confidence)
/\/(docs?|documentation)\/[^\/]+/i
// Matches: /docs/getting-started, /documentation/api/auth

// Pricing page (90% confidence)
/\/(pricing|plans|packages)\/?$/i
// Matches: /pricing, /plans/, /packages

// Legal pages (95% confidence)
/\/(privacy|privacy-policy|terms|tos)/i
// Matches: /privacy, /privacy-policy, /terms-of-service

// Profile page (85-90% confidence)
/\/@[^\/]+\/?$/i
// Matches: /@username, /@john-doe/

/\/(user|users|member|members)\/[^\/]+/i
// Matches: /user/john, /members/12345

// Date-based news (85% confidence)
/\/\d{4}\/\d{2}\/\d{2}\/[^\/]+/i
// Matches: /2024/01/15/breaking-news

// Listicle (85-90% confidence)
/\/\d+-best[-\/]/i
// Matches: /10-best-tools/, /5-best-practices

// Location page (80-85% confidence)
/\/locations?\/[^\/]+/i
// Matches: /location/new-york, /locations/california

// State-based location (85% confidence)
/\/[a-z-]+-(?:ca|tx|ny|fl|wa)\/?$/i
// Matches: /houston-tx/, /seattle-wa/

Platform-Specific Pattern Examples

// YouTube video (95%)
/youtube\.com\/watch\?v=[A-Za-z0-9_-]+/i
// Matches: youtube.com/watch?v=dQw4w9WgXcQ

// GitHub repo (90%)
/github\.com\/[^\/]+\/[^\/]+\/?$/i
// Matches: github.com/owner/repo

// Reddit thread (95%)
/reddit\.com\/r\/[^\/]+\/comments\//i
// Matches: reddit.com/r/programming/comments/abc123/title

// Twitter/X post (95%)
/(?:^|[\/.])(?:twitter|x)\.com\/[^\/]+\/status\/\d+/i
// Matches: twitter.com/user/status/123456789

// Stack Overflow question (95%)
/stackoverflow\.com\/questions\/\d+/i
// Matches: stackoverflow.com/questions/12345/how-to-do-x

// Medium article (95%)
/medium\.com\/@[^\/]+\/[^\/]+-[a-f0-9]+$/i
// Matches: medium.com/@user/article-title-abc123def

// Amazon product (95%)
/amazon\.com\/dp\/[A-Z0-9]+/i
// Matches: amazon.com/dp/B08N5WRWNW

Source Files

File	Purpose
`src/lib/classification/classifier-rules-engine.js`	Main rules engine
`src/lib/classification/classification-constants.js`	Enums, mappings, helpers
`src/data/domain-database.js`	Generated domain classifications
`src/data/url-patterns/index.js`	Pattern aggregator
`src/data/url-patterns/generic-patterns.js`	Platform-agnostic patterns
`src/data/url-patterns/platforms/*.js`	Platform-specific patterns
`src/data/url-patterns/validation.js`	Pattern validation

Related Documentation

Classification Dimensions - All valid classification values
Domain Classification - Domain classification pipeline
URL Classification - Full classification pipeline
Pipeline Plan - Multi-stage classification overview
