URL Classification Pipeline
Overview
The URL classification pipeline is a 5-stage system that classifies URLs progressively, from the cheapest method to the most expensive. The architecture is designed around cost optimization: most URLs are classified in the free early stages, with expensive LLM calls reserved for the truly ambiguous cases.
Pipeline Philosophy
Cost-Ordered Stages
The pipeline is ordered from free to expensive:
| Stage | Name | Cost | Latency | Purpose |
|---|---|---|---|---|
| 0 | Cache | $0 | ~5ms | Reuse existing domain classifications |
| 1 | Rules | $0 | ~10ms | Pattern matching against 7,000+ known domains |
| 2 | Vectorize | ~$0.00001 | ~50ms | ML similarity to labeled examples |
| 3 | Content | $0-$0.001 | ~500ms | Page fetch and signal extraction |
| 4 | LLM | ~$0.0002 | ~1s | Workers AI classification |
Early Exit Optimization
Each stage can exit early if confidence thresholds are met:
// From classification-config.js
export const CACHE_MIN_CONFIDENCE = 65;
export const PIPELINE_MIN_CONFIDENCE = 65;
export const LEARNING_MIN_CONFIDENCE = 75;
The pipeline stops as soon as:
- A stage produces >= 65% confidence on core dimensions
- We have a valid page_type classification
- The needs_llm flag is NOT set
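Those three conditions can be sketched as a small predicate. This is an illustrative sketch, not the actual implementation; the result shape follows the per-dimension `{value, confidence, source}` format shown below:

```javascript
// Sketch of the early-exit check: stop the pipeline once core dimensions
// are confident enough, a page_type exists, and no LLM escalation is flagged.
const PIPELINE_MIN_CONFIDENCE = 65;

function shouldExitPipeline(result) {
  const coreOk =
    (result.domain_type?.confidence ?? 0) >= PIPELINE_MIN_CONFIDENCE &&
    (result.channel_bucket?.confidence ?? 0) >= PIPELINE_MIN_CONFIDENCE;
  const hasPageType = Boolean(result.page_type?.value);
  return coreOk && hasPageType && !result.needs_llm;
}
```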
Per-Dimension Confidence
Every classification dimension is tracked separately with {value, confidence, source}:
{
domain_type: { value: "saas_product", confidence: 95, source: "domain_database_v2" },
page_type: { value: "blog_post", confidence: 80, source: "url_path_pattern" },
channel_bucket: { value: "owned_content_marketing", confidence: 85, source: "derived_from_domain_type" },
structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" }
}
This allows:
- Targeted improvement of low-confidence dimensions
- Intelligent merging across stages (higher confidence wins)
- Detailed classification audit trails
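The "higher confidence wins" merge rule can be sketched with a small helper (illustrative; the production merge code may differ):

```javascript
// Sketch: merge one dimension across stages, keeping the higher-confidence value.
// Each argument follows the {value, confidence, source} shape.
function mergeDimension(existing, incoming) {
  if (!incoming?.value) return existing;
  if (!existing?.value) return incoming;
  return incoming.confidence > existing.confidence ? incoming : existing;
}
```

For example, an 80%-confidence `url_path_pattern` result for page_type survives a later 60%-confidence Vectorize guess for the same dimension.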
Stage 0: Cache Lookup
Purpose
Check if the domain has already been classified. Since domain-level attributes (domain_type, tier1_type, channel_bucket) are stable across URLs from the same domain, caching provides massive speedups.
Implementation
// From url-classifier.js
if (!skip_domain_cache && env?.DB && input.domain) {
const domainResult = await getCachedDomainClassification(input.domain, env);
if (domainResult && domainResult.property_type) {
// Cache HIT - use domain classification
cachedDomainClassification = domainResult;
result.stages_run.push({
stage: 0,
name: "domain_cache",
hit: true,
property_type: domainResult.property_type,
confidence: domainResult.classification_confidence,
cost: 0,
});
}
}
Cache Hierarchy
- KV Cache (fastest): Check the LOOKUP_CACHE KV namespace first
- D1 Lookup (fallback): Query the domains table on a KV miss
- KV Backfill: Store D1 results in KV for future lookups
// KV cache key format
const cacheKey = `domain-class:${domainHash}`;
// TTL: 24 hours for domain classifications
const DOMAIN_CACHE_TTL = 86400;
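The three-level lookup can be sketched as follows. The `deps.kvGet`/`deps.kvPut`/`deps.d1Lookup`/`deps.hash` helpers are hypothetical stand-ins for the real KV, D1, and hashing calls:

```javascript
// Sketch of the KV-first, D1-fallback, KV-backfill hierarchy described above.
const DOMAIN_CACHE_TTL = 86400; // 24 hours

async function getCachedDomainClassification(domain, deps) {
  const key = `domain-class:${deps.hash(domain)}`;
  // 1. KV cache (fastest)
  const cached = await deps.kvGet(key);
  if (cached) return cached;
  // 2. D1 fallback on KV miss
  const row = await deps.d1Lookup(domain);
  if (!row) return null;
  // 3. Backfill KV so future lookups skip D1
  await deps.kvPut(key, row, { expirationTtl: DOMAIN_CACHE_TTL });
  return row;
}
```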
When to Skip Cache
The skip_domain_cache option bypasses cache when:
- Force-reclassifying a domain
- Testing classification changes
- Domain was recently corrected
Stage 1: Rules Engine
Purpose
Fast, free classification using deterministic pattern matching. This stage handles ~70% of all URLs without any external API calls.
Rule Categories
Rule 1: Owned Domain Detection (100% confidence)
if (target_domain && isOwnedDomain(normalizedDomain, target_domain)) {
result.channel_bucket = {
value: CHANNEL_BUCKETS.OWNED_BRAND_SITE,
confidence: 100,
source: "owned_domain_match",
};
}
Detects:
- Exact match: company.com === company.com
- Subdomain: blog.company.com belongs to company.com
- Common variants: getcompany.com, company.io, trycompany.com
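A minimal `isOwnedDomain` covering those three cases might look like this (a sketch; the variant prefixes and TLDs here are illustrative, and a production version would regex-escape the brand name):

```javascript
// Sketch: detect exact match, subdomain, and common brand variants.
function isOwnedDomain(candidate, target) {
  if (candidate === target) return true;              // company.com === company.com
  if (candidate.endsWith(`.${target}`)) return true;  // blog.company.com
  const base = target.split(".")[0];                  // "company"
  // Illustrative variant list: get/try/use prefixes, alternate TLDs
  const variantRe = new RegExp(`^(get|try|use)?${base}\\.(com|io|co|app)$`);
  return variantRe.test(candidate);                   // getcompany.com, company.io
}
```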
Rule 2: Domain Database (95% confidence)
A curated database of 7,000+ domains with known classifications:
const v2Classification = getDomainClassification(normalizedDomain);
if (v2Classification) {
result.domain_type = {
value: normalizedDomainType,
confidence: 95,
source: "domain_database_v2",
};
}
Database includes major sites like:
- News: nytimes.com, techcrunch.com, theverge.com
- SaaS: notion.so, slack.com, github.com
- E-commerce: amazon.com, shopify.com, etsy.com
- Social: twitter.com, linkedin.com, reddit.com
Rule 3: TLD-Based Rules (95% confidence)
const TLD_RULES = {
".gov": DOMAIN_TYPES.GOVERNMENT_SITE,
".edu": DOMAIN_TYPES.EDUCATION_ACADEMIC,
".mil": DOMAIN_TYPES.GOVERNMENT_SITE,
".org": null, // Needs further classification
".ac.uk": DOMAIN_TYPES.EDUCATION_ACADEMIC,
};
Rule 4: Platform Detection (85% confidence)
Detects known platforms from URL patterns:
const PLATFORM_URL_PATTERNS = {
"reddit.com": {
domain_type: "forum_community",
page_type_regex: [
{ pattern: /\/r\/[^/]+\/comments\//, page_type: "forum_thread" },
{ pattern: /\/r\/[^/]+\/?$/, page_type: "category_index_page" },
]
},
"github.com": {
domain_type: "code_repository",
page_type_regex: [
{ pattern: /\/[^/]+\/[^/]+\/?$/, page_type: "repository_page" },
{ pattern: /\/issues\/\d+/, page_type: "forum_thread" },
]
}
};
Rule 5: URL Path Patterns (70-85% confidence)
Generic path patterns that indicate page type:
| Pattern | Page Type | Confidence |
|---|---|---|
| /blog/, /posts/ | blog_post | 85% |
| /docs/, /documentation/ | documentation_page | 90% |
| /product/, /products/ | product_page | 85% |
| /about, /about-us | about_page | 90% |
| /pricing | pricing_page | 85% |
| /contact | contact_page | 90% |
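Applying the table is a first-match scan over ordered patterns. A sketch (the patterns mirror the table above; the helper name is illustrative):

```javascript
// Sketch: first matching path pattern wins, so order the list by specificity.
const PATH_PATTERNS = [
  { pattern: /\/(blog|posts)\//i, page_type: "blog_post", confidence: 85 },
  { pattern: /\/(docs|documentation)\//i, page_type: "documentation_page", confidence: 90 },
  { pattern: /\/products?\//i, page_type: "product_page", confidence: 85 },
  { pattern: /\/about(-us)?\b/i, page_type: "about_page", confidence: 90 },
  { pattern: /\/pricing\b/i, page_type: "pricing_page", confidence: 85 },
  { pattern: /\/contact\b/i, page_type: "contact_page", confidence: 90 },
];

function classifyPath(pathname) {
  for (const rule of PATH_PATTERNS) {
    if (rule.pattern.test(pathname)) {
      return { value: rule.page_type, confidence: rule.confidence, source: "url_path_pattern" };
    }
  }
  return null; // no pattern matched; later stages take over
}
```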
Rule 6: Spam/Risk Detection (80-90% confidence)
Detects risky URLs that should be flagged:
// Comment spam patterns
if (/#comment-\d+/i.test(url) || /\?replytocom=/i.test(url)) {
return { structural_type: "spam", confidence: 90 };
}
// Suspicious TLDs
const SUSPICIOUS_TLDS = [".tk", ".ml", ".ga", ".cf", ".gq", ".xyz"];
// PBN indicators
const PBN_PATTERNS = [
/^[a-z]+-[a-z]+-[a-z]+-[a-z]+\./, // keyword-stuffed domains
/\d{5,}\./, // long number sequences
];
Structural Type Detection
Critical Architecture: The rules engine classifies structural_type FIRST, then constrains page_type to valid children of that structural type.
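The constraint can be sketched as a parent-to-children map. The value lists below are illustrative samples, not the full production enums:

```javascript
// Sketch: structural_type constrains which page_type values are valid.
// Illustrative subset of the real enum tree.
const PAGE_TYPES_BY_STRUCTURAL = {
  article: ["blog_post", "news_article", "howto_article", "press_release"],
  detail: ["product_page", "repository_page", "pricing_page"],
  index: ["category_index_page", "blog_index_page"],
  thread: ["forum_thread"],
};

function constrainPageType(structuralType, pageType) {
  const allowed = PAGE_TYPES_BY_STRUCTURAL[structuralType] || [];
  return allowed.includes(pageType) ? pageType : null;
}
```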
Stage 2: Vectorize (ML Similarity)
Purpose
Find similar URLs from a labeled index and vote on classification. This stage catches URLs that don't match explicit rules but are similar to known examples.
How It Works
Feature Text Generation
URLs are converted to searchable text that captures their characteristics:
function urlToFeatureText(input) {
const parts = [];
// Domain info
parts.push(`domain: ${normalizedDomain}`);
parts.push(`tld: ${tld}`);
// URL path components
parts.push(`path: ${pathParts.join(" ")}`);
// Page title (huge signal)
if (effectiveTitle) {
parts.push(`title: ${cleanTitle}`);
}
// SERP snippet
if (serp_description) {
parts.push(`snippet: ${cleanDesc}`);
}
// Authority level
if (domain_rank >= 80) parts.push("authority: tier1");
return parts.join(" | ");
}
Similarity Thresholds
Different fields require different similarity scores:
const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75; // Domain type needs high similarity
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80; // Page type needs very high similarity
const MIN_SIMILARITY_FOR_ANY = 0.65; // Absolute minimum to use any result
Voting Algorithm
function voteOnClassifications(matches, field) {
  const votes = {};
  for (const match of matches) {
    const classifications = match.metadata?.classifications || {};
    const value = classifications[field];
    if (value) {
      // Weight each vote by the match's similarity score
      votes[value] = (votes[value] || 0) + match.score;
    }
  }
  // Winner is the [value, weightedScore] pair with the highest total
  return Object.entries(votes)
    .sort((a, b) => b[1] - a[1])[0];
}
Self-Reinforcement
High-confidence Vectorize results reinforce themselves:
// When Vectorize confidence >= 85%, prepare feedback
if (result.final_confidence >= 85) {
result.vectorize_feedback = {
url: input.url,
domain: input.domain,
classifications: { ... },
source: "vectorize_reinforcement",
};
}
Stage 3: Content Analysis
Purpose
Fetch actual page content and extract classification signals. This stage is expensive (requires HTTP requests) but provides rich signals that rules and Vectorize cannot.
Fetch Strategy
Content is fetched using a tiered fallback system: a native fetch is tried first, with fallbacks such as ZenRows when a site blocks direct requests.
Per-Domain Throttling
To avoid hammering individual sites, fetches are throttled per-domain:
return await withDomainThrottle(
domain,
env,
async () => {
return await _fetchWithFallbacksInner(url, domain, env, options);
},
{ maxWaitMs: 1000, retries: 3, skipOnLimit: false }
);
Fetch Method Caching
Successful fetch methods are cached to skip failed approaches:
// If native fetch was blocked before, skip to ZenRows
const cachedMethod = await getCachedFetchMethod(env, domain);
const skipNative = cachedMethod === "zenrows";
// Cache successful method for 7 days
await cacheFetchMethod(env, domain, "native");
Signal Extraction
From raw HTML, we extract classification signals:
function extractSignalsFromHtml(html, url) {
const signals = {};
// Core metadata
signals.title = extractTitle(html);
signals.description = extractMetaDescription(html);
signals.word_count = countWords(html);
signals.h1_tags = extractH1Tags(html);
// Affiliate/Commercial signals
signals.has_affiliate_links = /affiliate|\?ref=|\?tag=/i.test(html);
signals.has_sponsored_disclosure = /sponsored\s*content/i.test(html);
signals.has_price_mentions = /\$\d+/i.test(html);
signals.has_buy_cta = /buy now|shop now|get started/i.test(html);
// Review/Comparison signals
signals.has_rating_schema = /"@type"\s*:\s*"Review"/i.test(html);
signals.has_pros_cons = /\b(pros|cons)\b/i.test(html);
signals.has_comparison_table = /\b(vs|versus|compare)\b/i.test(html);
// Expert signals
signals.has_expert_byline = /\b(ph\.?d|m\.?d|ceo|founder)\b/i.test(html);
signals.has_data_citations = /according to|study shows/i.test(html);
// Schema.org types
signals.schema_types = extractSchemaTypes(html);
return signals;
}
Signal-Based Page Type Inference
function inferPageTypeFromContent(signals) {
if (signals.is_press_release_format) return "press_release";
if (signals.has_rating_schema || signals.has_pros_cons) {
if (signals.has_comparison_table) return "comparison_page";
return "review_page";
}
if (signals.schema_types?.includes("faq")) return "faq_page";
if (signals.schema_types?.includes("howto")) return "howto_article";
if (signals.schema_types?.includes("article")) return "blog_post";
return null;
}
SERP Data Optimization
For URLs with SERP data from ranked keywords, content fetch can be skipped:
// Skip expensive content fetch when we have SERP data
skip_content: skip_content || !!(serp_title || serp_description),
This saves ~$0.001/URL for ranking URLs while still providing classification context.
Stage 4: LLM Classification
Purpose
Final fallback using Workers AI (Llama 3.3 70B) for URLs that couldn't be classified with sufficient confidence by earlier stages.
When LLM is Triggered
// LLM needed if:
// 1. No domain_type or very low confidence
result.needs_llm = !result.domain_type.value || result.domain_type.confidence < 50;
// 2. Content parsing didn't help enough
if (result.confidence < 60) {
stage1Result.needs_llm = true;
}
// 3. Rules engine explicitly flagged it
if (stage1Result.needs_llm) {
// Domain type couldn't be determined from patterns
}
Budget Enforcement
LLM calls are budget-capped to prevent cost spirals:
const MAX_DAILY_LLM_BUDGET_USD = 50.0;
const dailyCost = await getDailyServiceCost(env, COST_SERVICES.CF_WORKERS_AI);
if (dailyCost >= MAX_DAILY_LLM_BUDGET_USD) {
return {
skipped: true,
skip_reason: "daily_budget_exceeded",
needs_review: true,
};
}
System Prompt Architecture
The LLM is given a tree-based classification prompt:
=== CLASSIFICATION APPROACH: TWO-LEVEL TREE ===
CRITICAL ORDER:
1. tier1_type FIRST -> constrains domain_type
2. structural_type FIRST -> constrains page_type
=== TIER1_TYPE (7 universal archetypes) ===
- "platform" - SaaS, apps, tools
- "marketplace" - Multi-sided listings
- "commerce" - Direct retail, D2C
- "service" - Sells services
- "information" - Content publishing
- "community" - Forums, social, UGC
- "institutional" - Government, education
Few-Shot Examples
The prompt includes examples to guide classification:
**Example 1: Code Repository**
URL: https://github.com/anthropics/claude-code
-> {"tier1_type": "platform", "domain_type": "code_repository",
"structural_type": "detail", "page_type": "repository_page"}
**Example 2: News Article**
URL: https://www.nytimes.com/2024/01/15/technology/ai-advances.html
-> {"tier1_type": "information", "domain_type": "news_publisher",
"structural_type": "article", "page_type": "news_article"}
Response Parsing and Normalization
LLM outputs are normalized to valid enum values:
const PAGE_TYPE_ALIASES = {
"terms_of_use_page": "legal_terms_page",
"article_page": "blog_post",
"product_detail_page": "product_page",
"application_page": "signup_page",
// ... 50+ aliases
};
function normalizeEnumValue(value, enumObj, aliases) {
const normalized = value.toLowerCase().trim();
// Check exact match
if (isValidEnumValue(normalized, enumObj)) return normalized;
// Check aliases
if (aliases[normalized]) return aliases[normalized];
// Fuzzy matching
return findClosestMatch(normalized, enumObj);
}
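The fuzzy fallback could be as simple as a longest-common-prefix score. This sketch takes a plain array of enum values for simplicity; the real `findClosestMatch` works against the enum object and may be more sophisticated:

```javascript
// Sketch: pick the enum value sharing the longest common prefix with the input.
function findClosestMatch(value, enumValues) {
  let best = null;
  let bestScore = 0;
  for (const candidate of enumValues) {
    let score = 0;
    while (score < value.length && score < candidate.length && value[score] === candidate[score]) {
      score++;
    }
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best; // null when nothing shares even one leading character
}
```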
Validation Against Constraints
LLM outputs are validated against the tree structure:
// Validate page_type is valid for structural_type
if (!isPageTypeValidForStructural(pageTypeValue, structuralType)) {
const validPageTypes = getValidPageTypesForStructural(structuralType);
const corrected = validPageTypes.find(pt => pt.includes(pageTypeValue.split("_")[0]));
if (corrected) {
pageTypeValue = corrected;
}
}
// Validate domain_type is valid for tier1_type
const validTypes = DOMAIN_TYPES_BY_TIER1[result.tier1_type] || [];
if (!validTypes.includes(result.domain_type)) {
// Find closest valid type
}
Risky Blackhat Rejection
LLM sometimes over-classifies as spam. This is caught and corrected:
if (proposedChannel === "risky_blackhat") {
const isActuallySpammy = isSpammyDomain(url, domain);
if (!isActuallySpammy) {
console.log(`Rejected risky_blackhat for legitimate domain: ${domain}`);
// Derive channel from domain_type instead
result.channel_bucket = deriveChannelFromDomainType(domainType);
}
}
Self-Learning System
Overview
The pipeline gets smarter over time through a feedback loop: high-confidence classifications are written back to the Vectorize index, so Stage 2 recognizes similar URLs in the future.
Learning Triggers
// From classification-config.js
export const LEARNING_MIN_CONFIDENCE = 75;
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;
// URL learning
if (classification.final_confidence >= LEARNING_MIN_CONFIDENCE) {
await learnFromClassification(input, classification, confidence, env);
}
// Domain learning (stricter threshold)
if (shouldTriggerLearning(confidence, "domain")) {
// Only at 80%+ confidence
}
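A plausible shape for `shouldTriggerLearning`, using the thresholds from classification-config.js (a sketch; the real function may take different arguments):

```javascript
// Sketch: scope-aware learning gate. Domain-level learning uses a stricter
// threshold because one domain classification affects many URLs.
const LEARNING_MIN_CONFIDENCE = 75;
const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;

function shouldTriggerLearning(confidence, scope = "url") {
  const threshold =
    scope === "domain" ? DOMAIN_LEARNING_MIN_CONFIDENCE : LEARNING_MIN_CONFIDENCE;
  return confidence >= threshold;
}
```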
What Gets Learned
Not all classifications feed the learning loop:
export async function learnFromClassification(input, classification, confidence, env) {
// Skip rule-only classifications (already deterministic)
if (classification.source === "rule") {
return { learned: false, reason: "rule_based" };
}
// Skip spam classifications (would pollute index)
if (domainType === "pbn_suspected" || domainType === "spam_low_quality") {
return { learned: false, reason: "spam_classification" };
}
// Add to Vectorize
const result = await addClassifiedUrl(input, classification, env, "learned");
return { learned: true, ...result };
}
Vectorize Index Updates
async function addClassifiedUrl(input, classification, env, source = "llm") {
// Convert to feature text
const featureText = urlToFeatureText({
url,
domain,
partial_classification: classification,
});
// Generate embedding
const embedding = await getEmbedding(featureText, env);
// Upsert to Vectorize
await upsertVector(
id,
embedding,
{
url,
domain,
classifications: classification,
source,
indexed_at: Date.now(),
},
env
);
}
Confidence Thresholds
Threshold Alignment Philosophy
All thresholds are centralized in classification-config.js with clear rationale:
/**
* RATIONALE FOR THRESHOLD ALIGNMENT:
* - Cache & Pipeline: Same threshold (65%) so cached results behave identically
* to fresh pipeline runs - no "gap" where cache accepts what pipeline wouldn't
* - Learning: Higher threshold (75%) to avoid polluting Vectorize with uncertain
* classifications that could propagate errors to future predictions
* - Domain Learning: Highest threshold (80%) because domain-level classifications
* affect many URLs and errors are more costly
*/
export const CACHE_MIN_CONFIDENCE = 65;
export const PIPELINE_MIN_CONFIDENCE = 65;
export const LEARNING_MIN_CONFIDENCE = 75;
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;
Per-Stage Thresholds
| Stage | Exit Threshold | Learning Threshold |
|---|---|---|
| 0: Cache | >= 65% | N/A |
| 1: Rules | >= 80% + page_type | >= 95% |
| 2: Vectorize | >= 70% | >= 85% |
| 3: Content | >= 60% | N/A |
| 4: LLM | Always exits | >= 75% |
Confidence Calculation
Final confidence is the MAX of all dimension confidences:
function calculateFinalConfidence(classification) {
const confidences = [
getConfidence(classification.domain_type),
getConfidence(classification.tier1_type),
getConfidence(classification.channel_bucket),
getConfidence(classification.page_type),
getConfidence(classification.quality_tier),
].filter(c => c > 0);
return confidences.length > 0 ? Math.max(...confidences) : 0;
}
Cost Optimization
Cost Breakdown
| Stage | Cost per URL | Typical Hit Rate | Expected Cost |
|---|---|---|---|
| Cache | $0 | 40% | $0 |
| Rules | $0 | 50% | $0 |
| Vectorize | $0.00001 | 30% | $0.000003 |
| Content | $0.000125 | 25% | $0.00003 |
| LLM | $0.0002 | 5% | $0.00001 |
Expected cost per URL: ~$0.00004 (assuming typical distribution)
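The blended figure follows directly from cost x hit rate per paid stage:

```javascript
// Expected cost per URL = sum of (stage cost x fraction of URLs reaching that paid stage).
// Cache and Rules contribute $0 and are omitted.
const paidStages = [
  { name: "vectorize", cost: 0.00001, hitRate: 0.30 },
  { name: "content", cost: 0.000125, hitRate: 0.25 },
  { name: "llm", cost: 0.0002, hitRate: 0.05 },
];
const expectedCostPerUrl = paidStages.reduce((sum, s) => sum + s.cost * s.hitRate, 0);
// 0.000003 + 0.00003125 + 0.00001 = ~$0.000044 per URL
```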
Budget Tracking
All costs are tracked via the cost tracking service:
await trackCost(env, {
service: COST_SERVICES.CF_WORKERS_AI,
cost_usd: 0.0002,
units: 1,
metadata: { url, model: LLM_MODEL },
});
await trackCost(env, {
service: COST_SERVICES.DATAFORSEO_INSTANT_PAGES,
cost_usd: actualCost,
units: 1,
metadata: { url },
});
Daily Budget Caps
LLM calls are capped at $50/day:
const MAX_DAILY_LLM_BUDGET_USD = 50.0;
if (dailyCost >= MAX_DAILY_LLM_BUDGET_USD) {
return { skipped: true, skip_reason: "daily_budget_exceeded" };
}
Optimization Strategies
- SERP Data Bypass: Skip content fetch for URLs with SERP data
- Fetch Method Caching: Remember which domains need ZenRows
- Domain Batch Processing: Classify domain once, apply to all URLs
- Early Exit: Stop pipeline as soon as confidence threshold met
Queue Processing
Message Types
The URL classify queue handles multiple message types:
// Classify a single URL
{ type: "classify_url", url_id, url, domain, domain_rank }
// Classify a backlink (source + target URL)
{ type: "classify_backlink", backlink_id, source_url_id, source_url, target_url }
// Batch classify all URLs from a domain
{ type: "classify_domain_batch", domain, domain_rank, urls: [...] }
Parallel Processing
Messages are processed in parallel for throughput:
const processingPromises = reorderedMessages.map(async (message) => {
// Process each message
});
const processingResults = await Promise.allSettled(processingPromises);
Domain-Diverse Ordering
Messages are reordered to maximize domain diversity:
function reorderForDomainDiversity(messages) {
// Instead of: [a.com, a.com, a.com, b.com, b.com, c.com]
// We process: [a.com, b.com, c.com, a.com, b.com, a.com]
// This spreads load and reduces per-domain throttle contention
}
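One way to implement this is a round-robin interleave over per-domain groups (a sketch; the production function and message shape may differ):

```javascript
// Sketch: group messages by domain, then emit one message per domain per round.
// [a, a, a, b, b, c] -> [a, b, c, a, b, a]
function reorderForDomainDiversity(messages, getDomain = (m) => m.domain) {
  const byDomain = new Map();
  for (const msg of messages) {
    const d = getDomain(msg);
    if (!byDomain.has(d)) byDomain.set(d, []);
    byDomain.get(d).push(msg);
  }
  const groups = [...byDomain.values()];
  const out = [];
  for (let round = 0; out.length < messages.length; round++) {
    for (const group of groups) {
      if (round < group.length) out.push(group[round]);
    }
  }
  return out;
}
```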
Retry Logic
Failed messages use exponential backoff:
if (shouldRetry && attempts < MAX_QUEUE_RETRIES) {
message.retry({
delaySeconds: Math.pow(2, attempts) * 5, // 5s, 10s, 20s
});
}
Files Reference
| File | Purpose |
|---|---|
url-classifier.js | Main orchestrator - runs all 5 stages |
classifier-rules-engine.js | Stage 1 - Pattern matching |
classifier-vectorize.js | Stage 2 - ML similarity |
classifier-content-parser.js | Stage 3 - Content fetch & signals |
classifier-llm.js | Stage 4 - LLM classification |
classification-config.js | Thresholds and configuration |
classification-constants.js | Enums and valid values |
url-classify-consumer.js | Queue consumer |