URL Classification Pipeline

Overview

The URL classification pipeline is a 5-stage system that classifies URLs progressively, applying the cheapest methods first. The architecture is designed around cost optimization: most URLs are classified in the free early stages, and expensive LLM calls are reserved for truly ambiguous cases.


Pipeline Philosophy

Cost-Ordered Stages

The pipeline is ordered from free to expensive:

| Stage | Name | Cost | Latency | Purpose |
|---|---|---|---|---|
| 0 | Cache | $0 | ~5ms | Reuse existing domain classifications |
| 1 | Rules | $0 | ~10ms | Pattern matching against 7,000+ known domains |
| 2 | Vectorize | ~$0.00001 | ~50ms | ML similarity to labeled examples |
| 3 | Content | $0-$0.001 | ~500ms | Page fetch and signal extraction |
| 4 | LLM | ~$0.0002 | ~1s | Workers AI classification |

Early Exit Optimization

Each stage can exit early if confidence thresholds are met:

// From classification-config.js
export const CACHE_MIN_CONFIDENCE = 65;
export const PIPELINE_MIN_CONFIDENCE = 65;
export const LEARNING_MIN_CONFIDENCE = 75;

The pipeline stops as soon as:

  1. A stage produces >= 65% confidence on core dimensions
  2. We have a valid page_type classification
  3. The needs_llm flag is NOT set
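As a sketch, the exit check might look like the following (`shouldExitEarly` and the exact result shape are illustrative, not taken from url-classifier.js):

```javascript
const PIPELINE_MIN_CONFIDENCE = 65;

// Illustrative sketch of the early-exit check: core dimensions must clear
// the threshold, page_type must be set, and no stage may have flagged the LLM.
function shouldExitEarly(result) {
  const coreConfident =
    (result.domain_type?.confidence ?? 0) >= PIPELINE_MIN_CONFIDENCE &&
    (result.page_type?.confidence ?? 0) >= PIPELINE_MIN_CONFIDENCE;
  return coreConfident && Boolean(result.page_type?.value) && !result.needs_llm;
}
```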

Per-Dimension Confidence

Every classification dimension is tracked separately with {value, confidence, source}:

{
  domain_type: { value: "saas_product", confidence: 95, source: "domain_database_v2" },
  page_type: { value: "blog_post", confidence: 80, source: "url_path_pattern" },
  channel_bucket: { value: "owned_content_marketing", confidence: 85, source: "derived_from_domain_type" },
  structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" }
}

This allows:

  • Targeted improvement of low-confidence dimensions
  • Intelligent merging across stages (higher confidence wins)
  • Detailed classification audit trails
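A minimal sketch of such a merge, assuming the `{value, confidence, source}` shape shown above (`mergeClassifications` is an illustrative name, not the real function):

```javascript
// Illustrative sketch of per-dimension merging: for each dimension,
// the higher-confidence {value, confidence, source} entry wins.
function mergeClassifications(existing, incoming) {
  const merged = { ...existing };
  for (const [dim, entry] of Object.entries(incoming)) {
    const current = merged[dim];
    if (!current || (entry?.confidence ?? 0) > (current.confidence ?? 0)) {
      merged[dim] = entry;
    }
  }
  return merged;
}
```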

Stage 0: Cache Lookup

Purpose

Check if the domain has already been classified. Since domain-level attributes (domain_type, tier1_type, channel_bucket) are stable across URLs from the same domain, caching provides massive speedups.

Implementation

// From url-classifier.js
if (!skip_domain_cache && env?.DB && input.domain) {
  const domainResult = await getCachedDomainClassification(input.domain, env);

  if (domainResult && domainResult.property_type) {
    // Cache HIT - use the domain-level classification
    cachedDomainClassification = domainResult;
    result.stages_run.push({
      stage: 0,
      name: "domain_cache",
      hit: true,
      property_type: domainResult.property_type,
      confidence: domainResult.classification_confidence,
      cost: 0,
    });
  }
}

Cache Hierarchy

  1. KV Cache (fastest): Check LOOKUP_CACHE KV namespace first
  2. D1 Lookup (fallback): Query domains table if KV miss
  3. KV Backfill: Store D1 results in KV for future lookups

// KV cache key format
const cacheKey = `domain-class:${domainHash}`;

// TTL: 24 hours for domain classifications
const DOMAIN_CACHE_TTL = 86400;

When to Skip Cache

The skip_domain_cache option bypasses cache when:

  • Force-reclassifying a domain
  • Testing classification changes
  • Domain was recently corrected

Stage 1: Rules Engine

Purpose

Fast, free classification using deterministic pattern matching. This stage handles ~70% of all URLs without any external API calls.

Rule Categories

Rule 1: Owned Domain Detection (100% confidence)

if (target_domain && isOwnedDomain(normalizedDomain, target_domain)) {
  result.channel_bucket = {
    value: CHANNEL_BUCKETS.OWNED_BRAND_SITE,
    confidence: 100,
    source: "owned_domain_match",
  };
}

Detects:

  • Exact match: company.com === company.com
  • Subdomain: blog.company.com belongs to company.com
  • Common variants: getcompany.com, company.io, trycompany.com
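A sketch of how those three match types could be checked (`isOwnedDomainSketch` is hypothetical; the real `isOwnedDomain` may differ):

```javascript
// Illustrative sketch of owned-domain detection: exact match,
// subdomain match, and common brand variants on any TLD.
function isOwnedDomainSketch(candidate, target) {
  if (candidate === target) return true;             // company.com === company.com
  if (candidate.endsWith("." + target)) return true; // blog.company.com
  const brand = target.split(".")[0];                // "company"
  const [candBase, ...candTld] = candidate.split(".");
  const variants = [brand, `get${brand}`, `try${brand}`]; // company.io, getcompany.com, ...
  return variants.includes(candBase) && candTld.length > 0;
}
```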

Rule 2: Domain Database (95% confidence)

A curated database of 7,000+ domains with known classifications:

const v2Classification = getDomainClassification(normalizedDomain);
if (v2Classification) {
  result.domain_type = {
    value: normalizedDomainType,
    confidence: 95,
    source: "domain_database_v2",
  };
}

Database includes major sites like:

  • News: nytimes.com, techcrunch.com, theverge.com
  • SaaS: notion.so, slack.com, github.com
  • E-commerce: amazon.com, shopify.com, etsy.com
  • Social: twitter.com, linkedin.com, reddit.com

Rule 3: TLD-Based Rules (95% confidence)

const TLD_RULES = {
  ".gov": DOMAIN_TYPES.GOVERNMENT_SITE,
  ".edu": DOMAIN_TYPES.EDUCATION_ACADEMIC,
  ".mil": DOMAIN_TYPES.GOVERNMENT_SITE,
  ".org": null, // Needs further classification
  ".ac.uk": DOMAIN_TYPES.EDUCATION_ACADEMIC,
};
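One subtlety is that multi-part suffixes like `.ac.uk` must win over shorter ones, so matching should prefer the longest rule suffix. A sketch (plain strings stand in for the `DOMAIN_TYPES` enum; `classifyByTld` is an illustrative name):

```javascript
const TLD_RULES = {
  ".gov": "government_site",
  ".edu": "education_academic",
  ".mil": "government_site",
  ".org": null, // needs further classification
  ".ac.uk": "education_academic",
};

// Illustrative sketch: check the longest rule suffix first so that
// "cam.ac.uk" hits ".ac.uk" rather than any shorter suffix.
function classifyByTld(domain) {
  const suffixes = Object.keys(TLD_RULES).sort((a, b) => b.length - a.length);
  const hit = suffixes.find((s) => domain.endsWith(s));
  return hit !== undefined ? TLD_RULES[hit] : undefined;
}
```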

Rule 4: Platform Detection (85% confidence)

Detects known platforms from URL patterns:

const PLATFORM_URL_PATTERNS = {
  "reddit.com": {
    domain_type: "forum_community",
    page_type_regex: [
      { pattern: /\/r\/[^/]+\/comments\//, page_type: "forum_thread" },
      { pattern: /\/r\/[^/]+\/?$/, page_type: "category_index_page" },
    ],
  },
  "github.com": {
    domain_type: "code_repository",
    page_type_regex: [
      { pattern: /\/[^/]+\/[^/]+\/?$/, page_type: "repository_page" },
      { pattern: /\/issues\/\d+/, page_type: "forum_thread" },
    ],
  },
};

Rule 5: URL Path Patterns (70-85% confidence)

Generic path patterns that indicate page type:

| Pattern | Page Type | Confidence |
|---|---|---|
| /blog/, /posts/ | blog_post | 85% |
| /docs/, /documentation/ | documentation_page | 90% |
| /product/, /products/ | product_page | 85% |
| /about, /about-us | about_page | 90% |
| /pricing | pricing_page | 85% |
| /contact | contact_page | 90% |
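The table above could be encoded as an ordered list of regex rules; a sketch (the regexes and helper name are illustrative, not the real implementation):

```javascript
// Illustrative encoding of the path-pattern table: first match wins.
const PATH_PATTERNS = [
  { pattern: /\/(blog|posts)\//i, page_type: "blog_post", confidence: 85 },
  { pattern: /\/(docs|documentation)\//i, page_type: "documentation_page", confidence: 90 },
  { pattern: /\/products?\//i, page_type: "product_page", confidence: 85 },
  { pattern: /\/about(-us)?\b/i, page_type: "about_page", confidence: 90 },
  { pattern: /\/pricing\b/i, page_type: "pricing_page", confidence: 85 },
  { pattern: /\/contact\b/i, page_type: "contact_page", confidence: 90 },
];

function matchPathPattern(path) {
  return PATH_PATTERNS.find((r) => r.pattern.test(path)) || null;
}
```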

Rule 6: Spam/Risk Detection (80-90% confidence)

Detects risky URLs that should be flagged:

// Comment spam patterns
if (/#comment-\d+/i.test(url) || /\?replytocom=/i.test(url)) {
  return { structural_type: "spam", confidence: 90 };
}

// Suspicious TLDs
const SUSPICIOUS_TLDS = [".tk", ".ml", ".ga", ".cf", ".gq", ".xyz"];

// PBN indicators
const PBN_PATTERNS = [
  /^[a-z]+-[a-z]+-[a-z]+-[a-z]+\./, // keyword-stuffed domains
  /\d{5,}\./, // long number sequences
];

Structural Type Detection

Critical Architecture: The rules engine classifies structural_type FIRST, then constrains page_type to valid children.
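
A sketch of what such a constraint could look like (the mapping below is a small hypothetical subset, not the real tree):

```javascript
// Hypothetical subset of the structural_type -> page_type tree.
const PAGE_TYPES_BY_STRUCTURAL = {
  article: ["blog_post", "news_article", "howto_article", "review_page"],
  detail: ["product_page", "repository_page", "pricing_page"],
  index: ["category_index_page", "homepage"],
};

// A page_type is only kept if it is a valid child of the structural_type;
// otherwise it is dropped so a later stage can fill it in.
function constrainPageType(structuralType, pageType) {
  const valid = PAGE_TYPES_BY_STRUCTURAL[structuralType] || [];
  return valid.includes(pageType) ? pageType : null;
}
```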


Stage 2: Vectorize (ML Similarity)

Purpose

Find similar URLs from a labeled index and vote on classification. This stage catches URLs that don't match explicit rules but are similar to known examples.

How It Works

Feature Text Generation

URLs are converted to searchable text that captures their characteristics:

function urlToFeatureText(input) {
  const parts = [];

  // Domain info
  parts.push(`domain: ${normalizedDomain}`);
  parts.push(`tld: ${tld}`);

  // URL path components
  parts.push(`path: ${pathParts.join(" ")}`);

  // Page title (huge signal)
  if (effectiveTitle) {
    parts.push(`title: ${cleanTitle}`);
  }

  // SERP snippet
  if (serp_description) {
    parts.push(`snippet: ${cleanDesc}`);
  }

  // Authority level
  if (domain_rank >= 80) parts.push("authority: tier1");

  return parts.join(" | ");
}

Similarity Thresholds

Different fields require different similarity scores:

const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75;  // Domain type needs high similarity
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80; // Page type needs very high similarity
const MIN_SIMILARITY_FOR_ANY = 0.65; // Absolute minimum to use any result
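Applying these thresholds might look like the following sketch (`acceptField` is an illustrative helper, not the real code):

```javascript
const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75;
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80;
const MIN_SIMILARITY_FOR_ANY = 0.65;

// Illustrative sketch: a field's winning vote is only accepted when the
// best match's similarity clears that field's threshold.
function acceptField(field, bestScore) {
  const minForField =
    field === "page_type" ? MIN_SIMILARITY_FOR_PAGE_TYPE :
    field === "domain_type" ? MIN_SIMILARITY_FOR_DOMAIN_TYPE :
    MIN_SIMILARITY_FOR_ANY;
  return bestScore >= minForField;
}
```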

Voting Algorithm

function voteOnClassifications(matches, field) {
  const votes = {};

  for (const match of matches) {
    const classifications = match.metadata?.classifications || {};
    const value = classifications[field];
    if (value) {
      // Weight each vote by the match's similarity score
      votes[value] = (votes[value] || 0) + match.score;
    }
  }

  // Find the winner: the [value, weight] pair with the highest total weight
  return Object.entries(votes)
    .sort((a, b) => b[1] - a[1])[0];
}

Self-Reinforcement

High-confidence Vectorize results reinforce themselves:

// When Vectorize confidence >= 85%, prepare feedback
if (result.final_confidence >= 85) {
result.vectorize_feedback = {
url: input.url,
domain: input.domain,
classifications: { ... },
source: "vectorize_reinforcement",
};
}

Stage 3: Content Analysis

Purpose

Fetch actual page content and extract classification signals. This stage is expensive (requires HTTP requests) but provides rich signals that rules and Vectorize cannot.

Fetch Strategy

Content is fetched using a tiered fallback system: cheap native fetch is tried first, with ZenRows as the fallback for domains that block it.
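
A sketch of such a tiered strategy, assuming an ordered list of fetcher tiers (the shape is illustrative; the real `_fetchWithFallbacksInner` may differ):

```javascript
// Illustrative sketch of a tiered fetch: try tiers from cheapest to most
// expensive (e.g. native fetch, then a rendering proxy such as ZenRows).
async function fetchWithFallbacks(url, fetchers) {
  for (const tier of fetchers) {
    try {
      const res = await tier.fetch(url);
      if (res.ok) return { html: await res.text(), method: tier.name };
    } catch {
      // This tier failed entirely; fall through to the next one
    }
  }
  return { html: null, method: null };
}
```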

Per-Domain Throttling

To avoid hammering individual sites, fetches are throttled per-domain:

return await withDomainThrottle(
  domain,
  env,
  async () => {
    return await _fetchWithFallbacksInner(url, domain, env, options);
  },
  { maxWaitMs: 1000, retries: 3, skipOnLimit: false }
);

Fetch Method Caching

Successful fetch methods are cached to skip failed approaches:

// If native fetch was blocked before, skip to ZenRows
const cachedMethod = await getCachedFetchMethod(env, domain);
const skipNative = cachedMethod === "zenrows";

// Cache successful method for 7 days
await cacheFetchMethod(env, domain, "native");

Signal Extraction

From raw HTML, we extract classification signals:

function extractSignalsFromHtml(html, url) {
  const signals = {};

  // Core metadata
  signals.title = extractTitle(html);
  signals.description = extractMetaDescription(html);
  signals.word_count = countWords(html);
  signals.h1_tags = extractH1Tags(html);

  // Affiliate/Commercial signals
  signals.has_affiliate_links = /affiliate|\?ref=|\?tag=/i.test(html);
  signals.has_sponsored_disclosure = /sponsored\s*content/i.test(html);
  signals.has_price_mentions = /\$\d+/i.test(html);
  signals.has_buy_cta = /buy now|shop now|get started/i.test(html);

  // Review/Comparison signals
  signals.has_rating_schema = /"@type"\s*:\s*"Review"/i.test(html);
  signals.has_pros_cons = /\b(pros|cons)\b/i.test(html);
  signals.has_comparison_table = /\b(vs|versus|compare)\b/i.test(html);

  // Expert signals
  signals.has_expert_byline = /\b(ph\.?d|m\.?d|ceo|founder)\b/i.test(html);
  signals.has_data_citations = /according to|study shows/i.test(html);

  // Schema.org types
  signals.schema_types = extractSchemaTypes(html);

  return signals;
}

Signal-Based Page Type Inference

function inferPageTypeFromContent(signals) {
  if (signals.is_press_release_format) return "press_release";
  if (signals.has_rating_schema || signals.has_pros_cons) {
    if (signals.has_comparison_table) return "comparison_page";
    return "review_page";
  }
  if (signals.schema_types?.includes("faq")) return "faq_page";
  if (signals.schema_types?.includes("howto")) return "howto_article";
  if (signals.schema_types?.includes("article")) return "blog_post";
  return null;
}

SERP Data Optimization

For URLs with SERP data from ranked keywords, content fetch can be skipped:

// Skip expensive content fetch when we have SERP data
skip_content: skip_content || !!(serp_title || serp_description),

This saves ~$0.001/URL for ranking URLs while still providing classification context.


Stage 4: LLM Classification

Purpose

Final fallback using Workers AI (Llama 3.3 70B) for URLs that couldn't be classified with sufficient confidence by earlier stages.

When LLM is Triggered

// LLM needed if:
// 1. No domain_type or very low confidence
result.needs_llm = !result.domain_type.value || result.domain_type.confidence < 50;

// 2. Content parsing didn't help enough
if (result.confidence < 60) {
  stage1Result.needs_llm = true;
}

// 3. Rules engine explicitly flagged it
if (stage1Result.needs_llm) {
  // Domain type couldn't be determined from patterns
}

Budget Enforcement

LLM calls are budget-capped to prevent cost spirals:

const MAX_DAILY_LLM_BUDGET_USD = 50.0;

const dailyCost = await getDailyServiceCost(env, COST_SERVICES.CF_WORKERS_AI);
if (dailyCost >= MAX_DAILY_LLM_BUDGET_USD) {
  return {
    skipped: true,
    skip_reason: "daily_budget_exceeded",
    needs_review: true,
  };
}

System Prompt Architecture

The LLM is given a tree-based classification prompt:

=== CLASSIFICATION APPROACH: TWO-LEVEL TREE ===
CRITICAL ORDER:
1. tier1_type FIRST -> constrains domain_type
2. structural_type FIRST -> constrains page_type

=== TIER1_TYPE (7 universal archetypes) ===
- "platform" - SaaS, apps, tools
- "marketplace" - Multi-sided listings
- "commerce" - Direct retail, D2C
- "service" - Sells services
- "information" - Content publishing
- "community" - Forums, social, UGC
- "institutional" - Government, education

Few-Shot Examples

The prompt includes examples to guide classification:

**Example 1: Code Repository**
URL: https://github.com/anthropics/claude-code
-> {"tier1_type": "platform", "domain_type": "code_repository",
"structural_type": "detail", "page_type": "repository_page"}

**Example 2: News Article**
URL: https://www.nytimes.com/2024/01/15/technology/ai-advances.html
-> {"tier1_type": "information", "domain_type": "news_publisher",
"structural_type": "article", "page_type": "news_article"}

Response Parsing and Normalization

LLM outputs are normalized to valid enum values:

const PAGE_TYPE_ALIASES = {
  "terms_of_use_page": "legal_terms_page",
  "article_page": "blog_post",
  "product_detail_page": "product_page",
  "application_page": "signup_page",
  // ... 50+ aliases
};

function normalizeEnumValue(value, enumObj, aliases) {
  const normalized = value.toLowerCase().trim();

  // Check exact match
  if (isValidEnumValue(normalized, enumObj)) return normalized;

  // Check aliases
  if (aliases[normalized]) return aliases[normalized];

  // Fuzzy matching
  return findClosestMatch(normalized, enumObj);
}

Validation Against Constraints

LLM outputs are validated against the tree structure:

// Validate page_type is valid for structural_type
if (!isPageTypeValidForStructural(pageTypeValue, structuralType)) {
  const validPageTypes = getValidPageTypesForStructural(structuralType);
  const corrected = validPageTypes.find((pt) => pt.includes(pageTypeValue.split("_")[0]));
  if (corrected) {
    pageTypeValue = corrected;
  }
}

// Validate domain_type is valid for tier1_type
const validTypes = DOMAIN_TYPES_BY_TIER1[result.tier1_type] || [];
if (!validTypes.includes(result.domain_type)) {
  // Find closest valid type
}
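The elided "find closest valid type" step could be, for example, a token-overlap heuristic; a hypothetical sketch (not the actual implementation):

```javascript
// Hypothetical sketch: pick the valid type sharing the most
// underscore-separated tokens with the LLM's proposed value.
function closestValidType(proposed, validTypes) {
  const tokens = new Set(proposed.split("_"));
  let best = validTypes[0] || null;
  let bestScore = -1;
  for (const t of validTypes) {
    const score = t.split("_").filter((tok) => tokens.has(tok)).length;
    if (score > bestScore) { best = t; bestScore = score; }
  }
  return best;
}
```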

Risky Blackhat Rejection

LLM sometimes over-classifies as spam. This is caught and corrected:

if (proposedChannel === "risky_blackhat") {
  const isActuallySpammy = isSpammyDomain(url, domain);

  if (!isActuallySpammy) {
    console.log(`Rejected risky_blackhat for legitimate domain: ${domain}`);
    // Derive the channel from domain_type instead
    result.channel_bucket = deriveChannelFromDomainType(domainType);
  }
}

Self-Learning System

Overview

The pipeline gets smarter over time through a feedback loop: high-confidence classifications are written back to the Vectorize index, where they inform future Stage 2 lookups.

Learning Triggers

// From classification-config.js
export const LEARNING_MIN_CONFIDENCE = 75;
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;

// URL learning
if (classification.final_confidence >= LEARNING_MIN_CONFIDENCE) {
  await learnFromClassification(input, classification, confidence, env);
}

// Domain learning (stricter threshold)
if (shouldTriggerLearning(confidence, "domain")) {
  // Only at 80%+ confidence
}

What Gets Learned

Not all classifications feed the learning loop:

export async function learnFromClassification(input, classification, confidence, env) {
  // Skip rule-only classifications (already deterministic)
  if (classification.source === "rule") {
    return { learned: false, reason: "rule_based" };
  }

  // Skip spam classifications (would pollute index)
  if (domainType === "pbn_suspected" || domainType === "spam_low_quality") {
    return { learned: false, reason: "spam_classification" };
  }

  // Add to Vectorize
  const result = await addClassifiedUrl(input, classification, env, "learned");
  return { learned: true, ...result };
}

Vectorize Index Updates

async function addClassifiedUrl(input, classification, env, source = "llm") {
  // Convert to feature text
  const featureText = urlToFeatureText({
    url,
    domain,
    partial_classification: classification,
  });

  // Generate embedding
  const embedding = await getEmbedding(featureText, env);

  // Upsert to Vectorize
  await upsertVector(
    id,
    embedding,
    {
      url,
      domain,
      classifications: classification,
      source,
      indexed_at: Date.now(),
    },
    env
  );
}

Confidence Thresholds

Threshold Alignment Philosophy

All thresholds are centralized in classification-config.js with clear rationale:

/**
 * RATIONALE FOR THRESHOLD ALIGNMENT:
 * - Cache & Pipeline: Same threshold (65%) so cached results behave identically
 *   to fresh pipeline runs - no "gap" where cache accepts what pipeline wouldn't
 * - Learning: Higher threshold (75%) to avoid polluting Vectorize with uncertain
 *   classifications that could propagate errors to future predictions
 * - Domain Learning: Highest threshold (80%) because domain-level classifications
 *   affect many URLs and errors are more costly
 */

export const CACHE_MIN_CONFIDENCE = 65;
export const PIPELINE_MIN_CONFIDENCE = 65;
export const LEARNING_MIN_CONFIDENCE = 75;
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;

Per-Stage Thresholds

| Stage | Exit Threshold | Learning Threshold |
|---|---|---|
| 0: Cache | >= 65% | N/A |
| 1: Rules | >= 80% + page_type | >= 95% |
| 2: Vectorize | >= 70% | >= 85% |
| 3: Content | >= 60% | N/A |
| 4: LLM | Always exits | >= 75% |

Confidence Calculation

Final confidence is the MAX of all dimension confidences:

function calculateFinalConfidence(classification) {
  const confidences = [
    getConfidence(classification.domain_type),
    getConfidence(classification.tier1_type),
    getConfidence(classification.channel_bucket),
    getConfidence(classification.page_type),
    getConfidence(classification.quality_tier),
  ].filter((c) => c > 0);

  return confidences.length > 0 ? Math.max(...confidences) : 0;
}

Cost Optimization

Cost Breakdown

| Stage | Cost per URL | Typical Hit Rate | Expected Cost |
|---|---|---|---|
| Cache | $0 | 40% | $0 |
| Rules | $0 | 50% | $0 |
| Vectorize | $0.00001 | 30% | $0.000003 |
| Content | $0.000125 | 25% | $0.00003 |
| LLM | $0.0002 | 5% | $0.00001 |

Expected cost per URL: ~$0.00004 (assuming typical distribution)
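
Reproducing the table's arithmetic, the expected cost is the sum of per-stage cost times hit rate:

```javascript
// Per-stage cost and hit rate from the table above.
const stages = [
  { name: "cache", cost: 0, hitRate: 0.40 },
  { name: "rules", cost: 0, hitRate: 0.50 },
  { name: "vectorize", cost: 0.00001, hitRate: 0.30 },
  { name: "content", cost: 0.000125, hitRate: 0.25 },
  { name: "llm", cost: 0.0002, hitRate: 0.05 },
];

// 0.000003 + 0.00003125 + 0.00001 ≈ 0.000044, i.e. ~$0.00004/URL
const expectedCostPerUrl = stages.reduce((sum, s) => sum + s.cost * s.hitRate, 0);
```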

Budget Tracking

All costs are tracked via the cost tracking service:

await trackCost(env, {
  service: COST_SERVICES.CF_WORKERS_AI,
  cost_usd: 0.0002,
  units: 1,
  metadata: { url, model: LLM_MODEL },
});

await trackCost(env, {
  service: COST_SERVICES.DATAFORSEO_INSTANT_PAGES,
  cost_usd: actualCost,
  units: 1,
  metadata: { url },
});

Daily Budget Caps

LLM calls are capped at $50/day:

const MAX_DAILY_LLM_BUDGET_USD = 50.0;

if (dailyCost >= MAX_DAILY_LLM_BUDGET_USD) {
  return { skipped: true, skip_reason: "daily_budget_exceeded" };
}

Optimization Strategies

  1. SERP Data Bypass: Skip content fetch for URLs with SERP data
  2. Fetch Method Caching: Remember which domains need ZenRows
  3. Domain Batch Processing: Classify domain once, apply to all URLs
  4. Early Exit: Stop pipeline as soon as confidence threshold met

Queue Processing

Message Types

The URL classify queue handles multiple message types:

// Classify a single URL
{ type: "classify_url", url_id, url, domain, domain_rank }

// Classify a backlink (source + target URL)
{ type: "classify_backlink", backlink_id, source_url_id, source_url, target_url }

// Batch classify all URLs from a domain
{ type: "classify_domain_batch", domain, domain_rank, urls: [...] }

Parallel Processing

Messages are processed in parallel for throughput:

const processingPromises = reorderedMessages.map(async (message) => {
  // Process each message
});

const processingResults = await Promise.allSettled(processingPromises);

Domain-Diverse Ordering

Messages are reordered to maximize domain diversity:

function reorderForDomainDiversity(messages) {
  // Instead of: [a.com, a.com, a.com, b.com, b.com, c.com]
  // We process: [a.com, b.com, c.com, a.com, b.com, a.com]
  // This spreads load and reduces per-domain throttle contention
}
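A runnable round-robin version of this sketch, assuming each message exposes a `domain` (the real message shape and implementation may differ):

```javascript
// Illustrative implementation: group messages by domain, then emit them
// round-robin so consecutive messages hit different domains.
function reorderForDomainDiversitySketch(messages) {
  const byDomain = new Map();
  for (const m of messages) {
    const d = m.domain || "";
    if (!byDomain.has(d)) byDomain.set(d, []);
    byDomain.get(d).push(m);
  }
  const groups = [...byDomain.values()];
  const out = [];
  for (let i = 0; out.length < messages.length; i++) {
    for (const g of groups) if (i < g.length) out.push(g[i]);
  }
  return out;
}
```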

Retry Logic

Failed messages use exponential backoff:

if (shouldRetry && attempts < MAX_QUEUE_RETRIES) {
  message.retry({
    delaySeconds: Math.pow(2, attempts) * 5, // 5s, 10s, 20s
  });
}

Files Reference

| File | Purpose |
|---|---|
| url-classifier.js | Main orchestrator - runs all 5 stages |
| classifier-rules-engine.js | Stage 1 - Pattern matching |
| classifier-vectorize.js | Stage 2 - ML similarity |
| classifier-content-parser.js | Stage 3 - Content fetch & signals |
| classifier-llm.js | Stage 4 - LLM classification |
| classification-config.js | Thresholds and configuration |
| classification-constants.js | Enums and valid values |
| url-classify-consumer.js | Queue consumer |