URL Classification Pipeline
Overview
The URL classification pipeline is a 5-stage system that classifies URLs progressively, from the cheapest method to the most expensive. The architecture is designed around cost optimization: most URLs are classified in the free early stages, with expensive LLM calls reserved for the truly ambiguous cases.
Pipeline Philosophy
Cost-Ordered Stages
The pipeline is ordered from free to expensive:
| Stage | Name | Cost | Latency | Purpose |
|---|---|---|---|---|
| 0 | Cache | $0 | ~5ms | Reuse existing domain classifications |
| 1 | Rules | $0 | ~10ms | Pattern matching against 7,000+ known domains |
| 2 | Vectorize | ~$0.00001 | ~50ms | ML similarity to labeled examples |
| 3 | Content | $0-$0.001 | ~500ms | Page fetch and signal extraction |
| 4 | LLM | ~$0.0002 | ~1s | Workers AI classification |
Early Exit Optimization
Each stage can exit early if confidence thresholds are met:
// From classification-config.js
export const CACHE_MIN_CONFIDENCE = 65;
export const PIPELINE_MIN_CONFIDENCE = 65;
export const LEARNING_MIN_CONFIDENCE = 75;
The pipeline stops as soon as:
- A stage produces >= 65% confidence on core dimensions
- We have a valid page_type classification
- The needs_llm flag is NOT set
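Those three conditions can be sketched as a small predicate. This is an illustrative sketch, not the actual implementation; the result shape follows the per-dimension `{value, confidence, source}` format shown below:

```javascript
// Sketch of the early-exit check: stop the pipeline once core dimensions
// are confident enough, a page_type exists, and no LLM escalation is flagged.
const PIPELINE_MIN_CONFIDENCE = 65;

function shouldExitPipeline(result) {
  const coreOk =
    (result.domain_type?.confidence ?? 0) >= PIPELINE_MIN_CONFIDENCE &&
    (result.channel_bucket?.confidence ?? 0) >= PIPELINE_MIN_CONFIDENCE;
  const hasPageType = Boolean(result.page_type?.value);
  return coreOk && hasPageType && !result.needs_llm;
}
```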
Per-Dimension Confidence
Every classification dimension is tracked separately with {value, confidence, source}:
{
domain_type: { value: "saas_product", confidence: 95, source: "domain_database_v2" },
page_type: { value: "blog_post", confidence: 80, source: "url_path_pattern" },
channel_bucket: { value: "owned_content_marketing", confidence: 85, source: "derived_from_domain_type" },
structural_type: { value: "article", confidence: 85, source: "article_blog_pattern" }
}
This allows:
- Targeted improvement of low-confidence dimensions
- Intelligent merging across stages (higher confidence wins)
- Detailed classification audit trails
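The "higher confidence wins" merge rule can be sketched with a small helper (illustrative; the production merge code may differ):

```javascript
// Sketch: merge one dimension across stages, keeping the higher-confidence value.
// Each argument follows the {value, confidence, source} shape.
function mergeDimension(existing, incoming) {
  if (!incoming?.value) return existing;
  if (!existing?.value) return incoming;
  return incoming.confidence > existing.confidence ? incoming : existing;
}
```

For example, an 80%-confidence `url_path_pattern` result for page_type survives a later 60%-confidence Vectorize guess for the same dimension.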
Stage 0: Cache Lookup
Purpose
Check if the domain has already been classified. Since domain-level attributes (domain_type, tier1_type, channel_bucket) are stable across URLs from the same domain, caching provides massive speedups.
Implementation
// From url-classifier.js
if (!skip_domain_cache && env?.DB && input.domain) {
const domainResult = await getCachedDomainClassification(input.domain, env);
if (domainResult && domainResult.property_type) {
// Cache HIT - use domain classification
cachedDomainClassification = domainResult;
result.stages_run.push({
stage: 0,
name: "domain_cache",
hit: true,
property_type: domainResult.property_type,
confidence: domainResult.classification_confidence,
cost: 0,
});
}
}
Cache Hierarchy
- KV Cache (fastest): Check the LOOKUP_CACHE KV namespace first
- D1 Lookup (fallback): Query the domains table on a KV miss
- KV Backfill: Store D1 results in KV for future lookups
// KV cache key format
const cacheKey = `domain-class:${domainHash}`;
// TTL: 24 hours for domain classifications
const DOMAIN_CACHE_TTL = 86400;
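The three-level lookup can be sketched as follows. The `deps.kvGet`/`deps.kvPut`/`deps.d1Lookup`/`deps.hash` helpers are hypothetical stand-ins for the real KV, D1, and hashing calls:

```javascript
// Sketch of the KV-first, D1-fallback, KV-backfill hierarchy described above.
const DOMAIN_CACHE_TTL = 86400; // 24 hours

async function getCachedDomainClassification(domain, deps) {
  const key = `domain-class:${deps.hash(domain)}`;
  // 1. KV cache (fastest)
  const cached = await deps.kvGet(key);
  if (cached) return cached;
  // 2. D1 fallback on KV miss
  const row = await deps.d1Lookup(domain);
  if (!row) return null;
  // 3. Backfill KV so future lookups skip D1
  await deps.kvPut(key, row, { expirationTtl: DOMAIN_CACHE_TTL });
  return row;
}
```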
When to Skip Cache
The skip_domain_cache option bypasses cache when:
- Force-reclassifying a domain
- Testing classification changes
- Domain was recently corrected
Stage 1: Rules Engine
Purpose
Fast, free classification using deterministic pattern matching. This stage handles ~70% of all URLs without any external API calls.
Rule Categories
Rule 1: Owned Domain Detection (100% confidence)
if (target_domain && isOwnedDomain(normalizedDomain, target_domain)) {
result.channel_bucket = {
value: CHANNEL_BUCKETS.OWNED_BRAND_SITE,
confidence: 100,
source: "owned_domain_match",
};
}
Detects:
- Exact match: company.com === company.com
- Subdomain: blog.company.com belongs to company.com
- Common variants: getcompany.com, company.io, trycompany.com
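A minimal `isOwnedDomain` covering those three cases might look like this (a sketch; the variant prefixes and TLDs here are illustrative, and a production version would regex-escape the brand name):

```javascript
// Sketch: detect exact match, subdomain, and common brand variants.
function isOwnedDomain(candidate, target) {
  if (candidate === target) return true;              // company.com === company.com
  if (candidate.endsWith(`.${target}`)) return true;  // blog.company.com
  const base = target.split(".")[0];                  // "company"
  // Illustrative variant list: get/try/use prefixes, alternate TLDs
  const variantRe = new RegExp(`^(get|try|use)?${base}\\.(com|io|co|app)$`);
  return variantRe.test(candidate);                   // getcompany.com, company.io
}
```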
Rule 2: Domain Database (95% confidence)
A curated database of 7,000+ domains with known classifications:
const v2Classification = getDomainClassification(normalizedDomain);
if (v2Classification) {
result.domain_type = {
value: normalizedDomainType,
confidence: 95,
source: "domain_database_v2",
};
}
Database includes major sites like:
- News: nytimes.com, techcrunch.com, theverge.com
- SaaS: notion.so, slack.com, github.com
- E-commerce: amazon.com, shopify.com, etsy.com
- Social: twitter.com, linkedin.com, reddit.com
Rule 3: TLD-Based Rules (95% confidence)
const TLD_RULES = {
".gov": DOMAIN_TYPES.GOVERNMENT_SITE,
".edu": DOMAIN_TYPES.EDUCATION_ACADEMIC,
".mil": DOMAIN_TYPES.GOVERNMENT_SITE,
".org": null, // Needs further classification
".ac.uk": DOMAIN_TYPES.EDUCATION_ACADEMIC,
};
Rule 4: Platform Detection (85% confidence)
Detects known platforms from URL patterns:
const PLATFORM_URL_PATTERNS = {
"reddit.com": {
domain_type: "forum_community",
page_type_regex: [
{ pattern: /\/r\/[^/]+\/comments\//, page_type: "forum_thread" },
{ pattern: /\/r\/[^/]+\/?$/, page_type: "category_index_page" },
]
},
"github.com": {
domain_type: "code_repository",
page_type_regex: [
{ pattern: /\/[^/]+\/[^/]+\/?$/, page_type: "repository_page" },
{ pattern: /\/issues\/\d+/, page_type: "forum_thread" },
]
}
};
Rule 5: URL Path Patterns (70-85% confidence)
Generic path patterns that indicate page type:
| Pattern | Page Type | Confidence |
|---|---|---|
| /blog/, /posts/ | blog_post | 85% |
| /docs/, /documentation/ | documentation_page | 90% |
| /product/, /products/ | product_page | 85% |
| /about, /about-us | about_page | 90% |
| /pricing | pricing_page | 85% |
| /contact | contact_page | 90% |
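Applying the table is a first-match scan over ordered patterns. A sketch (the patterns mirror the table above; the helper name is illustrative):

```javascript
// Sketch: first matching path pattern wins, so order the list by specificity.
const PATH_PATTERNS = [
  { pattern: /\/(blog|posts)\//i, page_type: "blog_post", confidence: 85 },
  { pattern: /\/(docs|documentation)\//i, page_type: "documentation_page", confidence: 90 },
  { pattern: /\/products?\//i, page_type: "product_page", confidence: 85 },
  { pattern: /\/about(-us)?\b/i, page_type: "about_page", confidence: 90 },
  { pattern: /\/pricing\b/i, page_type: "pricing_page", confidence: 85 },
  { pattern: /\/contact\b/i, page_type: "contact_page", confidence: 90 },
];

function classifyPath(pathname) {
  for (const rule of PATH_PATTERNS) {
    if (rule.pattern.test(pathname)) {
      return { value: rule.page_type, confidence: rule.confidence, source: "url_path_pattern" };
    }
  }
  return null; // no pattern matched; later stages take over
}
```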
Rule 6: Spam/Risk Detection (80-90% confidence)
Detects risky URLs that should be flagged:
// Comment spam patterns
if (/#comment-\d+/i.test(url) || /\?replytocom=/i.test(url)) {
return { structural_type: "spam", confidence: 90 };
}
// Suspicious TLDs
const SUSPICIOUS_TLDS = [".tk", ".ml", ".ga", ".cf", ".gq", ".xyz"];
// PBN indicators
const PBN_PATTERNS = [
/^[a-z]+-[a-z]+-[a-z]+-[a-z]+\./, // keyword-stuffed domains
/\d{5,}\./, // long number sequences
];
Structural Type Detection
Critical Architecture: The rules engine classifies structural_type FIRST, then constrains page_type to valid children of that structural type.
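The constraint can be sketched as a parent-to-children map. The value lists below are illustrative samples, not the full production enums:

```javascript
// Sketch: structural_type constrains which page_type values are valid.
// Illustrative subset of the real enum tree.
const PAGE_TYPES_BY_STRUCTURAL = {
  article: ["blog_post", "news_article", "howto_article", "press_release"],
  detail: ["product_page", "repository_page", "pricing_page"],
  index: ["category_index_page", "blog_index_page"],
  thread: ["forum_thread"],
};

function constrainPageType(structuralType, pageType) {
  const allowed = PAGE_TYPES_BY_STRUCTURAL[structuralType] || [];
  return allowed.includes(pageType) ? pageType : null;
}
```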
Stage 2: Vectorize (ML Similarity)
Purpose
Find similar URLs from a labeled index and vote on classification. This stage catches URLs that don't match explicit rules but are similar to known examples.
How It Works
Feature Text Generation
URLs are converted to searchable text that captures their characteristics:
function urlToFeatureText(input) {
const parts = [];
// Domain info
parts.push(`domain: ${normalizedDomain}`);
parts.push(`tld: ${tld}`);
// URL path components
parts.push(`path: ${pathParts.join(" ")}`);
// Page title (huge signal)
if (effectiveTitle) {
parts.push(`title: ${cleanTitle}`);
}
// SERP snippet
if (serp_description) {
parts.push(`snippet: ${cleanDesc}`);
}
// Authority level
if (domain_rank >= 80) parts.push("authority: tier1");
return parts.join(" | ");
}
Similarity Thresholds
Different fields require different similarity scores:
const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75; // Domain type needs high similarity
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80; // Page type needs very high similarity
const MIN_SIMILARITY_FOR_ANY = 0.65; // Absolute minimum to use any result
Voting Algorithm
function voteOnClassifications(matches, field) {
  const votes = {};
  for (const match of matches) {
    const classifications = match.metadata?.classifications || {};
    const value = classifications[field];
    if (value) {
      // Weight each vote by the match's similarity score
      votes[value] = (votes[value] || 0) + match.score;
    }
  }
  // Winner is the [value, weightedScore] pair with the highest total
  return Object.entries(votes)
    .sort((a, b) => b[1] - a[1])[0];
}
Self-Reinforcement
High-confidence Vectorize results reinforce themselves:
// When Vectorize confidence >= 85%, prepare feedback
if (result.final_confidence >= 85) {
result.vectorize_feedback = {
url: input.url,
domain: input.domain,
classifications: { ... },
source: "vectorize_reinforcement",
};
}
Stage 3: Content Analysis
Purpose
Fetch actual page content and extract classification signals. This stage is expensive (requires HTTP requests) but provides rich signals that rules and Vectorize cannot.
Fetch Strategy
Content is fetched using a tiered fallback system: a native fetch is tried first, with fallbacks such as ZenRows when a site blocks direct requests.
Per-Domain Throttling
To avoid hammering individual sites, fetches are throttled per-domain:
return await withDomainThrottle(
domain,
env,
async () => {
return await _fetchWithFallbacksInner(url, domain, env, options);
},
{ maxWaitMs: 1000, retries: 3, skipOnLimit: false }
);
Fetch Method Caching
Successful fetch methods are cached to skip failed approaches:
// If native fetch was blocked before, skip to ZenRows
const cachedMethod = await getCachedFetchMethod(env, domain);
const skipNative = cachedMethod === "zenrows";
// Cache successful method for 7 days
await cacheFetchMethod(env, domain, "native");
Signal Extraction
From raw HTML, we extract classification signals:
function extractSignalsFromHtml(html, url) {
const signals = {};
// Core metadata
signals.title = extractTitle(html);
signals.description = extractMetaDescription(html);
signals.word_count = countWords(html);
signals.h1_tags = extractH1Tags(html);
// Affiliate/Commercial signals
signals.has_affiliate_links = /affiliate|\?ref=|\?tag=/i.test(html);
signals.has_sponsored_disclosure = /sponsored\s*content/i.test(html);
signals.has_price_mentions = /\$\d+/i.test(html);
signals.has_buy_cta = /buy now|shop now|get started/i.test(html);
// Review/Comparison signals
signals.has_rating_schema = /"@type"\s*:\s*"Review"/i.test(html);
signals.has_pros_cons = /\b(pros|cons)\b/i.test(html);
signals.has_comparison_table = /\b(vs|versus|compare)\b/i.test(html);
// Expert signals
signals.has_expert_byline = /\b(ph\.?d|m\.?d|ceo|founder)\b/i.test(html);
signals.has_data_citations = /according to|study shows/i.test(html);
// Schema.org types
signals.schema_types = extractSchemaTypes(html);
return signals;
}
Signal-Based Page Type Inference
function inferPageTypeFromContent(signals) {
if (signals.is_press_release_format) return "press_release";
if (signals.has_rating_schema || signals.has_pros_cons) {
if (signals.has_comparison_table) return "comparison_page";
return "review_page";
}
if (signals.schema_types?.includes("faq")) return "faq_page";
if (signals.schema_types?.includes("howto")) return "howto_article";
if (signals.schema_types?.includes("article")) return "blog_post";
return null;
}
SERP Data Optimization
For URLs with SERP data from ranked keywords, content fetch can be skipped:
// Skip expensive content fetch when we have SERP data
skip_content: skip_content || !!(serp_title || serp_description),
This saves ~$0.001/URL for ranking URLs while still providing classification context.
Stage 4: LLM Classification
Purpose
Final fallback using Workers AI (Llama 3.3 70B) for URLs that couldn't be classified with sufficient confidence by earlier stages.
When LLM is Triggered
// LLM needed if:
// 1. No domain_type or very low confidence
result.needs_llm = !result.domain_type.value || result.domain_type.confidence < 50;
// 2. Content parsing didn't help enough
if (result.confidence < 60) {
stage1Result.needs_llm = true;
}
// 3. Rules engine explicitly flagged it
if (stage1Result.needs_llm) {
// Domain type couldn't be determined from patterns
}
Budget Enforcement
LLM calls are budget-capped to prevent cost spirals:
const MAX_DAILY_LLM_BUDGET_USD = 50.0;
const dailyCost = await getDailyServiceCost(env, COST_SERVICES.CF_WORKERS_AI);
if (dailyCost >= MAX_DAILY_LLM_BUDGET_USD) {
return {
skipped: true,
skip_reason: "daily_budget_exceeded",
needs_review: true,
};
}
System Prompt Architecture
The LLM is given a tree-based classification prompt:
=== CLASSIFICATION APPROACH: TWO-LEVEL TREE ===
CRITICAL ORDER:
1. tier1_type FIRST -> constrains domain_type
2. structural_type FIRST -> constrains page_type
=== TIER1_TYPE (7 universal archetypes) ===
- "platform" - SaaS, apps, tools
- "marketplace" - Multi-sided listings
- "commerce" - Direct retail, D2C
- "service" - Sells services
- "information" - Content publishing
- "community" - Forums, social, UGC
- "institutional" - Government, education
Few-Shot Examples
The prompt includes examples to guide classification:
**Example 1: Code Repository**
URL: https://github.com/anthropics/claude-code
-> {"tier1_type": "platform", "domain_type": "code_repository",
"structural_type": "detail", "page_type": "repository_page"}
**Example 2: News Article**
URL: https://www.nytimes.com/2024/01/15/technology/ai-advances.html
-> {"tier1_type": "information", "domain_type": "news_publisher",
"structural_type": "article", "page_type": "news_article"}
Response Parsing and Normalization
LLM outputs are normalized to valid enum values:
const PAGE_TYPE_ALIASES = {
"terms_of_use_page": "legal_terms_page",
"article_page": "blog_post",
"product_detail_page": "product_page",
"application_page": "signup_page",
// ... 50+ aliases
};
function normalizeEnumValue(value, enumObj, aliases) {
const normalized = value.toLowerCase().trim();
// Check exact match
if (isValidEnumValue(normalized, enumObj)) return normalized;
// Check aliases
if (aliases[normalized]) return aliases[normalized];
// Fuzzy matching
return findClosestMatch(normalized, enumObj);
}
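The fuzzy fallback could be as simple as a longest-common-prefix score. This sketch takes a plain array of enum values for simplicity; the real `findClosestMatch` works against the enum object and may be more sophisticated:

```javascript
// Sketch: pick the enum value sharing the longest common prefix with the input.
function findClosestMatch(value, enumValues) {
  let best = null;
  let bestScore = 0;
  for (const candidate of enumValues) {
    let score = 0;
    while (score < value.length && score < candidate.length && value[score] === candidate[score]) {
      score++;
    }
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best; // null when nothing shares even one leading character
}
```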
Validation Against Constraints
LLM outputs are validated against the tree structure:
// Validate page_type is valid for structural_type
if (!isPageTypeValidForStructural(pageTypeValue, structuralType)) {
const validPageTypes = getValidPageTypesForStructural(structuralType);
const corrected = validPageTypes.find(pt => pt.includes(pageTypeValue.split("_")[0]));
if (corrected) {
pageTypeValue = corrected;
}
}
// Validate domain_type is valid for tier1_type
const validTypes = DOMAIN_TYPES_BY_TIER1[result.tier1_type] || [];
if (!validTypes.includes(result.domain_type)) {
// Find closest valid type
}
Risky Blackhat Rejection
LLM sometimes over-classifies as spam. This is caught and corrected:
if (proposedChannel === "risky_blackhat") {
const isActuallySpammy = isSpammyDomain(url, domain);
if (!isActuallySpammy) {
console.log(`Rejected risky_blackhat for legitimate domain: ${domain}`);
// Derive channel from domain_type instead
result.channel_bucket = deriveChannelFromDomainType(domainType);
}
}
Self-Learning System
Overview
The pipeline gets smarter over time through a feedback loop: high-confidence classifications are written back to the Vectorize index, so Stage 2 recognizes similar URLs in the future.
Learning Triggers
// From classification-config.js
export const LEARNING_MIN_CONFIDENCE = 75;
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;
// URL learning
if (classification.final_confidence >= LEARNING_MIN_CONFIDENCE) {
await learnFromClassification(input, classification, confidence, env);
}
// Domain learning (stricter threshold)
if (shouldTriggerLearning(confidence, "domain")) {
// Only at 80%+ confidence
}
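A plausible shape for `shouldTriggerLearning`, using the thresholds from classification-config.js (a sketch; the real function may take different arguments):

```javascript
// Sketch: scope-aware learning gate. Domain-level learning uses a stricter
// threshold because one domain classification affects many URLs.
const LEARNING_MIN_CONFIDENCE = 75;
const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;

function shouldTriggerLearning(confidence, scope = "url") {
  const threshold =
    scope === "domain" ? DOMAIN_LEARNING_MIN_CONFIDENCE : LEARNING_MIN_CONFIDENCE;
  return confidence >= threshold;
}
```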
What Gets Learned
Not all classifications feed the learning loop:
export async function learnFromClassification(input, classification, confidence, env) {
// Skip rule-only classifications (already deterministic)
if (classification.source === "rule") {
return { learned: false, reason: "rule_based" };
}
// Skip spam classifications (would pollute index)
if (domainType === "pbn_suspected" || domainType === "spam_low_quality") {
return { learned: false, reason: "spam_classification" };
}
// Add to Vectorize
const result = await addClassifiedUrl(input, classification, env, "learned");
return { learned: true, ...result };
}
Vectorize Index Updates
async function addClassifiedUrl(input, classification, env, source = "llm") {
// Convert to feature text
const featureText = urlToFeatureText({
url,
domain,
partial_classification: classification,
});
// Generate embedding
const embedding = await getEmbedding(featureText, env);
// Upsert to Vectorize
await upsertVector(
id,
embedding,
{
url,
domain,
classifications: classification,
source,
indexed_at: Date.now(),
},
env
);
}
Confidence Thresholds
Threshold Alignment Philosophy
All thresholds are centralized in classification-config.js with clear rationale:
/**
* RATIONALE FOR THRESHOLD ALIGNMENT:
* - Cache & Pipeline: Same threshold (65%) so cached results behave identically
* to fresh pipeline runs - no "gap" where cache accepts what pipeline wouldn't
* - Learning: Higher threshold (75%) to avoid polluting Vectorize with uncertain
* classifications that could propagate errors to future predictions
* - Domain Learning: Highest threshold (80%) because domain-level classifications
* affect many URLs and errors are more costly
*/
export const CACHE_MIN_CONFIDENCE = 65;
export const PIPELINE_MIN_CONFIDENCE = 65;
export const LEARNING_MIN_CONFIDENCE = 75;
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;
Per-Stage Thresholds
| Stage | Exit Threshold | Learning Threshold |
|---|---|---|
| 0: Cache | >= 65% | N/A |
| 1: Rules | >= 80% + page_type | >= 95% |
| 2: Vectorize | >= 70% | >= 85% |
| 3: Content | >= 60% | N/A |
| 4: LLM | Always exits | >= 75% |
Confidence Calculation
Final confidence is the MAX of all dimension confidences:
function calculateFinalConfidence(classification) {
const confidences = [
getConfidence(classification.domain_type),
getConfidence(classification.tier1_type),
getConfidence(classification.channel_bucket),
getConfidence(classification.page_type),
getConfidence(classification.quality_tier),
].filter(c => c > 0);
return confidences.length > 0 ? Math.max(...confidences) : 0;
}
Cost Optimization
Cost Breakdown
| Stage | Cost per URL | Typical Hit Rate | Expected Cost |
|---|---|---|---|
| Cache | $0 | 40% | $0 |
| Rules | $0 | 50% | $0 |
| Vectorize | $0.00001 | 30% | $0.000003 |
| Content | $0.000125 | 25% | $0.00003 |
| LLM | $0.0002 | 5% | $0.00001 |
Expected cost per URL: ~$0.00004 (assuming typical distribution)
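The blended figure follows directly from cost x hit rate per paid stage:

```javascript
// Expected cost per URL = sum of (stage cost x fraction of URLs reaching that paid stage).
// Cache and Rules contribute $0 and are omitted.
const paidStages = [
  { name: "vectorize", cost: 0.00001, hitRate: 0.30 },
  { name: "content", cost: 0.000125, hitRate: 0.25 },
  { name: "llm", cost: 0.0002, hitRate: 0.05 },
];
const expectedCostPerUrl = paidStages.reduce((sum, s) => sum + s.cost * s.hitRate, 0);
// 0.000003 + 0.00003125 + 0.00001 = ~$0.000044 per URL
```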
Budget Tracking
All costs are tracked via the cost tracking service:
await trackCost(env, {
service: COST_SERVICES.CF_WORKERS_AI,
cost_usd: 0.0002,
units: 1,
metadata: { url, model: LLM_MODEL },
});
await trackCost(env, {
service: COST_SERVICES.DATAFORSEO_INSTANT_PAGES,
cost_usd: actualCost,
units: 1,
metadata: { url },
});
Daily Budget Caps
LLM calls are capped at $50/day:
const MAX_DAILY_LLM_BUDGET_USD = 50.0;
if (dailyCost >= MAX_DAILY_LLM_BUDGET_USD) {
return { skipped: true, skip_reason: "daily_budget_exceeded" };
}
Optimization Strategies
- SERP Data Bypass: Skip content fetch for URLs with SERP data
- Fetch Method Caching: Remember which domains need ZenRows
- Domain Batch Processing: Classify domain once, apply to all URLs
- Early Exit: Stop pipeline as soon as confidence threshold met
Queue Processing
Message Types
The URL classify queue handles multiple message types:
// Classify a single URL
{ type: "classify_url", url_id, url, domain, domain_rank }
// Classify a backlink (source + target URL)
{ type: "classify_backlink", backlink_id, source_url_id, source_url, target_url }
// Batch classify all URLs from a domain
{ type: "classify_domain_batch", domain, domain_rank, urls: [...] }
Parallel Processing
Messages are processed in parallel for throughput:
const processingPromises = reorderedMessages.map(async (message) => {
// Process each message
});
const processingResults = await Promise.allSettled(processingPromises);
Domain-Diverse Ordering
Messages are reordered to maximize domain diversity:
function reorderForDomainDiversity(messages) {
// Instead of: [a.com, a.com, a.com, b.com, b.com, c.com]
// We process: [a.com, b.com, c.com, a.com, b.com, a.com]
// This spreads load and reduces per-domain throttle contention
}
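One way to implement this is a round-robin interleave over per-domain groups (a sketch; the production function and message shape may differ):

```javascript
// Sketch: group messages by domain, then emit one message per domain per round.
// [a, a, a, b, b, c] -> [a, b, c, a, b, a]
function reorderForDomainDiversity(messages, getDomain = (m) => m.domain) {
  const byDomain = new Map();
  for (const msg of messages) {
    const d = getDomain(msg);
    if (!byDomain.has(d)) byDomain.set(d, []);
    byDomain.get(d).push(msg);
  }
  const groups = [...byDomain.values()];
  const out = [];
  for (let round = 0; out.length < messages.length; round++) {
    for (const group of groups) {
      if (round < group.length) out.push(group[round]);
    }
  }
  return out;
}
```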
Retry Logic
Failed messages use exponential backoff:
if (shouldRetry && attempts < MAX_QUEUE_RETRIES) {
message.retry({
delaySeconds: Math.pow(2, attempts) * 5, // 5s, 10s, 20s
});
}
Files Reference
| File | Purpose |
|---|---|
url-classifier.js | Main orchestrator - runs all 5 stages |
classifier-rules-engine.js | Stage 1 - Pattern matching |
classifier-vectorize.js | Stage 2 - ML similarity |
classifier-content-parser.js | Stage 3 - Content fetch & signals |
classifier-llm.js | Stage 4 - LLM classification |
classification-config.js | Thresholds and configuration |
classification-constants.js | Enums and valid values |
url-classify-consumer.js | Queue consumer |