Vectorize ML System

RankDisco uses Cloudflare Vectorize for similarity-based classification as Stage 2 of the classification pipeline. This document covers the complete Vectorize ML system, including index architecture, embedding generation, similarity search, and self-learning capabilities.


Vectorize Overview

What is Cloudflare Vectorize?

Cloudflare Vectorize is a globally distributed vector database that enables semantic similarity search at the edge. It stores high-dimensional vectors (embeddings) and performs low-latency nearest-neighbor lookups (roughly 10ms per query in this pipeline; see Performance).

How RankDisco Uses Vectorize

RankDisco leverages Vectorize as a self-improving classification system:

  1. Seed Examples: Manually curated labeled examples bootstrap the system
  2. Similarity Search: New URLs/domains/keywords are compared against known examples
  3. Voting: Top-k similar matches "vote" on classification dimensions
  4. Self-Learning: High-confidence results are fed back to improve future classifications
┌─────────────────────────────────────────────────────────────────────────────┐
│ VECTORIZE CLASSIFICATION FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ ┌──────────────────────────┐ │
│ │ INPUT │ │ WORKERS AI │ │ VECTORIZE │ │
│ │ │ │ │ │ │ │
│ │ URL │───▶│ BGE-BASE-EN │───▶│ Query similar vectors │ │
│ │ + metadata │ │ embedding │ │ topK=5, min_score=0.65 │ │
│ │ │ │ (768 dims) │ │ │ │
│ └──────────────┘ └─────────────────┘ └──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ VOTING SYSTEM │ │
│ │ │ │
│ │ Match 1: score=0.92 → domain_type:X │ │
│ │ Match 2: score=0.88 → domain_type:X │ │
│ │ Match 3: score=0.85 → domain_type:Y │ │
│ │ Match 4: score=0.82 → domain_type:X │ │
│ │ Match 5: score=0.78 → domain_type:X │ │
│ │ │ │
│ │ Vote Result: X (weighted 3.40 vs 0.85) │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ │ │ │
│ │ classification: { │ │
│ │ domain_type: { value: "news_publisher", confidence: 82 }, │ │
│ │ channel_bucket: { value: "pr_earned_media", confidence: 78 }, │ │
│ │ page_type: { value: "news_article", confidence: 75 } │ │
│ │ } │ │
│ │ similar_urls: [...top 5 matches with scores] │ │
│ │ needs_llm: false (if confidence >= 70 on all dimensions) │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Index Architecture

RankDisco maintains five Vectorize indexes, each exposed to the Worker through its own binding, for different classification tasks:

Vectorize Index Configuration

| Binding | Index Name | Purpose | Dimensions |
|---|---|---|---|
| VECTORIZE | rankfabric | General purpose vectors | 768 |
| CATEGORY_VECTORS | rankfabric-category-embeddings | Category classification | 768 |
| VECTORIZE_BACKLINK | backlink-classifier | URL/Backlink classification | 768 |
| VECTORIZE_DOMAINS | domain-classifier | Domain-level classification | 768 |
| VECTORIZE_KEYWORDS | keyword-classifier | Keyword intent classification | 768 |

wrangler.toml Configuration

# Vectorize bindings
[[vectorize]]
binding = "VECTORIZE"
index_name = "rankfabric"

[[vectorize]]
binding = "CATEGORY_VECTORS"
index_name = "rankfabric-category-embeddings"

[[vectorize]]
binding = "VECTORIZE_BACKLINK"
index_name = "backlink-classifier"

[[vectorize]]
binding = "VECTORIZE_DOMAINS"
index_name = "domain-classifier"

[[vectorize]]
binding = "VECTORIZE_KEYWORDS"
index_name = "keyword-classifier"

Index Structure

Each vector in the index contains:

{
  id: "unique_identifier", // URL/domain/keyword hash (max 64 chars)
  values: [0.123, -0.456, ...], // 768-dimensional embedding
  metadata: {
    url: "https://example.com/page",
    domain: "example.com",
    classifications: {
      domain_type: "news_publisher",
      channel_bucket: "pr_earned_media",
      page_type: "news_article"
    },
    source: "seed" | "llm" | "llm_auto" | "manual", // one of these values
    indexed_at: 1704067200000
  }
}
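
As a rough illustration of the record shape above, the following sketch assembles a record and checks the two constraints the structure states (768 values, an ID of at most 64 characters). The helper name buildVectorRecord and its error messages are hypothetical and not part of the RankDisco codebase.

```javascript
// Hypothetical helper: assembles a record in the shape shown above and
// validates the constraints stated in this document.
const EMBEDDING_DIMENSIONS = 768;
const MAX_ID_LENGTH = 64;

function buildVectorRecord(id, values, metadata) {
  if (typeof id !== "string" || id.length === 0 || id.length > MAX_ID_LENGTH) {
    throw new Error(`id must be 1-${MAX_ID_LENGTH} characters`);
  }
  if (!Array.isArray(values) || values.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`values must contain ${EMBEDDING_DIMENSIONS} numbers`);
  }
  // Stamp the record with the time it was built
  return { id, values, metadata: { ...metadata, indexed_at: Date.now() } };
}

const record = buildVectorRecord(
  "https___example_com_page",
  new Array(EMBEDDING_DIMENSIONS).fill(0), // placeholder embedding
  {
    url: "https://example.com/page",
    domain: "example.com",
    classifications: { domain_type: "news_publisher", page_type: "news_article" },
    source: "manual",
  }
);
console.log(record.id, record.values.length, record.metadata.source);
```

Validating before the upsert surfaces a wrong-sized embedding or an over-long ID as a local error rather than as a failed write.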

Embedding Generation

Workers AI Model

RankDisco uses the BGE-BASE-EN-V1.5 model from Workers AI for generating embeddings:

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";

async function getEmbedding(text, env) {
  const response = await env.AI.run(EMBEDDING_MODEL, {
    text: [text],
  });
  return response.data[0]; // 768-dimensional vector
}

| Property | Value |
|---|---|
| Model | @cf/baai/bge-base-en-v1.5 |
| Dimensions | 768 |
| Cost | ~$0.00001 per embedding |
| Latency | ~50ms |

Feature Text Generation

The key to effective similarity search is converting structured data into meaningful text. The urlToFeatureText() function builds a rich text representation from the signals below. The listing is abridged: the code that derives each variable from the input is omitted.

export function urlToFeatureText(input) {
  const parts = [];

  // Domain info
  parts.push(`domain: ${normalizedDomain}`);
  parts.push(`tld: ${tld}`);

  // URL path components (meaningful segments)
  parts.push(`path: ${pathParts.join(" ")}`);

  // Page title (HUGE signal for page type)
  // Prefers SERP title over HTML title
  parts.push(`title: ${effectiveTitle}`);

  // SERP description from Google
  parts.push(`snippet: ${serpDescription}`);

  // Anchor text context
  parts.push(`anchor: ${anchorText}`);

  // Platform type from DataForSEO
  parts.push(`platform: ${platformType}`);

  // Domain authority bucket (one of tier1 through tier5)
  parts.push(`authority: tier1|tier2|tier3|tier4|tier5`);

  // Partial classification from rules engine
  parts.push(`type: ${domainType}`);
  parts.push(`channel: ${channelBucket}`);
  parts.push(`page: ${pageType}`);

  return parts.join(" | ");
}

Example output:

domain: techcrunch.com | tld: com | path: 2024 01 15 startup-raises-series-b |
title: startup xyz raises $50m series b | snippet: tech news startup funding |
platform: news | authority: tier1 | type: news_publisher | channel: pr_earned_media
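
To make the feature-text idea concrete, here is a minimal runnable sketch of how such a string could be assembled. The input field names (serp_title, html_title, serp_description, anchor_text, platform_type) are assumptions for illustration, not RankDisco's actual schema. The authority bucket is left out because the rank cutoffs are not given in this document.

```javascript
// Sketch only: input field names are assumptions, and the authority bucket
// is omitted because its cutoffs are not documented here.
function urlToFeatureTextSketch(input) {
  const parts = [];
  const parsed = new URL(input.url);

  // Domain info: lowercase host without a leading "www."
  const normalizedDomain = (input.domain || parsed.hostname)
    .toLowerCase()
    .replace(/^www\./, "");
  parts.push(`domain: ${normalizedDomain}`);
  parts.push(`tld: ${normalizedDomain.split(".").pop()}`);

  // Non-empty path segments
  const pathParts = parsed.pathname.split("/").filter(Boolean);
  if (pathParts.length > 0) parts.push(`path: ${pathParts.join(" ")}`);

  // Optional signals are emitted only when present; SERP title wins over HTML title
  const effectiveTitle = input.serp_title || input.html_title;
  if (effectiveTitle) parts.push(`title: ${effectiveTitle.toLowerCase()}`);
  if (input.serp_description) parts.push(`snippet: ${input.serp_description.toLowerCase()}`);
  if (input.anchor_text) parts.push(`anchor: ${input.anchor_text.toLowerCase()}`);
  if (input.platform_type) parts.push(`platform: ${input.platform_type}`);

  // Partial classification from an earlier stage, if any
  const partial = input.partial_classification || {};
  if (partial.domain_type) parts.push(`type: ${partial.domain_type}`);
  if (partial.channel_bucket) parts.push(`channel: ${partial.channel_bucket}`);
  if (partial.page_type) parts.push(`page: ${partial.page_type}`);

  return parts.join(" | ");
}

const text = urlToFeatureTextSketch({
  url: "https://techcrunch.com/2024/01/15/startup-raises-series-b/",
  serp_title: "Startup XYZ raises $50M Series B",
  platform_type: "news",
  partial_classification: { domain_type: "news_publisher", channel_bucket: "pr_earned_media" },
});
console.log(text);
```

Skipping absent signals, rather than emitting empty fields, keeps fixed labels such as "anchor:" from adding noise to the embedding when no value is available.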

Query Process

async function querySimilar(embedding, env, topK = 5) {
  const index = env.VECTORIZE_BACKLINK;

  const results = await index.query(embedding, {
    topK,
    returnMetadata: true,
  });

  return results.matches || [];
}
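
The flow diagram lists min_score=0.65 next to topK=5, while the query call above passes only topK. One way to apply the minimum is to filter the returned matches afterwards, as in the sketch below. The helper name filterMatches is hypothetical, and whether RankDisco filters at this point or later in the voting step is an assumption.

```javascript
const MIN_SIMILARITY_FOR_ANY = 0.65; // absolute minimum listed under Threshold Tuning

// Hypothetical helper: keep only matches at or above the minimum score,
// ordered by descending similarity.
function filterMatches(matches, minScore = MIN_SIMILARITY_FOR_ANY) {
  return matches
    .filter((m) => typeof m.score === "number" && m.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

const sample = [
  { id: "a", score: 0.92 },
  { id: "b", score: 0.61 },
  { id: "c", score: 0.78 },
];
const kept = filterMatches(sample);
console.log(kept.map((m) => m.id));
```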

Voting Algorithm

When multiple similar vectors are found, they "vote" on each classification dimension:

function voteOnClassifications(matches) {
  const fields = ["domain_type", "channel_bucket", "page_type"];
  const votes = {};

  for (const field of fields) {
    const fieldVotes = {};

    for (const match of matches) {
      const value = match.metadata?.classifications?.[field];
      if (value) {
        // Weight by similarity score
        fieldVotes[value] = (fieldVotes[value] || 0) + match.score;
      }
    }

    // Find winner with highest weighted score
    let winner = null;
    let maxScore = 0;
    for (const [value, score] of Object.entries(fieldVotes)) {
      if (score > maxScore) {
        maxScore = score;
        winner = value;
      }
    }

    if (winner) {
      votes[field] = {
        value: winner,
        score: maxScore,
        totalVotes: Object.keys(fieldVotes).length, // number of distinct values voted for
      };
    }
  }

  return votes;
}
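
As a worked example, the sketch below restates the weighted tally for a single field and applies it to the five matches from the flow diagram. The helper name tallyField is hypothetical. It is a compact restatement for illustration, not the project's function.

```javascript
// Compact restatement of the weighted tally for one field.
function tallyField(matches, field) {
  const totals = {};
  for (const match of matches) {
    const value = match.metadata?.classifications?.[field];
    if (value) totals[value] = (totals[value] || 0) + match.score;
  }
  return totals;
}

// Scores from the diagram; the third match votes Y, the rest vote X.
const matches = [0.92, 0.88, 0.85, 0.82, 0.78].map((score, i) => ({
  score,
  metadata: { classifications: { domain_type: i === 2 ? "Y" : "X" } },
}));

const totals = tallyField(matches, "domain_type");
console.log(totals.X.toFixed(2), totals.Y.toFixed(2)); // 3.40 0.85
```

X wins because 0.92 + 0.88 + 0.82 + 0.78 = 3.40 outweighs the single 0.85 vote for Y, which matches the "weighted 3.40 vs 0.85" result in the diagram.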

Vote Weighting

| Factor | Weight | Description |
|---|---|---|
| Similarity Score | Direct | Higher similarity = stronger vote |
| Agreement | Bonus | Fewer alternatives = higher confidence |
| Match Count | Threshold | Need 2+ matches above threshold |

Threshold Tuning

Different classification dimensions require different minimum similarity scores:

Dimension-Specific Thresholds

| Dimension | Min Threshold | Rationale |
|---|---|---|
| domain_type | 0.75 | Domain classification is broad, moderate similarity needed |
| page_type | 0.80 | Page type is URL-specific, needs high similarity |
| tactic_type | 0.70 | Marketing tactics can be broadly similar |
| Absolute minimum | 0.65 | Below this, similarity is too low to trust |

const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75;
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80;
const MIN_SIMILARITY_FOR_TACTIC = 0.70;
const MIN_SIMILARITY_FOR_ANY = 0.65;

Confidence Calculation

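Confidence is calculated per dimension from two factors: how far the best similarity score sits above the dimension's threshold, and how few competing values received votes. In the listing, bestScore is not a parameter; it is read from the enclosing classifier scope.

const calcDimensionConfidence = (vote, minSimilarity) => {
  // bestScore comes from the enclosing scope
  if (!vote || bestScore < minSimilarity) return 0;

  // Normalize similarity above threshold to 0-1
  const similarityFactor = (bestScore - minSimilarity) / (1 - minSimilarity);

  // Fewer alternatives = higher confidence
  const agreementFactor = 1 / vote.totalVotes;

  // Base 50% + up to 30% from similarity + up to 20% from agreement
  return Math.round((0.5 + 0.3 * similarityFactor + 0.2 * agreementFactor) * 100);
};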
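
To check the arithmetic by hand, here is a standalone variant of the formula. The listing above references bestScore without defining it locally, so this sketch passes it as an explicit parameter. That signature and the name dimensionConfidence are assumptions for illustration only.

```javascript
// Standalone variant for experimentation: bestScore is an explicit
// parameter here (an assumption made so the function runs on its own).
function dimensionConfidence(vote, bestScore, minSimilarity) {
  if (!vote || bestScore < minSimilarity) return 0;
  const similarityFactor = (bestScore - minSimilarity) / (1 - minSimilarity);
  const agreementFactor = 1 / vote.totalVotes;
  return Math.round((0.5 + 0.3 * similarityFactor + 0.2 * agreementFactor) * 100);
}

// Two competing values, best match 0.92, domain_type threshold 0.75:
// similarityFactor = 0.17 / 0.25 = 0.68, agreementFactor = 1 / 2 = 0.5
// confidence = (0.5 + 0.204 + 0.1) * 100 = 80.4, rounded to 80
console.log(dimensionConfidence({ totalVotes: 2 }, 0.92, 0.75)); // 80

// A unanimous vote with a perfect match reaches the maximum
console.log(dimensionConfidence({ totalVotes: 1 }, 1.0, 0.75)); // 100

// A best match below the threshold yields no confidence
console.log(dimensionConfidence({ totalVotes: 1 }, 0.7, 0.75)); // 0
```

With two competing values the agreement term is capped at 10 points, so the result cannot exceed 90 however close the best match is.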

Escalation Thresholds

| Confidence | Action |
|---|---|
| >= 85% | Store, skip LLM, trigger self-learning |
| 70-84% | Store, may skip LLM for non-critical dimensions |
| 50-69% | Use Vectorize result but needs LLM verification |
| 30-49% | Low trust, LLM required |
| < 30% | Vectorize result ignored, full LLM classification |
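
The bands above translate directly into a lookup. The sketch below is one way to express it. The function name and the action labels are hypothetical placeholders, not identifiers from the RankDisco codebase.

```javascript
// Hypothetical mapping from a 0-100 confidence value to the escalation bands.
function escalationAction(confidence) {
  if (confidence >= 85) return "store_skip_llm_and_learn";
  if (confidence >= 70) return "store_may_skip_llm";
  if (confidence >= 50) return "use_with_llm_verification";
  if (confidence >= 30) return "llm_required";
  return "ignore_full_llm";
}

for (const c of [92, 82, 60, 35, 18]) {
  console.log(c, escalationAction(c));
}
```

Checking the bands from highest to lowest means each branch needs only a lower bound.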

Self-Learning

RankDisco implements a positive feedback loop on top of Vectorize: high-confidence classifications are written back to the index, where they reinforce future similarity matches.

Learning Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│ SELF-LEARNING LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Stage 1 │ │ Stage 2 │ │ Stage 4 │ │
│ │ Rules │──────▶│ Vectorize │──────▶│ LLM │ │
│ │ Engine │ │ Similarity│ │ Fallback │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────────────────────┐ │
│ │ HIGH CONFIDENCE? │ │
│ │ (confidence >= 85%) │ │
│ └───────────────────────────────────────┘ │
│ │ │
│ │ YES │
│ ▼ │
│ ┌───────────────────────────────────────┐ │
│ │ VECTORIZE FEEDBACK │ │
│ │ │ │
│ │ 1. Generate feature text │ │
│ │ 2. Create embedding │ │
│ │ 3. Upsert to Vectorize index │ │
│ │ 4. Future queries benefit │ │
│ └───────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Learning Sources

| Source | Confidence Requirement | Notes |
|---|---|---|
| seed | N/A (manual) | Initial curated examples |
| llm_auto | >= 85% | Auto-learned from LLM |
| vectorize_reinforcement | >= 85% | Self-reinforcement from Vectorize |
| manual | N/A | Human correction/annotation |

Learning Implementation

// Self-learning trigger (classifier-vectorize.js)
if (result.final_confidence >= 85 && !result.vectorize_feedback) {
  result.vectorize_feedback = {
    url: input.url,
    domain: input.domain,
    classifications: {
      tier1_type: getValue(result.tier1_type),
      domain_type: getValue(result.domain_type),
      page_type: getValue(result.page_type),
    },
    source: "vectorize_reinforcement",
    confidence: result.final_confidence,
  };
}

// Add to Vectorize (classifier-llm.js after high-confidence LLM result)
if (vectorizeFeedback && env.VECTORIZE_BACKLINK && result.confidence >= 70) {
  await addClassifiedUrl(
    { url, domain: extractDomainFromUrl(url) },
    vectorizeFeedback,
    env,
    "llm_auto"
  );
}

Learning Rate Control

Configured in classification-config.js:

export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80; // 80% for domains
export const URL_LEARNING_MIN_CONFIDENCE = 85; // 85% for URLs

export function shouldTriggerLearning(confidence, type = "url") {
  const threshold = type === "domain"
    ? DOMAIN_LEARNING_MIN_CONFIDENCE
    : URL_LEARNING_MIN_CONFIDENCE;
  return confidence >= threshold;
}

Index Management

Adding Vectors

// Add a single classified URL
export async function addClassifiedUrl(input, classification, env, source = "llm") {
  const { url, domain, domain_rank } = input;

  // Generate feature text
  const featureText = urlToFeatureText({
    url,
    domain,
    domain_rank,
    partial_classification: classification,
  });

  // Get embedding
  const embedding = await getEmbedding(featureText, env);

  // Create unique ID (max 64 chars)
  const id = url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);

  // Upsert to Vectorize
  await upsertVector(
    id,
    embedding,
    {
      url,
      domain,
      classifications: classification,
      source,
      indexed_at: Date.now(),
    },
    env
  );

  return { success: true, id };
}
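
The ID above is the sanitized URL cut to 64 characters, so two long URLs that share their first 64 characters map to the same ID, and the later upsert overwrites the earlier vector. The sketch below shows one possible mitigation: a shorter readable prefix plus a hash of the full URL. This is not RankDisco's method. The helper names are hypothetical, and the hash is an FNV-1a style function computed over UTF-16 code units, chosen only because it is short and synchronous.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units, used only as a short,
// deterministic suffix (not a cryptographic hash).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical alternative to plain truncation: readable prefix + hash suffix.
function urlToVectorId(url) {
  const readable = url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 55);
  return `${readable}_${fnv1a(url)}`; // at most 55 + 1 + 8 = 64 characters
}

const base = "https://example.com/blog/2024/01/15/a-very-long-article-slug-that-exceeds-the-limit-";
const idA = urlToVectorId(base + "a");
const idB = urlToVectorId(base + "b");
console.log(idA.length, idB.length, idA !== idB);
```

A 32-bit suffix makes collisions unlikely rather than impossible. A longer digest would lower the odds further at the cost of a shorter readable prefix.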

Initializing Seed Examples

export async function initializeSeedExamples(env) {
  const results = {
    total: SEED_EXAMPLES.length,
    indexed: 0,
    errors: [],
  };

  for (const example of SEED_EXAMPLES) {
    try {
      const featureText = seedExampleToFeatureText(example);
      const embedding = await getEmbedding(featureText, env);
      const id = example.url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);

      await upsertVector(
        id,
        embedding,
        {
          url: example.url,
          domain: example.domain,
          classifications: example.classifications,
          source: "seed",
        },
        env
      );

      results.indexed++;
    } catch (err) {
      // Record the failure and continue with the remaining examples
      results.errors.push({ url: example.url, error: err.message });
    }
  }

  return results;
}

Bulk Loading from Database

// Load high-confidence domains from D1 into Vectorize
POST /api/admin/classifier/bulk-load-vectorize
Query params:
- min_confidence: Minimum tier1_confidence (default: 0.85)
- limit: Max domains to load (default: 1000)
- offset: Pagination offset (default: 0)
- tier1_type: Filter by specific tier1_type (optional)

Getting Index Stats

export async function getIndexStats(env) {
  const index = env.VECTORIZE_BACKLINK;

  const info = await index.describe();
  return {
    dimensions: info.dimensions, // 768
    count: info.vectorCount, // number of vectors
  };
}

Admin API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/admin/classifier/init-vectorize | POST | Initialize with seed examples |
| /api/admin/classifier/bulk-load-vectorize | POST | Load classified domains from DB |
| /api/admin/classifier/stats | GET | Get index stats |
| /api/admin/classifier/learn | POST | Manually add classification to index |
| /api/admin/classifier/seed-keyword-vectorize | POST | Initialize keyword index |
| /api/admin/classifier/keyword-vectorize-stats | GET | Keyword index stats |

Performance

Query Latency

| Operation | Latency | Notes |
|---|---|---|
| Embedding generation | ~50ms | Workers AI |
| Vectorize query | ~10ms | Global edge |
| Total Stage 2 | ~60-70ms | End-to-end |

Batch Operations

Vectorize supports batch operations for efficiency:

// Batch insert (up to 100 vectors)
await index.insert([
{ id: "id1", values: [...], metadata: {...} },
{ id: "id2", values: [...], metadata: {...} },
// ...
]);

// Batch upsert
await index.upsert([...]);

// Batch delete
await index.deleteByIds(["id1", "id2", ...]);
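
When more vectors need writing than fit in one call, the input can be split into batches first. The sketch below assumes the batch size of 100 stated above. The helper names chunk and upsertInBatches are hypothetical, and index stands for any object that exposes upsert(vectors), such as a Vectorize binding.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical wrapper: upserts sequentially, one batch at a time, and
// returns the number of vectors written.
async function upsertInBatches(index, vectors, batchSize = 100) {
  let written = 0;
  for (const batch of chunk(vectors, batchSize)) {
    await index.upsert(batch);
    written += batch.length;
  }
  return written;
}

const sizes = chunk(new Array(250).fill(null), 100).map((b) => b.length);
console.log(sizes);
```

Awaiting each batch in turn keeps the number of in-flight requests at one, which is slower than parallel writes but easier to reason about when a batch fails.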

Cost Summary

| Operation | Cost | Notes |
|---|---|---|
| Workers AI embedding | $0.00001 | Per query |
| Vectorize query | Included | Free with Workers |
| Vectorize storage | $0.05/GB-month | ~1M vectors, ~500MB |

Total cost per URL classification via Vectorize: ~$0.00001


Seed Examples

The system is bootstrapped with 150+ manually curated seed examples covering:

Coverage by Domain Type

| Domain Type | Examples |
|---|---|
| news_publisher | NYT, TechCrunch, Reuters, Bloomberg |
| saas_product | Stripe, Notion, Airtable, Figma |
| blog_publisher | Medium, Substack, newsletters |
| ugc_forum_community | Reddit, Quora, Stack Overflow |
| affiliate_review_site | Wirecutter, G2, Capterra |
| directory_citation | Yelp, Crunchbase, Yellow Pages |
| app_platform | App Store, Google Play |
| government_site | SBA.gov, state sites |
| education_academic | Stanford, MIT, Coursera |
| pbn_suspected | Spam patterns, suspicious TLDs |

Seed Example Structure

{
  url: "https://techcrunch.com/2024/01/15/startup-raises-50m-series-b/",
  domain: "techcrunch.com",
  classifications: {
    domain_type: "news_publisher",
    channel_bucket: "pr_earned_media",
    page_type: "news_article",
    quality_tier: "tier_2",
  },
}

Adding Seed Examples

Edit src/data/vectorize-seed-examples.js:

export const SEED_EXAMPLES = [
  // ... existing examples

  // Add new example
  {
    url: "https://newsite.com/page",
    domain: "newsite.com",
    classifications: {
      domain_type: DOMAIN_TYPES.YOUR_TYPE,
      channel_bucket: CHANNEL_BUCKETS.YOUR_CHANNEL,
      page_type: PAGE_TYPES.YOUR_PAGE_TYPE,
      quality_tier: QUALITY_TIERS.TIER_N,
    },
  },
];

Then reinitialize:

# Via admin API
curl -X POST https://your-worker.workers.dev/api/admin/classifier/init-vectorize

Keyword Vectorize

A parallel system exists for keyword classification using the same architecture:

Feature Text Generation

export function keywordToFeatureText(keywordData, partialClassification = {}) {
  // Abridged: derivation of each variable from the arguments is omitted
  const parts = [];

  // Keyword itself is primary signal
  parts.push(`keyword: ${keyword}`);

  // Length signals query specificity (one of short, medium, long)
  parts.push(`length: short|medium|long`);

  // DataForSEO intent
  parts.push(`intent: ${searchIntentInfo.main_intent}`);

  // Volume bucket (one of very_high through very_low)
  parts.push(`volume: very_high|high|medium|low|very_low`);

  // CPC signals commercial value (one of very_high through low)
  parts.push(`cpc: very_high|high|medium|low`);

  // Partial classification from rules
  parts.push(`classified_intent: ${intentType}`);
  parts.push(`journey: ${journeyMoment}`);
  parts.push(`pattern: ${keywordPattern}`);

  return parts.join(" | ");
}
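
A minimal runnable sketch of the same idea follows. The word-count cutoffs for the length bucket and the input field names (keyword, main_intent, journey_moment) are assumptions made for illustration. The real thresholds and schema are not given in this document, and the volume and CPC buckets are omitted for the same reason.

```javascript
// Sketch only: the word-count cutoffs below are assumed, not RankDisco's.
function lengthBucket(keyword) {
  const words = keyword.trim().split(/\s+/).filter(Boolean).length;
  if (words <= 2) return "short";
  if (words <= 4) return "medium";
  return "long";
}

// Sketch only: field names on keywordData and partialClassification are assumed.
function keywordToFeatureTextSketch(keywordData, partialClassification = {}) {
  const keyword = keywordData.keyword.trim().toLowerCase();
  const parts = [`keyword: ${keyword}`, `length: ${lengthBucket(keyword)}`];

  // Optional signals are emitted only when present
  if (keywordData.main_intent) parts.push(`intent: ${keywordData.main_intent}`);
  if (partialClassification.journey_moment) {
    parts.push(`journey: ${partialClassification.journey_moment}`);
  }
  return parts.join(" | ");
}

const keywordText = keywordToFeatureTextSketch(
  { keyword: "Best CRM for small business", main_intent: "commercial" },
  { journey_moment: "consideration" }
);
console.log(keywordText);
```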

Keyword Classification Dimensions

| Dimension | Description |
|---|---|
| expertise_level | beginner, intermediate, expert |
| buyer_behavior | researcher, evaluator, ready_to_buy |
| role_context | developer, marketer, executive |
| journey_moment | awareness, consideration, decision |
| topic_entity_type | brand, product, concept |
| use_case_type | troubleshooting, learning, comparison |

Special Handling

Homepage Validation

The system validates that only root URLs can be classified as "homepage":

if (votedPageType === "homepage") {
  const isRootUrl = /^https?:\/\/[^/]+\/?$/.test(url);
  if (!isRootUrl) {
    // Reject homepage classification for non-root URLs
    console.log(`[vectorize] Rejected homepage for non-root: ${url}`);
  }
}
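
The snippet above shows the check and the log line but not what happens to the rejected vote. The sketch below wraps the same regular expression in a helper that returns null when a "homepage" vote is rejected. The helper name validatePageType and the use of null are assumptions, not the project's behavior.

```javascript
// Root-URL test taken from the snippet above
const ROOT_URL_PATTERN = /^https?:\/\/[^/]+\/?$/;

// Hypothetical helper: returns the page type to keep, or null when a
// "homepage" vote is rejected for a non-root URL.
function validatePageType(votedPageType, url) {
  if (votedPageType === "homepage" && !ROOT_URL_PATTERN.test(url)) return null;
  return votedPageType;
}

console.log(validatePageType("homepage", "https://example.com/"));
console.log(validatePageType("homepage", "https://example.com/pricing"));
console.log(validatePageType("news_article", "https://example.com/2024/post"));
```

Note that the pattern treats a root URL with a query string directly after the host (for example https://example.com?ref=x) as a root URL, while a query string after the trailing slash is not matched.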

Low Similarity Fallback

When similarity is below the absolute minimum threshold:

if (bestScore < MIN_SIMILARITY_FOR_ANY) {
  result.final_confidence = Math.round(bestScore * 30); // Low confidence
  result.needs_content_parse = true; // Try content parsing
  result.needs_llm = true; // Fall back to LLM
  result.low_similarity = true;
  return result;
}

Files Reference

| File | Purpose |
|---|---|
| src/lib/classification/classifier-vectorize.js | URL/backlink Vectorize classifier |
| src/lib/keywords/keyword-classifier-vectorize.js | Keyword Vectorize classifier |
| src/lib/classification/domain-classifier.js | Domain classifier (uses VECTORIZE_DOMAINS) |
| src/lib/classification/embeddings.js | General embedding helper |
| src/data/vectorize-seed-examples.js | Curated seed examples for URLs |
| scripts/seed-keyword-vectorize.js | Keyword seed examples |
| src/endpoints/admin/admin-classifier.js | Admin API endpoints |

Monitoring & Debugging

Check Index Health

# Get stats
curl https://your-worker.workers.dev/api/admin/classifier/stats

# Response:
{
"seed_examples": {
"total": 157,
"byDomainType": {...},
"byChannel": {...}
},
"unique_domains": 89,
"vectorize_index": {
"dimensions": 768,
"count": 2500
}
}

Debug Classification

# Classify single URL with full trace
curl -X POST https://your-worker.workers.dev/api/admin/classifier/classify \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/page"}'

# Response includes:
{
"similar_urls": [
{ "url": "...", "score": 0.92, "classifications": {...} },
...
],
"classification_source": "rules_vectorize",
"final_confidence": 82
}

Common Issues

| Issue | Cause | Solution |
|---|---|---|
| Low confidence scores | Not enough seed examples | Add more relevant seeds |
| Wrong classifications | Misleading feature text | Improve urlToFeatureText() |
| Slow queries | Large index | Use metadata filters |
| Missing binding | wrangler.toml misconfigured | Check [[vectorize]] blocks |
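
For the missing-binding case, a small check at startup can name the absent bindings before any query fails. The sketch below uses the five binding names from the wrangler.toml configuration. The helper name findMissingBindings is hypothetical.

```javascript
// Binding names from the wrangler.toml configuration in this document
const REQUIRED_BINDINGS = [
  "VECTORIZE",
  "CATEGORY_VECTORS",
  "VECTORIZE_BACKLINK",
  "VECTORIZE_DOMAINS",
  "VECTORIZE_KEYWORDS",
];

// Hypothetical startup check: returns the names of bindings absent from env.
function findMissingBindings(env, required = REQUIRED_BINDINGS) {
  return required.filter((name) => env[name] == null);
}

// Example with a partially configured environment (placeholder objects)
const missing = findMissingBindings({ VECTORIZE: {}, VECTORIZE_BACKLINK: {} });
console.log(missing);
```

An empty result means every expected binding is present. Any names returned point to the [[vectorize]] blocks to review.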