Vectorize ML System
RankDisco uses Cloudflare Vectorize for similarity-based classification as Stage 2 of the classification pipeline. This document covers the complete Vectorize ML system, including index architecture, embedding generation, similarity search, and self-learning capabilities.
Vectorize Overview
What is Cloudflare Vectorize?
Cloudflare Vectorize is a globally distributed vector database that enables semantic similarity search at the edge. It stores high-dimensional vectors (embeddings) and performs nearest-neighbor lookups with low latency (roughly 10 ms per query in RankDisco's pipeline; see Performance).
How RankDisco Uses Vectorize
RankDisco leverages Vectorize as a self-improving classification system:
- Seed Examples: Manually curated labeled examples bootstrap the system
- Similarity Search: New URLs/domains/keywords are compared against known examples
- Voting: Top-k similar matches "vote" on classification dimensions
- Self-Learning: High-confidence results are fed back to improve future classifications
┌─────────────────────────────────────────────────────────────────────────────┐
│ VECTORIZE CLASSIFICATION FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ ┌──────────────────────────┐ │
│ │ INPUT │ │ WORKERS AI │ │ VECTORIZE │ │
│ │ │ │ │ │ │ │
│ │ URL │───▶│ BGE-BASE-EN │───▶│ Query similar vectors │ │
│ │ + metadata │ │ embedding │ │ topK=5, min_score=0.65 │ │
│ │ │ │ (768 dims) │ │ │ │
│ └──────────────┘ └─────────────────┘ └──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ VOTING SYSTEM │ │
│ │ │ │
│ │ Match 1: score=0.92 → domain_type:X │ │
│ │ Match 2: score=0.88 → domain_type:X │ │
│ │ Match 3: score=0.85 → domain_type:Y │ │
│ │ Match 4: score=0.82 → domain_type:X │ │
│ │ Match 5: score=0.78 → domain_type:X │ │
│ │ │ │
│ │ Vote Result: X (weighted 3.40 vs 0.85) │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ │ │ │
│ │ classification: { │ │
│ │ domain_type: { value: "news_publisher", confidence: 82 }, │ │
│ │ channel_bucket: { value: "pr_earned_media", confidence: 78 }, │ │
│ │ page_type: { value: "news_article", confidence: 75 } │ │
│ │ } │ │
│ │ similar_urls: [...top 5 matches with scores] │ │
│ │ needs_llm: false (if confidence >= 70 on all dimensions) │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Index Architecture
RankDisco maintains five Vectorize indexes, each exposed through its own binding and dedicated to a different classification task:
Vectorize Index Configuration
| Binding | Index Name | Purpose | Dimensions |
|---|---|---|---|
| VECTORIZE | rankfabric | General purpose vectors | 768 |
| CATEGORY_VECTORS | rankfabric-category-embeddings | Category classification | 768 |
| VECTORIZE_BACKLINK | backlink-classifier | URL/Backlink classification | 768 |
| VECTORIZE_DOMAINS | domain-classifier | Domain-level classification | 768 |
| VECTORIZE_KEYWORDS | keyword-classifier | Keyword intent classification | 768 |
wrangler.toml Configuration
# Vectorize bindings
[[vectorize]]
binding = "VECTORIZE"
index_name = "rankfabric"
[[vectorize]]
binding = "CATEGORY_VECTORS"
index_name = "rankfabric-category-embeddings"
[[vectorize]]
binding = "VECTORIZE_BACKLINK"
index_name = "backlink-classifier"
[[vectorize]]
binding = "VECTORIZE_DOMAINS"
index_name = "domain-classifier"
[[vectorize]]
binding = "VECTORIZE_KEYWORDS"
index_name = "keyword-classifier"
Index Structure
Each vector in the index contains:
{
  id: "unique_identifier",        // URL/domain/keyword hash (max 64 chars)
  values: [0.123, -0.456, ...],   // 768-dimensional embedding
  metadata: {
    url: "https://example.com/page",
    domain: "example.com",
    classifications: {
      domain_type: "news_publisher",
      channel_bucket: "pr_earned_media",
      page_type: "news_article"
    },
    source: "seed" | "llm" | "llm_auto" | "manual",
    indexed_at: 1704067200000
  }
}
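Because every index in this system uses 768 dimensions and IDs are capped at 64 characters, a cheap pre-flight check can catch malformed records before they reach Vectorize. The sketch below is illustrative only: `validateVectorRecord` is a hypothetical helper, not part of the codebase, and it measures the ID in UTF-8 bytes, which equals the character count for ASCII IDs.

```javascript
const EXPECTED_DIMENSIONS = 768;
const MAX_ID_BYTES = 64;

// Hypothetical pre-flight check; returns a list of problems (empty if valid).
function validateVectorRecord(record) {
  const problems = [];
  const idBytes = new TextEncoder().encode(String(record.id ?? "")).length;
  if (idBytes === 0 || idBytes > MAX_ID_BYTES) {
    problems.push(`id must be 1-${MAX_ID_BYTES} bytes, got ${idBytes}`);
  }
  if (!Array.isArray(record.values) || record.values.length !== EXPECTED_DIMENSIONS) {
    problems.push(`values must have ${EXPECTED_DIMENSIONS} dimensions`);
  }
  if (!record.metadata?.classifications) {
    problems.push("metadata.classifications is missing");
  }
  return problems;
}

// A record with an over-long ID, too few dimensions, and no classifications.
console.log(validateVectorRecord({ id: "x".repeat(70), values: [0.1, 0.2], metadata: {} }));
```

Returning a list of problems rather than throwing lets a bulk loader report every issue for a record in one pass.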
Embedding Generation
Workers AI Model
RankDisco uses the BGE-BASE-EN-V1.5 model from Workers AI for generating embeddings:
const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";
async function getEmbedding(text, env) {
  const response = await env.AI.run(EMBEDDING_MODEL, {
    text: [text],
  });
  return response.data[0]; // 768-dimensional vector
}
| Property | Value |
|---|---|
| Model | @cf/baai/bge-base-en-v1.5 |
| Dimensions | 768 |
| Cost | ~$0.00001 per embedding |
| Latency | ~50ms |
Feature Text Generation
The key to effective similarity search is converting structured data into meaningful text. The urlToFeatureText() function creates a rich text representation:
// Outline only: the derivation of each variable from `input` is omitted here.
export function urlToFeatureText(input) {
  const parts = [];
  // Domain info
  parts.push(`domain: ${normalizedDomain}`);
  parts.push(`tld: ${tld}`);
  // URL path components (meaningful segments)
  parts.push(`path: ${pathParts.join(" ")}`);
  // Page title (HUGE signal for page type)
  // Prefers SERP title over HTML title
  parts.push(`title: ${effectiveTitle}`);
  // SERP description from Google
  parts.push(`snippet: ${serpDescription}`);
  // Anchor text context
  parts.push(`anchor: ${anchorText}`);
  // Platform type from DataForSEO
  parts.push(`platform: ${platformType}`);
  // Domain authority bucket
  parts.push(`authority: tier1|tier2|tier3|tier4|tier5`);
  // Partial classification from rules engine
  parts.push(`type: ${domainType}`);
  parts.push(`channel: ${channelBucket}`);
  parts.push(`page: ${pageType}`);
  return parts.join(" | ");
}
Example output:
domain: techcrunch.com | tld: com | path: 2024 01 15 startup-raises-series-b |
title: startup xyz raises $50m series b | snippet: tech news startup funding |
platform: news | authority: tier1 | type: news_publisher | channel: pr_earned_media
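A minimal runnable sketch of the same idea follows. The input field names (`serpTitle`, `htmlTitle`, `serpDescription`, `anchorText`, `platformType`, `authorityTier`) and the function name are assumptions for illustration and may differ from the real implementation. It shows one plausible way to add each part only when the underlying signal is present, so that missing signals do not produce empty segments.

```javascript
// Sketch only: field names on `input` are assumptions, not the real schema.
function urlToFeatureTextSketch(input) {
  const parsed = new URL(input.url);
  const domain = parsed.hostname.replace(/^www\./, "").toLowerCase();
  const parts = [`domain: ${domain}`, `tld: ${domain.split(".").pop()}`];

  // Keep only non-empty path segments.
  const pathParts = parsed.pathname.split("/").filter(Boolean);
  if (pathParts.length > 0) parts.push(`path: ${pathParts.join(" ")}`);

  // Prefer the SERP title over the HTML title.
  const title = input.serpTitle || input.htmlTitle;
  if (title) parts.push(`title: ${title.toLowerCase()}`);

  // Optional signals: skipped entirely when missing.
  const optional = [
    ["snippet", input.serpDescription],
    ["anchor", input.anchorText],
    ["platform", input.platformType],
    ["authority", input.authorityTier],
  ];
  for (const [label, value] of optional) {
    if (value) parts.push(`${label}: ${value}`);
  }
  return parts.join(" | ");
}

console.log(
  urlToFeatureTextSketch({
    url: "https://techcrunch.com/2024/01/15/startup-raises-series-b",
    serpTitle: "Startup XYZ raises $50M Series B",
    platformType: "news",
    authorityTier: "tier1",
  })
);
```

Keeping the segment labels (`domain:`, `title:`, and so on) stable matters more than their exact wording, since vectors already in the index were embedded with the same labels.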
Similarity Search
Query Process
async function querySimilar(embedding, env, topK = 5) {
  const index = env.VECTORIZE_BACKLINK;
  const results = await index.query(embedding, {
    topK,
    returnMetadata: true,
  });
  return results.matches || [];
}
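The flow diagram lists `min_score=0.65` next to the query, while `querySimilar()` as shown returns every match it receives. One way to apply the floor is a small post-filter, sketched here. The helper name `filterByMinScore` is hypothetical, and the sample matches are hand-written in the shape this document uses for query results.

```javascript
// Hypothetical helper: drop matches below a minimum similarity score.
const ABSOLUTE_MIN_SIMILARITY = 0.65;

function filterByMinScore(matches, minScore = ABSOLUTE_MIN_SIMILARITY) {
  return matches.filter((match) => match.score >= minScore);
}

// Hand-written matches for illustration.
const sampleMatches = [
  { id: "a", score: 0.92, metadata: { classifications: { domain_type: "news_publisher" } } },
  { id: "b", score: 0.71, metadata: { classifications: { domain_type: "news_publisher" } } },
  { id: "c", score: 0.58, metadata: { classifications: { domain_type: "blog_publisher" } } },
];

console.log(filterByMinScore(sampleMatches).map((m) => m.id)); // [ 'a', 'b' ]
```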
Voting Algorithm
When multiple similar vectors are found, they "vote" on each classification dimension:
function voteOnClassifications(matches) {
  const fields = ["domain_type", "channel_bucket", "page_type"];
  const votes = {};
  for (const field of fields) {
    const fieldVotes = {};
    for (const match of matches) {
      const value = match.metadata?.classifications?.[field];
      if (value) {
        // Weight by similarity score
        fieldVotes[value] = (fieldVotes[value] || 0) + match.score;
      }
    }
    // Find winner with highest weighted score
    let winner = null;
    let maxScore = 0;
    for (const [value, score] of Object.entries(fieldVotes)) {
      if (score > maxScore) {
        maxScore = score;
        winner = value;
      }
    }
    if (winner) {
      votes[field] = {
        value: winner,
        score: maxScore,
        // Number of distinct candidate values that received votes
        totalVotes: Object.keys(fieldVotes).length,
      };
    }
  }
  return votes;
}
Vote Weighting
| Factor | Weight | Description |
|---|---|---|
| Similarity Score | Direct | Higher similarity = stronger vote |
| Agreement | Bonus | Fewer alternatives = higher confidence |
| Match Count | Threshold | Need 2+ matches above threshold |
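To make the weighting concrete, the sketch below replays the five matches from the flow diagram through a compact, single-field version of the voting logic. The function name `tallyField` is hypothetical; the scores and the labels X and Y are the illustrative ones from the diagram.

```javascript
// Compact, single-field version of the weighted vote (illustrative sketch).
function tallyField(matches, field) {
  const totals = {};
  for (const match of matches) {
    const value = match.metadata?.classifications?.[field];
    if (value) totals[value] = (totals[value] || 0) + match.score;
  }
  // Sort candidates by summed similarity, highest first.
  const ranked = Object.entries(totals).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return null;
  const [value, score] = ranked[0];
  return { value, score, totalVotes: ranked.length, totals };
}

const diagramMatches = [
  [0.92, "X"], [0.88, "X"], [0.85, "Y"], [0.82, "X"], [0.78, "X"],
].map(([score, label]) => ({
  score,
  metadata: { classifications: { domain_type: label } },
}));

const vote = tallyField(diagramMatches, "domain_type");
console.log(vote.value, vote.score.toFixed(2), vote.totals.Y.toFixed(2));
// X 3.40 0.85
```

Four matches agreeing on X contribute 0.92 + 0.88 + 0.82 + 0.78 = 3.40, which outweighs the single 0.85 vote for Y.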
Threshold Tuning
Different classification dimensions require different minimum similarity scores:
Dimension-Specific Thresholds
| Dimension | Min Threshold | Rationale |
|---|---|---|
| domain_type | 0.75 | Domain classification is broad, moderate similarity needed |
| page_type | 0.80 | Page type is URL-specific, needs high similarity |
| tactic_type | 0.70 | Marketing tactics can be broadly similar |
| Absolute minimum | 0.65 | Below this, similarity is too low to trust |
const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75;
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80;
const MIN_SIMILARITY_FOR_TACTIC = 0.70;
const MIN_SIMILARITY_FOR_ANY = 0.65;
Confidence Calculation
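Confidence is calculated per dimension from two inputs: how far the best match's similarity exceeds the dimension's threshold, and how many competing values received votes: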
// bestScore is assumed to come from the enclosing scope: the highest
// similarity score among the returned matches.
const calcDimensionConfidence = (vote, minSimilarity) => {
  if (!vote || bestScore < minSimilarity) return 0;
  // Normalize similarity above threshold to 0-1
  const similarityFactor = (bestScore - minSimilarity) / (1 - minSimilarity);
  // Fewer alternatives = higher confidence
  const agreementFactor = 1 / vote.totalVotes;
  // Base 50% + up to 30% from similarity + up to 20% from agreement
  return Math.round((0.5 + 0.3 * similarityFactor + 0.2 * agreementFactor) * 100);
};
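As a worked example, the following sketch evaluates the formula for a few inputs. It is a standalone adaptation: `bestScore` is passed as an explicit parameter (an assumption made so the snippet runs on its own), and the function name `dimensionConfidence` is hypothetical.

```javascript
function dimensionConfidence(vote, bestScore, minSimilarity) {
  if (!vote || bestScore < minSimilarity) return 0;
  const similarityFactor = (bestScore - minSimilarity) / (1 - minSimilarity);
  const agreementFactor = 1 / vote.totalVotes;
  return Math.round((0.5 + 0.3 * similarityFactor + 0.2 * agreementFactor) * 100);
}

// Unanimous vote, best match halfway between the threshold and 1.0:
// 0.5 + 0.3 * 0.5 + 0.2 * 1 = 0.85
console.log(dimensionConfidence({ totalVotes: 1 }, 0.9, 0.8)); // 85

// Two competing values, threshold 0.75:
// 0.5 + 0.3 * 0.6 + 0.2 * 0.5 = 0.78
console.log(dimensionConfidence({ totalVotes: 2 }, 0.9, 0.75)); // 78

// Best score below the threshold yields zero confidence.
console.log(dimensionConfidence({ totalVotes: 1 }, 0.7, 0.75)); // 0
```

With this formula, a dimension that clears its threshold always scores above 50, because the base term is 0.5 and the agreement term is positive.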
Escalation Thresholds
| Confidence | Action |
|---|---|
| >= 85% | Store, skip LLM, trigger self-learning |
| 70-84% | Store, may skip LLM for non-critical dimensions |
| 50-69% | Use Vectorize result but needs LLM verification |
| 30-49% | Low trust, LLM required |
| < 30% | Vectorize result ignored, full LLM classification |
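The table can be expressed as a small lookup. This is a sketch: the function name `escalationAction` and the returned action labels are hypothetical, chosen only to mirror the rows above.

```javascript
// Map a 0-100 confidence value to the action band from the table.
function escalationAction(confidence) {
  if (confidence >= 85) return "store_skip_llm_and_learn";
  if (confidence >= 70) return "store_maybe_skip_llm";
  if (confidence >= 50) return "use_with_llm_verification";
  if (confidence >= 30) return "llm_required";
  return "ignore_vectorize_full_llm";
}

for (const confidence of [92, 78, 55, 35, 12]) {
  console.log(confidence, escalationAction(confidence));
}
```

Checking the bands from highest to lowest keeps each boundary in one place, so changing a threshold means editing a single comparison.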
Self-Learning
The Vectorize stage implements a positive feedback loop: high-confidence classifications are written back to the index, where they serve as labeled examples for future lookups.
Learning Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ SELF-LEARNING LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Stage 1 │ │ Stage 2 │ │ Stage 4 │ │
│ │ Rules │──────▶│ Vectorize │──────▶│ LLM │ │
│ │ Engine │ │ Similarity│ │ Fallback │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────────────────────┐ │
│ │ HIGH CONFIDENCE? │ │
│ │ (confidence >= 85%) │ │
│ └───────────────────────────────────────┘ │
│ │ │
│ │ YES │
│ ▼ │
│ ┌───────────────────────────────────────┐ │
│ │ VECTORIZE FEEDBACK │ │
│ │ │ │
│ │ 1. Generate feature text │ │
│ │ 2. Create embedding │ │
│ │ 3. Upsert to Vectorize index │ │
│ │ 4. Future queries benefit │ │
│ └───────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Learning Sources
| Source | Confidence Requirement | Notes |
|---|---|---|
| seed | N/A (manual) | Initial curated examples |
| llm_auto | >= 85% | Auto-learned from LLM |
| vectorize_reinforcement | >= 85% | Self-reinforcement from Vectorize |
| manual | N/A | Human correction/annotation |
Learning Implementation
// Self-learning trigger (classifier-vectorize.js)
if (result.final_confidence >= 85 && !result.vectorize_feedback) {
  result.vectorize_feedback = {
    url: input.url,
    domain: input.domain,
    classifications: {
      tier1_type: getValue(result.tier1_type),
      domain_type: getValue(result.domain_type),
      page_type: getValue(result.page_type),
    },
    source: "vectorize_reinforcement",
    confidence: result.final_confidence,
  };
}
// Add to Vectorize (classifier-llm.js after high-confidence LLM result)
if (vectorizeFeedback && env.VECTORIZE_BACKLINK && result.confidence >= 70) {
  await addClassifiedUrl(
    { url, domain: extractDomainFromUrl(url) },
    vectorizeFeedback,
    env,
    "llm_auto"
  );
}
Learning Rate Control
Configured in classification-config.js:
export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80; // 80% for domains
export const URL_LEARNING_MIN_CONFIDENCE = 85; // 85% for URLs
export function shouldTriggerLearning(confidence, type = "url") {
  const threshold = type === "domain"
    ? DOMAIN_LEARNING_MIN_CONFIDENCE
    : URL_LEARNING_MIN_CONFIDENCE;
  return confidence >= threshold;
}
Index Management
Adding Vectors
// Add a single classified URL
export async function addClassifiedUrl(input, classification, env, source = "llm") {
  const { url, domain, domain_rank } = input;
  // Generate feature text
  const featureText = urlToFeatureText({
    url,
    domain,
    domain_rank,
    partial_classification: classification,
  });
  // Get embedding
  const embedding = await getEmbedding(featureText, env);
  // Create unique ID (max 64 chars)
  const id = url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);
  // Upsert to Vectorize
  await upsertVector(
    id,
    embedding,
    {
      url,
      domain,
      classifications: classification,
      source,
      indexed_at: Date.now(),
    },
    env
  );
  return { success: true, id };
}
Initializing Seed Examples
export async function initializeSeedExamples(env) {
  const results = {
    total: SEED_EXAMPLES.length,
    indexed: 0,
    errors: [],
  };
  for (const example of SEED_EXAMPLES) {
    try {
      const featureText = seedExampleToFeatureText(example);
      const embedding = await getEmbedding(featureText, env);
      const id = example.url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);
      await upsertVector(
        id,
        embedding,
        {
          url: example.url,
          domain: example.domain,
          classifications: example.classifications,
          source: "seed",
        },
        env
      );
      results.indexed++;
    } catch (err) {
      // Record the failure and continue with the remaining examples
      results.errors.push({ url: example.url, error: String(err) });
    }
  }
  return results;
}
Bulk Loading from Database
// Load high-confidence domains from D1 into Vectorize
POST /api/admin/classifier/bulk-load-vectorize
Query params:
- min_confidence: Minimum tier1_confidence (default: 0.85)
- limit: Max domains to load (default: 1000)
- offset: Pagination offset (default: 0)
- tier1_type: Filter by specific tier1_type (optional)
Getting Index Stats
export async function getIndexStats(env) {
  const index = env.VECTORIZE_BACKLINK;
  const info = await index.describe();
  return {
    dimensions: info.dimensions, // 768
    count: info.vectorCount,     // number of vectors
  };
}
Admin API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/admin/classifier/init-vectorize | POST | Initialize with seed examples |
| /api/admin/classifier/bulk-load-vectorize | POST | Load classified domains from DB |
| /api/admin/classifier/stats | GET | Get index stats |
| /api/admin/classifier/learn | POST | Manually add classification to index |
| /api/admin/classifier/seed-keyword-vectorize | POST | Initialize keyword index |
| /api/admin/classifier/keyword-vectorize-stats | GET | Keyword index stats |
Performance
Query Latency
| Operation | Latency | Notes |
|---|---|---|
| Embedding generation | ~50ms | Workers AI |
| Vectorize query | ~10ms | Global edge |
| Total Stage 2 | ~60-70ms | End-to-end |
Batch Operations
Vectorize supports batch operations for efficiency:
// Batch insert (up to 100 vectors)
await index.insert([
  { id: "id1", values: [...], metadata: {...} },
  { id: "id2", values: [...], metadata: {...} },
  // ...
]);
// Batch upsert
await index.upsert([...]);
// Batch delete
await index.deleteByIds(["id1", "id2", ...]);
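When loading more vectors than fit in one call, the input can be split into fixed-size batches first. The following is a minimal sketch that assumes the batch size of 100 mentioned in the comment above. `chunk` and `upsertInBatches` are hypothetical helper names, and `index` stands for any object exposing an async `upsert(vectors)` method, such as a Vectorize binding.

```javascript
// Split an array into consecutive slices of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Upsert sequentially, one batch per call; returns the number of calls made.
async function upsertInBatches(index, vectors, batchSize = 100) {
  const batches = chunk(vectors, batchSize);
  for (const batch of batches) {
    await index.upsert(batch);
  }
  return batches.length;
}

// Usage with a stand-in index that only records batch sizes.
const recorded = [];
const fakeIndex = { upsert: async (batch) => { recorded.push(batch.length); } };
const vectors = Array.from({ length: 250 }, (_, i) => ({ id: `id${i}`, values: [0] }));

upsertInBatches(fakeIndex, vectors).then((calls) => {
  console.log(calls, recorded); // 3 [ 100, 100, 50 ]
});
```

Sequential awaits keep the sketch simple and make a failing batch easy to identify; whether batches can safely run in parallel depends on rate limits not covered here.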
Cost Summary
| Operation | Cost | Volume |
|---|---|---|
| Workers AI embedding | $0.00001 | Per query |
| Vectorize query | Included | Free with Workers |
| Vectorize storage | $0.05/GB-month | ~1M vectors ≈ 3 GB (768 float32 values each) |
Total cost per URL classification via Vectorize: ~$0.00001
Seed Examples
The system is bootstrapped with 150+ manually curated seed examples covering:
Coverage by Domain Type
| Domain Type | Examples |
|---|---|
| news_publisher | NYT, TechCrunch, Reuters, Bloomberg |
| saas_product | Stripe, Notion, Airtable, Figma |
| blog_publisher | Medium, Substack, newsletters |
| ugc_forum_community | Reddit, Quora, Stack Overflow |
| affiliate_review_site | Wirecutter, G2, Capterra |
| directory_citation | Yelp, Crunchbase, Yellow Pages |
| app_platform | App Store, Google Play |
| government_site | SBA.gov, state sites |
| education_academic | Stanford, MIT, Coursera |
| pbn_suspected | Spam patterns, suspicious TLDs |
Seed Example Structure
{
  url: "https://techcrunch.com/2024/01/15/startup-raises-50m-series-b/",
  domain: "techcrunch.com",
  classifications: {
    domain_type: "news_publisher",
    channel_bucket: "pr_earned_media",
    page_type: "news_article",
    quality_tier: "tier_2",
  },
}
Adding Seed Examples
Edit src/data/vectorize-seed-examples.js:
export const SEED_EXAMPLES = [
  // ... existing examples
  // Add new example
  {
    url: "https://newsite.com/page",
    domain: "newsite.com",
    classifications: {
      domain_type: DOMAIN_TYPES.YOUR_TYPE,
      channel_bucket: CHANNEL_BUCKETS.YOUR_CHANNEL,
      page_type: PAGE_TYPES.YOUR_PAGE_TYPE,
      quality_tier: QUALITY_TIERS.TIER_N,
    },
  },
];
Then reinitialize:
# Via admin API
curl -X POST https://your-worker.workers.dev/api/admin/classifier/init-vectorize
Keyword Vectorize
A parallel system exists for keyword classification using the same architecture:
Feature Text Generation
// Outline only: the derivation of each variable from `keywordData` is omitted here.
export function keywordToFeatureText(keywordData, partialClassification = {}) {
  const parts = [];
  // Keyword itself is primary signal
  parts.push(`keyword: ${keyword}`);
  // Length signals query specificity
  parts.push(`length: short|medium|long`);
  // DataForSEO intent
  parts.push(`intent: ${searchIntentInfo.main_intent}`);
  // Volume bucket
  parts.push(`volume: very_high|high|medium|low|very_low`);
  // CPC signals commercial value
  parts.push(`cpc: very_high|high|medium|low`);
  // Partial classification from rules
  parts.push(`classified_intent: ${intentType}`);
  parts.push(`journey: ${journeyMoment}`);
  parts.push(`pattern: ${keywordPattern}`);
  return parts.join(" | ");
}
Keyword Classification Dimensions
| Dimension | Description |
|---|---|
| expertise_level | beginner, intermediate, expert |
| buyer_behavior | researcher, evaluator, ready_to_buy |
| role_context | developer, marketer, executive |
| journey_moment | awareness, consideration, decision |
| topic_entity_type | brand, product, concept |
| use_case_type | troubleshooting, learning, comparison |
Special Handling
Homepage Validation
The system validates that only root URLs can be classified as "homepage":
if (votedPageType === "homepage") {
  const isRootUrl = /^https?:\/\/[^/]+\/?$/.test(url);
  if (!isRootUrl) {
    // Reject homepage classification for non-root URLs
    console.log(`[vectorize] Rejected homepage for non-root: ${url}`);
  }
}
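The snippet above only logs the rejection. A sketch of how the check might be applied to the vote result follows. `enforceHomepageRule` is a hypothetical name, and dropping the `page_type` vote so that a later stage can decide is an assumption about the desired behavior, not something the original code confirms.

```javascript
const ROOT_URL_PATTERN = /^https?:\/\/[^/]+\/?$/;

// Returns `votes` unchanged, or a copy without page_type when "homepage"
// was voted for a URL that is not a site root.
function enforceHomepageRule(votes, url) {
  const votedPageType = votes.page_type?.value;
  if (votedPageType !== "homepage" || ROOT_URL_PATTERN.test(url)) {
    return votes;
  }
  const { page_type, ...rest } = votes;
  return rest;
}

const votes = { page_type: { value: "homepage", score: 2.5, totalVotes: 1 } };
console.log(enforceHomepageRule(votes, "https://example.com/"));        // page_type kept
console.log(enforceHomepageRule(votes, "https://example.com/pricing")); // {}
```

Note that the pattern treats a root URL followed by a query string, such as `https://example.com/?ref=x`, as non-root.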
Low Similarity Fallback
When similarity is below the absolute minimum threshold:
if (bestScore < MIN_SIMILARITY_FOR_ANY) {
  result.final_confidence = Math.round(bestScore * 30); // Low confidence
  result.needs_content_parse = true;                    // Try content parsing
  result.needs_llm = true;                              // Fall back to LLM
  result.low_similarity = true;
  return result;
}
Files Reference
| File | Purpose |
|---|---|
| src/lib/classification/classifier-vectorize.js | URL/backlink Vectorize classifier |
| src/lib/keywords/keyword-classifier-vectorize.js | Keyword Vectorize classifier |
| src/lib/classification/domain-classifier.js | Domain classifier (uses VECTORIZE_DOMAINS) |
| src/lib/classification/embeddings.js | General embedding helper |
| src/data/vectorize-seed-examples.js | Curated seed examples for URLs |
| scripts/seed-keyword-vectorize.js | Keyword seed examples |
| src/endpoints/admin/admin-classifier.js | Admin API endpoints |
Monitoring & Debugging
Check Index Health
# Get stats
curl https://your-worker.workers.dev/api/admin/classifier/stats
# Response:
{
  "seed_examples": {
    "total": 157,
    "byDomainType": {...},
    "byChannel": {...}
  },
  "unique_domains": 89,
  "vectorize_index": {
    "dimensions": 768,
    "count": 2500
  }
}
Debug Classification
# Classify single URL with full trace
curl -X POST https://your-worker.workers.dev/api/admin/classifier/classify \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/page"}'
# Response includes:
{
  "similar_urls": [
    { "url": "...", "score": 0.92, "classifications": {...} },
    ...
  ],
  "classification_source": "rules_vectorize",
  "final_confidence": 82
}
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Low confidence scores | Not enough seed examples | Add more relevant seeds |
| Wrong classifications | Misleading feature text | Improve urlToFeatureText() |
| Slow queries | Large index | Use metadata filters |
| Missing binding | wrangler.toml misconfigured | Check [[vectorize]] blocks |