Vectorize ML System

RankDisco uses Cloudflare Vectorize for similarity-based classification as Stage 2 of the classification pipeline. This document covers the complete Vectorize ML system, including index architecture, embedding generation, similarity search, and self-learning capabilities.

Vectorize Overview

What is Cloudflare Vectorize?

Cloudflare Vectorize is a globally distributed vector database that enables semantic similarity search at the edge. It stores high-dimensional vectors (embeddings) and performs nearest-neighbor lookups with low latency (the Performance section below reports roughly 10ms per query).

How RankDisco Uses Vectorize

RankDisco leverages Vectorize as a self-improving classification system:

1. Seed Examples: Manually curated labeled examples bootstrap the system
2. Similarity Search: New URLs/domains/keywords are compared against known examples
3. Voting: Top-k similar matches "vote" on classification dimensions
4. Self-Learning: High-confidence results are fed back to improve future classifications

┌─────────────────────────────────────────────────────────────────────────────┐
│                        VECTORIZE CLASSIFICATION FLOW                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌─────────────────┐    ┌──────────────────────────┐   │
│  │   INPUT      │    │  WORKERS AI     │    │      VECTORIZE           │   │
│  │              │    │                 │    │                          │   │
│  │  URL         │───▶│  BGE-BASE-EN    │───▶│  Query similar vectors   │   │
│  │  + metadata  │    │  embedding      │    │  topK=5, min_score=0.65  │   │
│  │              │    │  (768 dims)     │    │                          │   │
│  └──────────────┘    └─────────────────┘    └──────────────────────────┘   │
│                                                      │                      │
│                                                      ▼                      │
│                             ┌─────────────────────────────────────────┐     │
│                             │           VOTING SYSTEM                  │     │
│                             │                                          │     │
│                             │  Match 1: score=0.92 → domain_type:X    │     │
│                             │  Match 2: score=0.88 → domain_type:X    │     │
│                             │  Match 3: score=0.85 → domain_type:Y    │     │
│                             │  Match 4: score=0.82 → domain_type:X    │     │
│                             │  Match 5: score=0.78 → domain_type:X    │     │
│                             │                                          │     │
│                             │  Vote Result: X (weighted 3.40 vs 0.85) │     │
│                             └─────────────────────────────────────────┘     │
│                                                      │                      │
│                                                      ▼                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                        OUTPUT                                         │  │
│  │                                                                       │  │
│  │  classification: {                                                    │  │
│  │    domain_type: { value: "news_publisher", confidence: 82 },         │  │
│  │    channel_bucket: { value: "pr_earned_media", confidence: 78 },     │  │
│  │    page_type: { value: "news_article", confidence: 75 }              │  │
│  │  }                                                                    │  │
│  │  similar_urls: [...top 5 matches with scores]                        │  │
│  │  needs_llm: false  (if confidence >= 70 on all dimensions)           │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Index Architecture

RankDisco maintains five Vectorize indexes for different classification tasks:

Vectorize Index Configuration

Binding	Index Name	Purpose	Dimensions
`VECTORIZE`	`rankfabric`	General purpose vectors	768
`CATEGORY_VECTORS`	`rankfabric-category-embeddings`	Category classification	768
`VECTORIZE_BACKLINK`	`backlink-classifier`	URL/Backlink classification	768
`VECTORIZE_DOMAINS`	`domain-classifier`	Domain-level classification	768
`VECTORIZE_KEYWORDS`	`keyword-classifier`	Keyword intent classification	768

wrangler.toml Configuration

# Vectorize bindings
[[vectorize]]
binding = "VECTORIZE"
index_name = "rankfabric"

[[vectorize]]
binding = "CATEGORY_VECTORS"
index_name = "rankfabric-category-embeddings"

[[vectorize]]
binding = "VECTORIZE_BACKLINK"
index_name = "backlink-classifier"

[[vectorize]]
binding = "VECTORIZE_DOMAINS"
index_name = "domain-classifier"

[[vectorize]]
binding = "VECTORIZE_KEYWORDS"
index_name = "keyword-classifier"

Index Structure

Each vector in the index contains:

{
  id: "unique_identifier",           // Derived from the URL/domain/keyword (max 64 chars)
  values: [0.123, -0.456, ...],     // 768-dimensional embedding
  metadata: {
    url: "https://example.com/page",
    domain: "example.com",
    classifications: {
      domain_type: "news_publisher",
      channel_bucket: "pr_earned_media",
      page_type: "news_article"
    },
    source: "seed" | "llm" | "llm_auto" | "vectorize_reinforcement" | "manual",
    indexed_at: 1704067200000
  }
}

Embedding Generation

Workers AI Model

RankDisco uses the BGE-BASE-EN-V1.5 model from Workers AI for generating embeddings:

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";

async function getEmbedding(text, env) {
  const response = await env.AI.run(EMBEDDING_MODEL, {
    text: [text],
  });
  return response.data[0];  // 768-dimensional vector
}

Property	Value
Model	`@cf/baai/bge-base-en-v1.5`
Dimensions	768
Cost	~$0.00001 per embedding
Latency	~50ms

Feature Text Generation

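The key to effective similarity search is converting structured data into meaningful text. The urlToFeatureText() function creates a rich text representation. The listing below is abridged: it shows which parts are emitted and in what order, but omits how each variable is derived from the input and the checks that skip a part when its value is missing: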

export function urlToFeatureText(input) {
  const parts = [];
  
  // Domain info
  parts.push(`domain: ${normalizedDomain}`);
  parts.push(`tld: ${tld}`);
  
  // URL path components (meaningful segments)
  parts.push(`path: ${pathParts.join(" ")}`);
  
  // Page title (HUGE signal for page type)
  // Prefers SERP title over HTML title
  parts.push(`title: ${effectiveTitle}`);
  
  // SERP description from Google
  parts.push(`snippet: ${serpDescription}`);
  
  // Anchor text context
  parts.push(`anchor: ${anchorText}`);
  
  // Platform type from DataForSEO
  parts.push(`platform: ${platformType}`);
  
  // Domain authority bucket
  parts.push(`authority: tier1|tier2|tier3|tier4|tier5`);
  
  // Partial classification from rules engine
  parts.push(`type: ${domainType}`);
  parts.push(`channel: ${channelBucket}`);
  parts.push(`page: ${pageType}`);
  
  return parts.join(" | ");
}

Example output:

domain: techcrunch.com | tld: com | path: 2024 01 15 startup-raises-series-b |
title: startup xyz raises $50m series b | snippet: tech news startup funding |
platform: news | authority: tier1 | type: news_publisher | channel: pr_earned_media
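
To make the abridged listing concrete, here is a minimal runnable sketch of the same idea. The input field names (`serpTitle`, `title`, `serpDescription`, `anchorText`, `platformType`, `authorityTier`, `partialClassification`), the lowercasing, and the function name are assumptions made for illustration. The real urlToFeatureText() may derive these values differently:

```javascript
// Hypothetical stand-in for urlToFeatureText(); input field names are assumed.
function urlToFeatureTextSketch(input) {
  const parsed = new URL(input.url);
  const domain = parsed.hostname.replace(/^www\./, "");
  const tld = domain.split(".").pop();
  const pathParts = parsed.pathname.split("/").filter(Boolean);

  const parts = [`domain: ${domain}`, `tld: ${tld}`];

  // Add a labeled part only when a value is present, so missing signals
  // never become the literal text "undefined" in the embedding input.
  const add = (label, value) => {
    if (value) parts.push(`${label}: ${String(value).toLowerCase()}`);
  };

  add("path", pathParts.join(" "));
  add("title", input.serpTitle || input.title); // prefer the SERP title
  add("snippet", input.serpDescription);
  add("anchor", input.anchorText);
  add("platform", input.platformType);
  add("authority", input.authorityTier);

  const partial = input.partialClassification || {};
  add("type", partial.domain_type);
  add("channel", partial.channel_bucket);
  add("page", partial.page_type);

  return parts.join(" | ");
}

console.log(
  urlToFeatureTextSketch({
    url: "https://techcrunch.com/2024/01/15/startup-raises-series-b",
    serpTitle: "Startup XYZ raises $50M Series B",
    platformType: "news",
    authorityTier: "tier1",
    partialClassification: { domain_type: "news_publisher" },
  })
);
```

Skipping absent fields keeps the text short and keeps two URLs with the same known signals close together in embedding space.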

Similarity Search

Query Process

async function querySimilar(embedding, env, topK = 5) {
  const index = env.VECTORIZE_BACKLINK;
  
  const results = await index.query(embedding, {
    topK,
    returnMetadata: true,
  });
  
  return results.matches || [];
}
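
The flow diagram lists a minimum score of 0.65, while the query above returns every top-K match regardless of score. One way to apply the floor is to filter after the query. This sketch assumes the index uses a metric where a higher score means a closer match (for example cosine similarity). The function name and the `index` parameter are illustrative:

```javascript
const MIN_SIMILARITY_FOR_ANY = 0.65;

// Hypothetical variant of querySimilar() that drops matches below a score floor.
// `index` is any object exposing a Vectorize-style query(vector, options) method.
async function querySimilarAboveFloor(
  embedding,
  index,
  topK = 5,
  minScore = MIN_SIMILARITY_FOR_ANY
) {
  const results = await index.query(embedding, { topK, returnMetadata: true });
  const matches = results.matches || [];
  // Keep only matches that are similar enough to be allowed to vote
  return matches.filter((match) => match.score >= minScore);
}
```

Filtering before voting means a weak match can no longer tip a close vote.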

Voting Algorithm

When multiple similar vectors are found, they "vote" on each classification dimension:

function voteOnClassifications(matches) {
  const fields = ["domain_type", "channel_bucket", "page_type"];
  const votes = {};

  for (const field of fields) {
    const fieldVotes = {};

    for (const match of matches) {
      const value = match.metadata?.classifications?.[field];
      if (value) {
        // Weight by similarity score
        fieldVotes[value] = (fieldVotes[value] || 0) + match.score;
      }
    }

    // Find winner with highest weighted score
    let winner = null;
    let maxScore = 0;
    for (const [value, score] of Object.entries(fieldVotes)) {
      if (score > maxScore) {
        maxScore = score;
        winner = value;
      }
    }

    if (winner) {
      votes[field] = {
        value: winner,
        score: maxScore,
        totalVotes: Object.keys(fieldVotes).length, // number of distinct values that received votes
      };
    }
  }

  return votes;
}

Vote Weighting

Factor	Weight	Description
Similarity Score	Direct	Higher similarity = stronger vote
Agreement	Bonus	Fewer alternatives = higher confidence
Match Count	Threshold	Need 2+ matches above threshold

Threshold Tuning

Different classification dimensions require different minimum similarity scores:

Dimension-Specific Thresholds

Dimension	Min Threshold	Rationale
`domain_type`	0.75	Domain classification is broad, moderate similarity needed
`page_type`	0.80	Page type is URL-specific, needs high similarity
`tactic_type`	0.70	Marketing tactics can be broadly similar
Absolute minimum	0.65	Below this, similarity is too low to trust

const MIN_SIMILARITY_FOR_DOMAIN_TYPE = 0.75;
const MIN_SIMILARITY_FOR_PAGE_TYPE = 0.80;
const MIN_SIMILARITY_FOR_TACTIC = 0.70;
const MIN_SIMILARITY_FOR_ANY = 0.65;

Confidence Calculation

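Confidence is calculated per dimension from two inputs: how far the best similarity score clears the dimension's threshold, and how many distinct values competed for the vote:

const calcDimensionConfidence = (vote, minSimilarity) => {
  // bestScore is defined in the enclosing scope (not shown here)
  if (!vote || bestScore < minSimilarity) return 0;
  
  // Normalize similarity above threshold to 0-1
  const similarityFactor = (bestScore - minSimilarity) / (1 - minSimilarity);
  
  // Fewer alternatives = higher confidence
  // (vote.totalVotes is the number of distinct values that received votes)
  const agreementFactor = 1 / vote.totalVotes;
  
  // Base 50% + up to 30% from similarity + up to 20% from agreement
  return Math.round((0.5 + 0.3 * similarityFactor + 0.2 * agreementFactor) * 100);
};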
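
A few worked values help when tuning thresholds. The sketch below restates the same formula as a standalone function that takes the best score and the number of distinct voted values as explicit arguments. The function name and signature are assumptions made so the example runs on its own:

```javascript
// Standalone restatement of the confidence formula, for experimentation only.
function dimensionConfidence(bestScore, distinctValues, minSimilarity) {
  if (!distinctValues || bestScore < minSimilarity) return 0;
  const similarityFactor = (bestScore - minSimilarity) / (1 - minSimilarity);
  const agreementFactor = 1 / distinctValues;
  return Math.round((0.5 + 0.3 * similarityFactor + 0.2 * agreementFactor) * 100);
}

// domain_type threshold (0.75), best match 0.92:
console.log(dimensionConfidence(0.92, 2, 0.75)); // 80 (two competing values)
console.log(dimensionConfidence(0.92, 1, 0.75)); // 90 (unanimous vote)
console.log(dimensionConfidence(0.7, 1, 0.75)); // 0  (below the threshold)
```

Any vote that clears its threshold scores at least 50 plus its agreement share, so a unanimous vote never falls below 70.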

Escalation Thresholds

Confidence	Action
>= 85%	Store, skip LLM, trigger self-learning
70-84%	Store, may skip LLM for non-critical dimensions
50-69%	Use Vectorize result but needs LLM verification
30-49%	Low trust, LLM required
< 30%	Vectorize result ignored, full LLM classification

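The escalation table maps directly onto a small lookup function. This is a sketch only: the action labels are hypothetical names invented for the example, and the real pipeline may branch on more than a single confidence number:

```javascript
// Map a 0-100 confidence value to the action in the escalation table.
// The returned labels are illustrative, not identifiers from the codebase.
function escalationAction(confidence) {
  if (confidence >= 85) return "store_skip_llm_and_learn";
  if (confidence >= 70) return "store_may_skip_llm";
  if (confidence >= 50) return "verify_with_llm";
  if (confidence >= 30) return "llm_required";
  return "ignore_vectorize_result";
}

console.log(escalationAction(82)); // prints "store_may_skip_llm"
```

Checking the bands from highest to lowest keeps each boundary value (85, 70, 50, 30) in the higher band, matching the table.
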
Self-Learning

RankDisco implements a positive feedback loop around Vectorize: high-confidence classifications are written back to the index, where they reinforce future similarity matches.

Learning Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           SELF-LEARNING LOOP                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│    ┌────────────┐       ┌────────────┐       ┌────────────┐                │
│    │   Stage 1  │       │   Stage 2  │       │   Stage 4  │                │
│    │   Rules    │──────▶│  Vectorize │──────▶│    LLM     │                │
│    │  Engine    │       │  Similarity│       │  Fallback  │                │
│    └────────────┘       └────────────┘       └────────────┘                │
│                               │                    │                        │
│                               │                    │                        │
│                               ▼                    ▼                        │
│                    ┌───────────────────────────────────────┐               │
│                    │        HIGH CONFIDENCE?               │               │
│                    │        (confidence >= 85%)            │               │
│                    └───────────────────────────────────────┘               │
│                               │                                             │
│                               │ YES                                         │
│                               ▼                                             │
│                    ┌───────────────────────────────────────┐               │
│                    │      VECTORIZE FEEDBACK               │               │
│                    │                                        │               │
│                    │  1. Generate feature text              │               │
│                    │  2. Create embedding                   │               │
│                    │  3. Upsert to Vectorize index         │               │
│                    │  4. Future queries benefit             │               │
│                    └───────────────────────────────────────┘               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Learning Sources

Source	Confidence Requirement	Notes
`seed`	N/A (manual)	Initial curated examples
`llm_auto`	>= 85%	Auto-learned from LLM
`vectorize_reinforcement`	>= 85%	Self-reinforcement from Vectorize
`manual`	N/A	Human correction/annotation

Learning Implementation

// Self-learning trigger (classifier-vectorize.js)
if (result.final_confidence >= 85 && !result.vectorize_feedback) {
  result.vectorize_feedback = {
    url: input.url,
    domain: input.domain,
    classifications: {
      tier1_type: getValue(result.tier1_type),
      domain_type: getValue(result.domain_type),
      page_type: getValue(result.page_type),
    },
    source: "vectorize_reinforcement",
    confidence: result.final_confidence,
  };
}

// Add to Vectorize (classifier-llm.js after high-confidence LLM result)
if (vectorizeFeedback && env.VECTORIZE_BACKLINK && result.confidence >= 70) {
  await addClassifiedUrl(
    { url, domain: extractDomainFromUrl(url) },
    vectorizeFeedback,
    env,
    "llm_auto"
  );
}

Learning Rate Control

Configured in classification-config.js:

export const DOMAIN_LEARNING_MIN_CONFIDENCE = 80;  // 80% for domains
export const URL_LEARNING_MIN_CONFIDENCE = 85;     // 85% for URLs

export function shouldTriggerLearning(confidence, type = "url") {
  const threshold = type === "domain" 
    ? DOMAIN_LEARNING_MIN_CONFIDENCE 
    : URL_LEARNING_MIN_CONFIDENCE;
  return confidence >= threshold;
}

Index Management

Adding Vectors

// Add a single classified URL
export async function addClassifiedUrl(input, classification, env, source = "llm") {
  const { url, domain, domain_rank } = input;

  // Generate feature text
  const featureText = urlToFeatureText({
    url,
    domain,
    domain_rank,
    partial_classification: classification,
  });

  // Get embedding
  const embedding = await getEmbedding(featureText, env);
  
  // Create unique ID (max 64 chars)
  const id = url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);

  // Upsert to Vectorize
  await upsertVector(
    id,
    embedding,
    {
      url,
      domain,
      classifications: classification,
      source,
      indexed_at: Date.now(),
    },
    env
  );

  return { success: true, id };
}
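
One caveat with the truncated ID above: two long URLs that share their first 64 sanitized characters produce the same ID, so the later upsert overwrites the earlier vector. A possible alternative, shown here as a sketch and not as what the codebase does, is to hash the full URL. A SHA-256 hex digest is always exactly 64 characters. The function names are illustrative:

```javascript
// Web Crypto is a global in Cloudflare Workers and recent Node.js versions;
// fall back to node:crypto's webcrypto export on older Node.js.
const subtle = (globalThis.crypto ?? require("node:crypto").webcrypto).subtle;

// Hypothetical replacement ID: hex SHA-256 of the full URL (64 characters).
async function urlToVectorId(url) {
  const bytes = new TextEncoder().encode(url);
  const digest = await subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("");
}

// The truncation scheme from addClassifiedUrl(), kept here for comparison.
function truncatedId(url) {
  return url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);
}
```

Switching ID schemes on a live index would require re-indexing, since existing vectors are stored under the old IDs.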

Initializing Seed Examples

export async function initializeSeedExamples(env) {
  const results = {
    total: SEED_EXAMPLES.length,
    indexed: 0,
    errors: [],
  };

  for (const example of SEED_EXAMPLES) {
    try {
      const featureText = seedExampleToFeatureText(example);
      const embedding = await getEmbedding(featureText, env);
      const id = example.url.replace(/[^a-zA-Z0-9]/g, "_").slice(0, 64);

      await upsertVector(
        id,
        embedding,
        {
          url: example.url,
          domain: example.domain,
          classifications: example.classifications,
          source: "seed",
        },
        env
      );

      results.indexed++;
    } catch (err) {
      // Record the failure and continue with the remaining examples
      results.errors.push({ url: example.url, error: err.message });
    }
  }

  return results;
}

Bulk Loading from Database

// Load high-confidence domains from D1 into Vectorize
POST /api/admin/classifier/bulk-load-vectorize
Query params:
  - min_confidence: Minimum tier1_confidence (default: 0.85)
  - limit: Max domains to load (default: 1000)
  - offset: Pagination offset (default: 0)
  - tier1_type: Filter by specific tier1_type (optional)

Getting Index Stats

export async function getIndexStats(env) {
  const index = env.VECTORIZE_BACKLINK;
  
  const info = await index.describe();
  return {
    dimensions: info.dimensions,    // 768
    count: info.vectorCount,        // number of vectors
  };
}

Admin API Endpoints

Endpoint	Method	Description
`/api/admin/classifier/init-vectorize`	POST	Initialize with seed examples
`/api/admin/classifier/bulk-load-vectorize`	POST	Load classified domains from DB
`/api/admin/classifier/stats`	GET	Get index stats
`/api/admin/classifier/learn`	POST	Manually add classification to index
`/api/admin/classifier/seed-keyword-vectorize`	POST	Initialize keyword index
`/api/admin/classifier/keyword-vectorize-stats`	GET	Keyword index stats

Performance

Query Latency

Operation	Latency	Notes
Embedding generation	~50ms	Workers AI
Vectorize query	~10ms	Global edge
Total Stage 2	~60-70ms	End-to-end

Batch Operations

Vectorize supports batch operations for efficiency:

// Batch insert (up to 100 vectors)
await index.insert([
  { id: "id1", values: [...], metadata: {...} },
  { id: "id2", values: [...], metadata: {...} },
  // ...
]);

// Batch upsert
await index.upsert([...]);

// Batch delete
await index.deleteByIds(["id1", "id2", ...]);
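
When loading more vectors than fit in one call, the list can be split into batches first. A minimal sketch, using the batch ceiling of 100 quoted above as the default. `chunk` and `upsertInBatches` are hypothetical helper names, and `index` is any object with a Vectorize-style `upsert(vectors)` method:

```javascript
const BATCH_SIZE = 100; // batch ceiling quoted in this document

// Split an array into consecutive slices of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Upsert vectors one batch at a time and report how many were sent.
async function upsertInBatches(index, vectors, batchSize = BATCH_SIZE) {
  let written = 0;
  for (const batch of chunk(vectors, batchSize)) {
    await index.upsert(batch);
    written += batch.length;
  }
  return written;
}
```

Sending batches sequentially keeps memory use and request rates predictable, at the cost of a longer total load time than sending them in parallel.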

Cost Summary

Operation	Cost	Notes
Workers AI embedding	$0.00001	Per query
Vectorize query	Included	Free with Workers
Vectorize storage	$0.05/GB-month	~1M vectors ≈ 3 GB of raw float32 values (768 dims × 4 bytes)

Total cost per URL classification via Vectorize: ~$0.00001

Seed Examples

The system is bootstrapped with 150+ manually curated seed examples covering:

Coverage by Domain Type

Domain Type	Examples
`news_publisher`	NYT, TechCrunch, Reuters, Bloomberg
`saas_product`	Stripe, Notion, Airtable, Figma
`blog_publisher`	Medium, Substack, newsletters
`ugc_forum_community`	Reddit, Quora, Stack Overflow
`affiliate_review_site`	Wirecutter, G2, Capterra
`directory_citation`	Yelp, Crunchbase, Yellow Pages
`app_platform`	App Store, Google Play
`government_site`	SBA.gov, state sites
`education_academic`	Stanford, MIT, Coursera
`pbn_suspected`	Spam patterns, suspicious TLDs

Seed Example Structure

{
  url: "https://techcrunch.com/2024/01/15/startup-raises-50m-series-b/",
  domain: "techcrunch.com",
  classifications: {
    domain_type: "news_publisher",
    channel_bucket: "pr_earned_media",
    page_type: "news_article",
    quality_tier: "tier_2",
  },
}

Adding Seed Examples

Edit src/data/vectorize-seed-examples.js:

export const SEED_EXAMPLES = [
  // ... existing examples
  
  // Add new example
  {
    url: "https://newsite.com/page",
    domain: "newsite.com",
    classifications: {
      domain_type: DOMAIN_TYPES.YOUR_TYPE,
      channel_bucket: CHANNEL_BUCKETS.YOUR_CHANNEL,
      page_type: PAGE_TYPES.YOUR_PAGE_TYPE,
      quality_tier: QUALITY_TIERS.TIER_N,
    },
  },
];

Then reinitialize:

# Via admin API
curl -X POST https://your-worker.workers.dev/api/admin/classifier/init-vectorize

Keyword Vectorize

A parallel system exists for keyword classification using the same architecture:

Feature Text Generation

export function keywordToFeatureText(keywordData, partialClassification = {}) {
  const parts = [];

  // Keyword itself is primary signal
  parts.push(`keyword: ${keyword}`);

  // Length signals query specificity
  parts.push(`length: short|medium|long`);

  // DataForSEO intent
  parts.push(`intent: ${searchIntentInfo.main_intent}`);

  // Volume bucket
  parts.push(`volume: very_high|high|medium|low|very_low`);

  // CPC signals commercial value
  parts.push(`cpc: very_high|high|medium|low`);

  // Partial classification from rules
  parts.push(`classified_intent: ${intentType}`);
  parts.push(`journey: ${journeyMoment}`);
  parts.push(`pattern: ${keywordPattern}`);

  return parts.join(" | ");
}

Keyword Classification Dimensions

Dimension	Description
`expertise_level`	beginner, intermediate, expert
`buyer_behavior`	researcher, evaluator, ready_to_buy
`role_context`	developer, marketer, executive
`journey_moment`	awareness, consideration, decision
`topic_entity_type`	brand, product, concept
`use_case_type`	troubleshooting, learning, comparison

Special Handling

Homepage Validation

The system validates that only root URLs can be classified as "homepage":

if (votedPageType === "homepage") {
  const isRootUrl = /^https?:\/\/[^/]+\/?$/.test(url);
  if (!isRootUrl) {
    // Reject homepage classification for non-root URLs
    console.log(`[vectorize] Rejected homepage for non-root: ${url}`);
  }
}
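
The excerpt above only logs the mismatch. A self-contained sketch of the full check might look like the following. Returning `null` for a rejected vote is an assumption made for illustration, as is the function name. The actual classifier may instead fall back to the runner-up value or defer to a later stage:

```javascript
// Same root-URL pattern as in the excerpt above.
const ROOT_URL_PATTERN = /^https?:\/\/[^/]+\/?$/;

// Hypothetical helper: return the voted page type, or null when a
// "homepage" vote is rejected because the URL is not a site root.
function validatePageType(url, votedPageType) {
  if (votedPageType === "homepage" && !ROOT_URL_PATTERN.test(url)) {
    console.log(`[vectorize] Rejected homepage for non-root: ${url}`);
    return null;
  }
  return votedPageType;
}
```

Note that this pattern treats a root URL with a query string, such as `https://example.com/?ref=x`, as non-root.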

Low Similarity Fallback

When similarity is below the absolute minimum threshold:

if (bestScore < MIN_SIMILARITY_FOR_ANY) {
  result.final_confidence = Math.round(bestScore * 30);  // Low confidence
  result.needs_content_parse = true;  // Try content parsing
  result.needs_llm = true;            // Fall back to LLM
  result.low_similarity = true;
  return result;
}

Files Reference

File	Purpose
`src/lib/classification/classifier-vectorize.js`	URL/backlink Vectorize classifier
`src/lib/keywords/keyword-classifier-vectorize.js`	Keyword Vectorize classifier
`src/lib/classification/domain-classifier.js`	Domain classifier (uses VECTORIZE_DOMAINS)
`src/lib/classification/embeddings.js`	General embedding helper
`src/data/vectorize-seed-examples.js`	Curated seed examples for URLs
`scripts/seed-keyword-vectorize.js`	Keyword seed examples
`src/endpoints/admin/admin-classifier.js`	Admin API endpoints

Monitoring & Debugging

Check Index Health

# Get stats
curl https://your-worker.workers.dev/api/admin/classifier/stats

# Response:
{
  "seed_examples": {
    "total": 157,
    "byDomainType": {...},
    "byChannel": {...}
  },
  "unique_domains": 89,
  "vectorize_index": {
    "dimensions": 768,
    "count": 2500
  }
}

Debug Classification

# Classify single URL with full trace
curl -X POST https://your-worker.workers.dev/api/admin/classifier/classify \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/page"}'

# Response includes:
{
  "similar_urls": [
    { "url": "...", "score": 0.92, "classifications": {...} },
    ...
  ],
  "classification_source": "rules_vectorize",
  "final_confidence": 82
}

Common Issues

Issue	Cause	Solution
Low confidence scores	Not enough seed examples	Add more relevant seeds
Wrong classifications	Misleading feature text	Improve urlToFeatureText()
Slow queries	Large index	Use metadata filters
Missing binding	wrangler.toml misconfigured	Check [[vectorize]] blocks
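
For the missing-binding case, a startup check can fail fast with a clear message. Workers expose bindings as properties on `env`, so the check is a simple lookup. The binding names come from the wrangler.toml section above, plus the `AI` binding used by getEmbedding(). The helper name is hypothetical:

```javascript
// Bindings this document relies on (names from the wrangler.toml section).
const REQUIRED_BINDINGS = [
  "AI",
  "VECTORIZE",
  "CATEGORY_VECTORS",
  "VECTORIZE_BACKLINK",
  "VECTORIZE_DOMAINS",
  "VECTORIZE_KEYWORDS",
];

// Return the names of required bindings that are absent from `env`.
function findMissingBindings(env, required = REQUIRED_BINDINGS) {
  return required.filter((name) => env[name] == null);
}

// Example: an env object where only two bindings are configured.
const missing = findMissingBindings({ AI: {}, VECTORIZE_BACKLINK: {} });
console.log(missing.join(", "));
```

Calling this at the top of a request handler and returning an error that lists the missing names is usually easier to diagnose than a later failure on an undefined binding.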
