Skip to main content

D1 Database Schema Documentation

This document provides comprehensive documentation for the RankDisco D1 (SQLite) database schema, including all tables, relationships, indexes, and common query patterns.


Schema Overview

RankDisco uses Cloudflare D1 (SQLite) as its primary database. The schema is designed around three core concepts:

  1. Entity Tables - Core data entities (domains, URLs, keywords, apps, brands)
  2. Classification Tables - Multi-dimensional classification results with confidence scores
  3. Operational Tables - Workflow tracking, cost monitoring, and activity history

Table Categories

CategoryTablesPurpose
Core Entitiesdomains, urls, keywords, brands, appsPrimary data entities
Backlink Intelligencebacklinks, referring_domains, domain_summariesLink graph data
SERP Dataserp_positions, serp_runs, queries, domain_keyword_rankingsSearch ranking data
Classificationclassification_values, keyword_classifications, classification_correctionsTaxonomy and corrections
Social/Brandsocial_accounts, social_content, brand_propertiesSocial media ownership
App Storeapp_store_categories, app_category_rankingsMobile app intelligence
Operationalapi_costs, pipeline_activity, workflow_instances, crawl_runsSystem operations
Subscriptionskeyword_subscriptions, app_subscriptions, domain_subscriptionsScheduled tracking

ER Diagram


Core Tables

domains

The central entity table for all tracked domains. Contains classification results, onboarding status, and aggregated metrics.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
domainTEXTNOT NULL, UNIQUEDomain name (e.g., "example.com")
root_domainTEXTRoot domain for subdomains
tldTEXTTop-level domain (.com, .org, etc.)
is_developerINTEGERDEFAULT 01 if this is an app developer domain
first_seenINTEGERNOT NULLUnix timestamp when first discovered
last_seenINTEGERNOT NULLUnix timestamp when last seen
created_atINTEGERCreation timestamp
updated_atINTEGERLast update timestamp

Classification Columns:

ColumnTypeDescription
tier1_typeTEXTPrimary archetype: publisher, ecommerce, saas, marketplace, service_provider, community, government, educational, nonprofit, social_platform
tier1_confidenceREALConfidence score (0-100)
tier1_signalsTEXTJSON array of signals used
domain_typeTEXTSpecific domain type (news, blog, magazine, etc.)
quality_tierTEXTpremium, high, medium, low, spam
channel_bucketTEXTorganic, paid, social, referral, etc.
media_typeTEXTowned, earned, shared, paid
ownership_typeTEXTowned, earned, shared, paid
domain_rankINTEGERDataForSEO domain rank (0-100)
is_high_riskINTEGER1 if flagged as risky
high_risk_reasonTEXTReason for high-risk flag

Onboarding Columns:

ColumnTypeDescription
data_tierTEXTcustomer, discovered, global_sample
discovery_sourceTEXTapp_crawl, manual, serp_crawl, backlink_crawl
onboard_statusTEXTpending, complete, failed
onboard_started_atINTEGERTimestamp
onboard_completed_atINTEGERTimestamp
onboard_duration_msINTEGERTotal onboard time
backlinks_fetched_atINTEGERWhen backlinks were last fetched
keywords_fetched_atINTEGERWhen keywords were last fetched
backlinks_countINTEGERDenormalized count
keywords_countINTEGERDenormalized count

Example Data:

INSERT INTO domains (id, domain, tier1_type, quality_tier, domain_rank, data_tier, onboard_status)
VALUES
(1, 'techcrunch.com', 'publisher', 'premium', 85, 'global_sample', 'complete'),
(2, 'shopify.com', 'saas', 'premium', 92, 'customer', 'complete'),
(3, 'reddit.com', 'community', 'premium', 95, 'global_sample', 'complete');

urls

Stores individual URLs with page-level classification and R2 content references.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
urlTEXTNOT NULL, UNIQUEFull URL
domainTEXTNOT NULLDomain name (denormalized)
domain_idINTEGERFKReference to domains.id
url_typeTEXTURL classification type
normalized_urlTEXTNormalized URL for deduplication
first_seenINTEGERNOT NULLFirst discovery timestamp
last_seenINTEGERNOT NULLLast seen timestamp

Classification Columns:

ColumnTypeDescription
structural_typeTEXTTop-level: article, detail, listing, thread, utility, corporate, reference, spam
structural_type_confidenceINTEGERConfidence (0-100)
page_typeTEXTSpecific page type (blog_post, product, category, etc.)
page_type_confidenceREALConfidence score
tactic_typeTEXTMarketing tactic (seo_content, pr, affiliate, etc.)
channel_bucketTEXTTraffic channel
media_typeTEXTPESO classification
quality_tierTEXTInherited or URL-specific quality
domain_rankINTEGERDenormalized from domain

Content Flags:

ColumnTypeDescription
has_videoINTEGER1 if video present
has_audioINTEGER1 if audio present
has_galleryINTEGER1 if image gallery
user_reviews_presentINTEGER1 if user reviews
affiliate_links_presentINTEGER1 if affiliate links
sponsored_disclosure_presentINTEGER1 if sponsored
paywalledINTEGER1 if paywalled
thin_contentINTEGER1 if thin content
auto_generatedINTEGER1 if programmatic

R2 Content Storage:

ColumnTypeDescription
r2_content_keyTEXTR2 key for extracted JSON content
r2_html_keyTEXTR2 key for raw gzipped HTML
content_crawled_atINTEGERWhen crawled
content_sourceTEXTnative, zenrows, instant_pages
content_costREALCost of crawl in USD

Example Data:

INSERT INTO urls (id, url, domain, domain_id, structural_type, page_type, quality_tier)
VALUES
(1, 'https://techcrunch.com/2024/01/15/startup-funding/', 'techcrunch.com', 1, 'article', 'news', 'premium'),
(2, 'https://shopify.com/pricing', 'shopify.com', 2, 'corporate', 'pricing', 'premium'),
(3, 'https://reddit.com/r/startups/comments/abc123', 'reddit.com', 3, 'thread', 'forum_thread', 'high');

keywords

Global keyword table with metrics from DataForSEO and classification dimensions.

ColumnTypeConstraintsDescription
idTEXTPKKeyword ID (hash or auto)
keywordTEXTNOT NULL, UNIQUEKeyword text
project_idTEXTLegacy project association
metricsJSONRaw DataForSEO metrics
category_idTEXTCategory association
savedINTEGERDEFAULT 0User saved flag
parent_seedTEXTParent keyword if expanded
created_atTEXTCreation timestamp

DataForSEO Metrics (via metrics JSON or direct columns):

FieldTypeDescription
search_volumeINTEGERMonthly search volume
cpcREALCost per click
keyword_difficultyINTEGERSEO difficulty (0-100)
competitionREALCompetition level (0-1)
search_intentTEXTDataForSEO intent
monthly_searchesJSONArray of monthly volumes

Classification Columns (prefix: classification_):

ColumnTypeDescription
classification_funnel_stageTEXTawareness, consideration, decision
classification_intent_typeTEXTinformational, navigational, commercial, transactional
classification_brand_classTEXTbrand_own, brand_competitor, generic
classification_keyword_patternTEXTquestion, comparison, superlative, etc.
classification_ranking_difficultyTEXTvery_easy, easy, medium, hard, very_hard
classification_commercial_value_indexREALDerived from volume * CPC

Example Data:

INSERT INTO keywords (id, keyword, search_volume, cpc, keyword_difficulty)
VALUES
('kw_abc123', 'best crm software', 12000, 45.50, 72),
('kw_def456', 'how to start a business', 85000, 8.20, 45),
('kw_ghi789', 'shopify vs woocommerce', 8500, 32.00, 58);

referring_domains

Stores referring domain data from DataForSEO backlinks API.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
target_domainTEXTNOT NULLDomain being analyzed
referring_domainTEXTNOT NULLDomain linking to target
rankINTEGERDomain rank (0-1000 scale)
backlinks_countINTEGERNumber of backlinks from this domain
spam_scoreINTEGERSpam score (0-100)

Link Quality Metrics:

ColumnTypeDescription
broken_backlinksINTEGERBroken link count
broken_pagesINTEGERBroken page count
referring_domains_countINTEGERThis domain's referring domains
referring_ipsINTEGERUnique IPs
referring_subnetsINTEGERUnique subnets

Distribution Data (JSON):

ColumnTypeDescription
links_tldTEXTTLD distribution
links_typesTEXTLink type distribution
links_attributesTEXTLink attribute distribution
links_platform_typesTEXTPlatform type distribution
links_countriesTEXTCountry distribution

Example Data:

INSERT INTO referring_domains (target_domain, referring_domain, rank, backlinks_count, spam_score)
VALUES
('shopify.com', 'forbes.com', 850, 45, 2),
('shopify.com', 'techcrunch.com', 780, 28, 1),
('shopify.com', 'spamsite.xyz', 50, 1000, 95);

Individual backlink records with source/target URL references and classification.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
source_url_idINTEGERNOT NULL, FKURL providing the link
target_url_idINTEGERNOT NULL, FKURL receiving the link
referring_domain_idINTEGERFKSource domain
target_domain_idINTEGERFKTarget domain
anchor_textTEXTLink anchor text
discovered_atINTEGERNOT NULLFirst discovery timestamp
last_seen_atINTEGERNOT NULLLast seen timestamp
is_dofollowINTEGERDEFAULT 11 if dofollow
link_strengthREALCalculated link strength

Referring Source Classification (denormalized):

ColumnTypeDescription
ref_domain_typeTEXTSource domain type
ref_tier1_typeTEXTSource tier1 archetype
ref_channel_bucketTEXTSource channel
ref_media_typeTEXTSource PESO type
ref_page_typeTEXTSource page type
ref_tactic_typeTEXTSource tactic
ref_quality_tierTEXTSource quality
ref_domain_rankINTEGERSource domain rank

Risk Signals:

ColumnTypeDescription
backlink_spam_scoreINTEGERSpam score (0-100)
is_brokenINTEGER1 if broken
is_newINTEGER1 if newly discovered
is_lostINTEGER1 if lost
risk_scoreINTEGERComputed risk (0-100)
risk_factorsTEXTJSON array of factors
semantic_locationTEXTheader, footer, sidebar, body

Classification Tables

classification_values

Single source of truth for all valid classification values (taxonomy).

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
entity_typeTEXTNOT NULLdomain, url, keyword, or all
dimensionTEXTNOT NULLtier1_type, page_type, etc.
valueTEXTNOT NULLThe classification value
display_nameTEXTHuman-readable name
descriptionTEXTHelp text
categoryTEXTGrouping category
sort_orderINTEGERDEFAULT 0Display order
is_activeINTEGERDEFAULT 11 if active

Key Dimensions:

Domain Dimensions:

  • tier1_type: publisher, ecommerce, saas, marketplace, service_provider, community, government, educational, nonprofit, social_platform, search_engine, infrastructure
  • domain_type: news, blog, magazine, review_site, directory, forum, wiki, q_and_a, video_platform, retail, agency, local_business, university, etc.
  • quality_tier: premium, high, medium, low, spam
  • ownership_type: owned, earned, shared, paid

URL Dimensions:

  • structural_type: article, detail, listing, thread, utility, corporate, reference, spam
  • page_type: blog_post, news, product, category, landing_page, homepage, pricing, documentation, forum_thread, video_page, tool_page, etc.
  • tactic_type: seo_content, thought_leadership, product_marketing, affiliate, ugc, pr, community, programmatic

Keyword Dimensions:

  • journey_moment: learning, evaluation, validation, action, post_purchase
  • intent_type: informational, navigational, commercial, transactional, local
  • expertise_level: novice, intermediate, advanced, expert
  • keyword_pattern: question, comparison, superlative, problem, use_case, brand_modifier, local_modifier

keyword_classifications

Normalized keyword classifications (one row per dimension per keyword).

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
keyword_idINTEGERNOT NULL, FKReference to keywords
dimensionTEXTNOT NULLClassification dimension
valueTEXTThe classification value
confidenceREALConfidence (0-100)
signalsTEXTJSON with reasoning/evidence
sourceTEXTrules, llm, vectorize, hybrid
versionINTEGERDEFAULT 1Classifier version
created_atINTEGERCreation timestamp
updated_atINTEGERUpdate timestamp

Unique Constraint: (keyword_id, dimension)


classification_corrections

Tracks user corrections to classifications for learning.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
entity_typeTEXTNOT NULLdomain, url, keyword
entity_idINTEGERNOT NULLID in respective table
entity_valueTEXTNOT NULLDomain/URL/keyword text
dimensionTEXTNOT NULLWhich dimension
original_valueTEXTWhat classifier said
original_confidenceINTEGERConfidence (0-100)
corrected_valueTEXTNOT NULLCorrect value
corrected_byTEXTUser ID or 'admin'
reasonTEXTCorrection reason
learning_statusTEXTDEFAULT 'pending'pending, learned, skipped

Operational Tables

api_costs

Tracks API costs per request for budget monitoring.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
run_idTEXTFK to crawl_runs
project_idTEXTFK to projects
serviceTEXTNOT NULLdataforseo, zenrows_basic, etc.
endpointTEXTSpecific endpoint
cost_usdREALNOT NULL, DEFAULT 0Cost in USD
request_countINTEGERDEFAULT 1Number of requests
successINTEGERDEFAULT 11 = success
items_returnedINTEGERItems returned
created_atINTEGERNOT NULLTimestamp

Service Types:

  • dataforseo - SERP, backlinks, keywords APIs
  • zenrows_basic - Basic proxy (~$1/1000)
  • zenrows_premium - Residential proxy (~$10/1000)
  • zenrows_js_render - JS rendering (~$5/1000)
  • rss_feed - Free RSS feeds

api_provider_usage

Daily API quota and cost tracking across all providers.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
dateTEXTNOT NULLYYYY-MM-DD
providerTEXTNOT NULLProvider name
requests_countINTEGERDEFAULT 0Request count
units_usedINTEGERDEFAULT 0API units consumed
operationsTEXTJSON breakdown
cost_centsINTEGERDEFAULT 0Cost in cents
quota_limitINTEGERDaily quota
quota_remainingINTEGERRemaining quota

pipeline_activity

Stores activity events for bottleneck detection.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
node_idTEXTNOT NULLDAG node ID
event_typeTEXTNOT NULLstarted, completed, failed, queued
run_idTEXTWorkflow run ID
domainTEXTDomain being processed
latency_msINTEGERTime taken
error_messageTEXTError details
metaTEXTJSON metadata
created_atINTEGERNOT NULLTimestamp

classification_failures

Logs URLs and keywords that failed classification.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
failure_typeTEXTNOT NULLurl or keyword
url_idINTEGERFor URL failures
urlTEXTURL text
domainTEXTDomain
keyword_idINTEGERFor keyword failures
keywordTEXTKeyword text
owner_domain_idINTEGERFKDomain being onboarded
error_messageTEXTError details
retry_countINTEGERDEFAULT 0Retry attempts
created_atINTEGERNOT NULLTimestamp

crawl_runs

Tracks crawl job execution status.

ColumnTypeConstraintsDescription
idTEXTPKRun ID
platformTEXTNOT NULLapple, google_play
deviceTEXTDevice type
started_atINTEGERNOT NULLStart timestamp
completed_atINTEGERCompletion timestamp
statusTEXTNOT NULL, DEFAULT 'pending'pending, in_progress, completed, failed
total_categoriesINTEGERNOT NULLCategories to crawl
categories_completedINTEGERDEFAULT 0Completed count
categories_failedINTEGERDEFAULT 0Failed count
total_apps_foundINTEGERDEFAULT 0Apps discovered
total_cost_usdREALDEFAULT 0Total cost
configTEXTJSON configuration

App Store Tables

apps

Canonical app metadata table.

ColumnTypeConstraintsDescription
idTEXTPKApp ID
platformTEXTNOT NULL, DEFAULT 'apple'apple or google
bundle_idTEXTBundle identifier
track_idTEXTStore track ID
titleTEXTApp title
subtitleTEXTApp subtitle
developer_nameTEXTDeveloper name
developer_idTEXTDeveloper ID
brand_id_intINTEGERFKBrand reference
primary_categoryTEXTMain category
categoriesJSONAll categories
ratingREALAverage rating
rating_countINTEGERNumber of ratings
priceREALPrice
is_freeINTEGERDEFAULT 01 if free
website_urlTEXTDeveloper website
website_domain_idINTEGERFKDomain reference

app_store_categories

Apple and Google Play category definitions.

ColumnTypeConstraintsDescription
idTEXTPKCategory ID
platformTEXTNOT NULLapple or google
typeTEXTNOT NULLgrouping, room, chart, category
nameTEXTNOT NULLCategory name
urlTEXTCategory URL
sectionTEXTapps, games, charts
metadataJSONAdditional data

app_category_rankings

App positions within categories over time.

ColumnTypeConstraintsDescription
idTEXTPKRanking ID
app_idTEXTNOT NULLApp reference
category_idTEXTNOT NULLCategory reference
platformTEXTNOT NULLapple or google
year_weekTEXTWeek identifier (YYYY-WNN)
app_rankINTEGERNOT NULLPosition in category
category_positionINTEGERCategory shelf position
shelf_nameTEXTShelf/grouping name
check_tsINTEGERNOT NULLCheck timestamp
brand_id_intINTEGERFKBrand reference

Social & Brand Tables

social_accounts

Unified account table for all social platforms.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
platformTEXTNOT NULLyoutube, x, instagram, etc.
account_idTEXTNOT NULLPlatform-specific ID
handleTEXT@handle
display_nameTEXTDisplay name
profile_urlTEXTProfile URL
domain_idINTEGERFKOwning domain
brand_idINTEGERFKOwning brand
follower_countINTEGERFollowers
content_countINTEGERContent items

Unique Constraint: (platform, account_id)


brand_properties

Official brand presences on external platforms.

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
domain_idINTEGERFKDomain owner
brand_idTEXTFKBrand owner
platformTEXTNOT NULLyoutube, x, facebook, etc.
property_urlTEXTNOT NULLFull URL
property_idTEXTPlatform ID (@handle, UCxxx)
canonical_idTEXTStable canonical ID
statusTEXTDEFAULT 'pending'pending, verified, rejected
confidenceREALLLM confidence
user_verifiedINTEGERDEFAULT 01 if user confirmed

SERP Tables

serp_positions

SERP ranking positions (denormalized for query performance).

ColumnTypeConstraintsDescription
run_idTEXTNOT NULL, PK (with position)SERP run reference
query_idTEXTNOT NULLQuery reference
project_idTEXTNOT NULLProject reference
keyword_idTEXTNOT NULLKeyword reference
check_tsINTEGERNOT NULLCheck timestamp
positionINTEGERNOT NULLSERP position
domainTEXTNOT NULLRanking domain
domain_idINTEGERFKDomain reference
urlTEXTNOT NULLRanking URL
titleTEXTSERP title
descriptionTEXTSERP description
result_typeTEXTorganic, featured_snippet, etc.

domain_keyword_rankings

Tracks which keywords a domain ranks for (from DataForSEO).

ColumnTypeConstraintsDescription
idINTEGERPK, AUTOINCREMENTInternal ID
domainTEXTNOT NULLDomain being tracked
keyword_idTEXTNOT NULL, FKKeyword reference
year_weekTEXTNOT NULLWeek snapshot (YYYY-WNN)
check_tsINTEGERNOT NULLCheck timestamp
rank_absoluteINTEGERPosition in full SERP
rank_groupINTEGERPosition in organic only
ranking_urlTEXTURL that ranks
etvREALEstimated traffic value
is_featured_snippetINTEGERDEFAULT 01 if featured
location_codeINTEGERDEFAULT 2840Location (US)

Unique Constraint: (domain, keyword_id, year_week, location_code)


Index Strategy

Core Lookup Indexes

-- Domain lookups (most frequent)
CREATE INDEX idx_domains_domain ON domains(domain);
CREATE INDEX idx_domains_tier1_type ON domains(tier1_type);
CREATE INDEX idx_domains_quality_tier ON domains(quality_tier);

-- URL lookups
CREATE INDEX idx_urls_domain_id ON urls(domain_id);
CREATE INDEX idx_urls_page_type ON urls(page_type);
CREATE INDEX idx_urls_structural_type ON urls(structural_type);

-- Keyword lookups
CREATE UNIQUE INDEX idx_keywords_keyword_unique ON keywords(keyword);

Classification Indexes

-- Find low-confidence items for review
CREATE INDEX idx_domains_low_confidence ON domains(domain_type_confidence)
WHERE domain_type_confidence < 70;

CREATE INDEX idx_urls_low_page_type_confidence ON urls(page_type_confidence)
WHERE page_type_confidence < 70;

-- Find unclassified URLs
CREATE INDEX idx_urls_unclassified ON urls(domain_id)
WHERE page_type IS NULL;

Time-Series Indexes

-- Recent data queries
CREATE INDEX idx_urls_created_at ON urls(created_at DESC);
CREATE INDEX idx_backlinks_created_at ON backlinks(created_at DESC);
CREATE INDEX idx_serp_check_ts ON serp_positions(check_ts DESC);

-- Domain keyword rankings time series
CREATE INDEX idx_dkr_domain_week ON domain_keyword_rankings(domain, year_week DESC);

Composite Indexes for Common Queries

-- SERP position lookups
CREATE INDEX idx_serp_kw_proj_ts ON serp_positions(keyword_id, project_id, check_ts DESC);

-- Backlink quality filtering
CREATE INDEX idx_backlinks_quality_media ON backlinks(ref_quality_tier, ref_media_type);
CREATE INDEX idx_backlinks_channel_quality ON backlinks(ref_channel_bucket, ref_quality_tier);

-- App rankings
CREATE INDEX idx_acr_year_week_cat ON app_category_rankings(year_week DESC, category_id);

Common Queries

Domain Analysis

-- Get domain with all classification details
SELECT
id, domain, tier1_type, domain_type, quality_tier, domain_rank,
onboard_status, data_tier, backlinks_count, keywords_count
FROM domains
WHERE domain = 'example.com';

-- Find high-quality publisher domains
SELECT domain, domain_rank, backlinks_count
FROM domains
WHERE tier1_type = 'publisher'
AND quality_tier IN ('premium', 'high')
AND domain_rank > 70
ORDER BY domain_rank DESC
LIMIT 100;

-- Domains needing classification review
SELECT domain, tier1_type, tier1_confidence
FROM domains
WHERE tier1_confidence < 70 OR needs_review = 1
ORDER BY tier1_confidence ASC;

URL Classification

-- Get URLs for a domain by structural type
SELECT url, structural_type, page_type, classification_confidence
FROM urls
WHERE domain_id = ?
ORDER BY structural_type, page_type;

-- Count page types for a domain
SELECT page_type, COUNT(*) as count
FROM urls
WHERE domain_id = ?
GROUP BY page_type
ORDER BY count DESC;

-- Find unclassified URLs
SELECT u.url, d.domain
FROM urls u
JOIN domains d ON u.domain_id = d.id
WHERE u.page_type IS NULL
LIMIT 1000;
-- Get backlink breakdown by quality tier
SELECT ref_quality_tier, COUNT(*) as count
FROM backlinks
WHERE target_domain_id = ?
GROUP BY ref_quality_tier
ORDER BY count DESC;

-- Find high-value earned media backlinks
SELECT b.*, u.url as source_url
FROM backlinks b
JOIN urls u ON b.source_url_id = u.id
WHERE b.target_domain_id = ?
AND b.ref_media_type = 'earned'
AND b.ref_quality_tier IN ('premium', 'high')
ORDER BY b.ref_domain_rank DESC
LIMIT 100;

-- Risky backlinks needing review
SELECT source_url_id, anchor_text, risk_score, risk_factors
FROM backlinks
WHERE target_domain_id = ?
AND (risk_score > 70 OR is_broken = 1)
ORDER BY risk_score DESC;

Keyword Analysis

-- Get keyword with all classifications
SELECT k.*, kc.dimension, kc.value, kc.confidence
FROM keywords k
LEFT JOIN keyword_classifications kc ON k.id = kc.keyword_id
WHERE k.keyword = 'best crm software';

-- High-value keywords by commercial index
SELECT keyword, search_volume, cpc, classification_commercial_value_index
FROM keywords
WHERE classification_commercial_value_index > 1000
ORDER BY classification_commercial_value_index DESC
LIMIT 100;

-- Keywords by journey stage
SELECT kc.value as journey_moment, COUNT(*) as count
FROM keyword_classifications kc
WHERE kc.dimension = 'journey_moment'
GROUP BY kc.value
ORDER BY count DESC;

Cost Tracking

-- Daily cost by service
SELECT
date(created_at / 1000, 'unixepoch') as date,
service,
SUM(cost_usd) as total_cost,
COUNT(*) as requests
FROM api_costs
WHERE created_at >= strftime('%s', 'now', '-30 days') * 1000
GROUP BY date, service
ORDER BY date DESC, total_cost DESC;

-- Cost by project
SELECT project_id, SUM(cost_usd) as total_cost
FROM api_costs
WHERE project_id IS NOT NULL
GROUP BY project_id
ORDER BY total_cost DESC;

Classification Taxonomy

-- Get all valid page types
SELECT value, display_name, description
FROM classification_values
WHERE entity_type = 'url'
AND dimension = 'page_type'
AND is_active = 1
ORDER BY sort_order;

-- Get page types for a structural type (cascading filter)
SELECT cv.value, cv.display_name
FROM classification_values cv
WHERE cv.entity_type = 'url'
AND cv.dimension = 'page_type'
AND cv.category = 'article' -- Maps to structural_type
AND cv.is_active = 1
ORDER BY cv.sort_order;

Migration History

Notable Schema Changes

MigrationDateDescription
0001InitialInitial schema with topics and embeddings_meta
0003EarlyCategories and keywords tables
010EarlyApps table for mobile app metadata
0042EarlyBrands table with SHA-256 hash IDs
0044EarlySplit subscriptions into type-specific tables
0079Phase 0Classification columns for OEPS model
0086MajorUnified domains/urls/brands with INTEGER IDs
0091BacklinksQuality tier and domain rank columns
0093KeywordsKeyword classification dimensions
0097Tier-1Simplified 7-archetype system
0102ConfidencePer-dimension confidence scores
0107KeywordsNormalized keyword_classifications table
0113SocialBrand properties for ownership resolution
0116SocialUnified social_accounts and social_content
0119LearningClassification corrections table
0131Backlinksreferring_domains table
0135Taxonomyclassification_values master taxonomy
0152-153Hierarchystructural_type parent layer for page_type
0148PerformanceCritical missing indexes

Key Architecture Decisions

  1. INTEGER IDs over TEXT hashes (0086): Switched from SHA-256 hashes to auto-increment integers for better join performance and simpler debugging.

  2. Denormalized Classification (0079, 0091): Classification results stored directly on entities for query performance, avoiding expensive joins.

  3. Per-Dimension Confidence (0102): Individual confidence scores per classification dimension enables targeted LLM verification.

  4. Normalized Keyword Classifications (0107): One row per dimension avoids 50+ column explosion on keywords table.

  5. Structural Type Hierarchy (0152): Added parent layer for page_type to enable cascading filters (structural_type -> page_type).

  6. Unified Social Tables (0116): Single schema for all platforms instead of platform-specific tables (youtube_channels, etc.).


Performance Considerations

D1 Limitations

  • No concurrent writes: D1 uses SQLite, so writes are serialized
  • Row size limits: Keep JSON columns reasonably sized
  • Index overhead: Balance query performance vs write cost

Best Practices

  1. Batch inserts: Use INSERT OR IGNORE with multiple values
  2. Partial indexes: Use WHERE clauses on indexes for sparse data
  3. Denormalization: Store frequently-joined data directly on entities
  4. Avoid N+1: Use JOINs instead of multiple queries

Monitoring Queries

-- Table sizes
SELECT name, COUNT(*) as rows
FROM sqlite_master, (SELECT 1 FROM domains LIMIT 1)
WHERE type = 'table' AND name NOT LIKE 'sqlite_%';

-- Index usage (check with EXPLAIN QUERY PLAN)
EXPLAIN QUERY PLAN
SELECT * FROM urls WHERE domain_id = ? AND page_type = 'blog_post';