D1 Database Schema Documentation

This document provides comprehensive documentation for the RankDisco D1 (SQLite) database schema, including all tables, relationships, indexes, and common query patterns.

Schema Overview

RankDisco uses Cloudflare D1 (SQLite) as its primary database. The schema is designed around three core concepts:

Entity Tables - Core data entities (domains, URLs, keywords, apps, brands)
Classification Tables - Multi-dimensional classification results with confidence scores
Operational Tables - Workflow tracking, cost monitoring, and activity history

Table Categories

Category	Tables	Purpose
Core Entities	`domains`, `urls`, `keywords`, `brands`, `apps`	Primary data entities
Backlink Intelligence	`backlinks`, `referring_domains`, `domain_summaries`	Link graph data
SERP Data	`serp_positions`, `serp_runs`, `queries`, `domain_keyword_rankings`	Search ranking data
Classification	`classification_values`, `keyword_classifications`, `classification_corrections`	Taxonomy and corrections
Social/Brand	`social_accounts`, `social_content`, `brand_properties`	Social media ownership
App Store	`app_store_categories`, `app_category_rankings`	Mobile app intelligence
Operational	`api_costs`, `pipeline_activity`, `workflow_instances`, `crawl_runs`	System operations
Subscriptions	`keyword_subscriptions`, `app_subscriptions`, `domain_subscriptions`	Scheduled tracking

ER Diagram

Core Tables

domains

The central entity table for all tracked domains. Contains classification results, onboarding status, and aggregated metrics.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`domain`	TEXT	NOT NULL, UNIQUE	Domain name (e.g., "example.com")
`root_domain`	TEXT		Root domain for subdomains
`tld`	TEXT		Top-level domain (.com, .org, etc.)
`is_developer`	INTEGER	DEFAULT 0	1 if this is an app developer domain
`first_seen`	INTEGER	NOT NULL	Unix timestamp when first discovered
`last_seen`	INTEGER	NOT NULL	Unix timestamp when last seen
`created_at`	INTEGER		Creation timestamp
`updated_at`	INTEGER		Last update timestamp

Classification Columns:

Column	Type	Description
`tier1_type`	TEXT	Primary archetype: publisher, ecommerce, saas, marketplace, service_provider, community, government, educational, nonprofit, social_platform
`tier1_confidence`	REAL	Confidence score (0-100)
`tier1_signals`	TEXT	JSON array of signals used
`domain_type`	TEXT	Specific domain type (news, blog, magazine, etc.)
`quality_tier`	TEXT	premium, high, medium, low, spam
`channel_bucket`	TEXT	organic, paid, social, referral, etc.
`media_type`	TEXT	owned, earned, shared, paid
`ownership_type`	TEXT	owned, earned, shared, paid
`domain_rank`	INTEGER	DataForSEO domain rank (0-100)
`is_high_risk`	INTEGER	1 if flagged as risky
`high_risk_reason`	TEXT	Reason for high-risk flag

Onboarding Columns:

Column	Type	Description
`data_tier`	TEXT	customer, discovered, global_sample
`discovery_source`	TEXT	app_crawl, manual, serp_crawl, backlink_crawl
`onboard_status`	TEXT	pending, complete, failed
`onboard_started_at`	INTEGER	Timestamp
`onboard_completed_at`	INTEGER	Timestamp
`onboard_duration_ms`	INTEGER	Total onboard time
`backlinks_fetched_at`	INTEGER	When backlinks were last fetched
`keywords_fetched_at`	INTEGER	When keywords were last fetched
`backlinks_count`	INTEGER	Denormalized count
`keywords_count`	INTEGER	Denormalized count

Example Data:

INSERT INTO domains (id, domain, tier1_type, quality_tier, domain_rank, data_tier, onboard_status)
VALUES 
  (1, 'techcrunch.com', 'publisher', 'premium', 85, 'global_sample', 'complete'),
  (2, 'shopify.com', 'saas', 'premium', 92, 'customer', 'complete'),
  (3, 'reddit.com', 'community', 'premium', 95, 'global_sample', 'complete');

urls

Stores individual URLs with page-level classification and R2 content references.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`url`	TEXT	NOT NULL, UNIQUE	Full URL
`domain`	TEXT	NOT NULL	Domain name (denormalized)
`domain_id`	INTEGER	FK	Reference to domains.id
`url_type`	TEXT		URL classification type
`normalized_url`	TEXT		Normalized URL for deduplication
`first_seen`	INTEGER	NOT NULL	First discovery timestamp
`last_seen`	INTEGER	NOT NULL	Last seen timestamp

Classification Columns:

Column	Type	Description
`structural_type`	TEXT	Top-level: article, detail, listing, thread, utility, corporate, reference, spam
`structural_type_confidence`	INTEGER	Confidence (0-100)
`page_type`	TEXT	Specific page type (blog_post, product, category, etc.)
`page_type_confidence`	REAL	Confidence score
`tactic_type`	TEXT	Marketing tactic (seo_content, pr, affiliate, etc.)
`channel_bucket`	TEXT	Traffic channel
`media_type`	TEXT	PESO classification
`quality_tier`	TEXT	Inherited or URL-specific quality
`domain_rank`	INTEGER	Denormalized from domain

Content Flags:

Column	Type	Description
`has_video`	INTEGER	1 if video present
`has_audio`	INTEGER	1 if audio present
`has_gallery`	INTEGER	1 if image gallery
`user_reviews_present`	INTEGER	1 if user reviews
`affiliate_links_present`	INTEGER	1 if affiliate links
`sponsored_disclosure_present`	INTEGER	1 if sponsored
`paywalled`	INTEGER	1 if paywalled
`thin_content`	INTEGER	1 if thin content
`auto_generated`	INTEGER	1 if programmatic

R2 Content Storage:

Column	Type	Description
`r2_content_key`	TEXT	R2 key for extracted JSON content
`r2_html_key`	TEXT	R2 key for raw gzipped HTML
`content_crawled_at`	INTEGER	When crawled
`content_source`	TEXT	native, zenrows, instant_pages
`content_cost`	REAL	Cost of crawl in USD

Example Data:

INSERT INTO urls (id, url, domain, domain_id, structural_type, page_type, quality_tier)
VALUES
  (1, 'https://techcrunch.com/2024/01/15/startup-funding/', 'techcrunch.com', 1, 'article', 'news', 'premium'),
  (2, 'https://shopify.com/pricing', 'shopify.com', 2, 'corporate', 'pricing', 'premium'),
  (3, 'https://reddit.com/r/startups/comments/abc123', 'reddit.com', 3, 'thread', 'forum_thread', 'high');

keywords

Global keyword table with metrics from DataForSEO and classification dimensions.

Column	Type	Constraints	Description
`id`	TEXT	PK	Keyword ID (hash or auto)
`keyword`	TEXT	NOT NULL, UNIQUE	Keyword text
`project_id`	TEXT		Legacy project association
`metrics`	JSON		Raw DataForSEO metrics
`category_id`	TEXT		Category association
`saved`	INTEGER	DEFAULT 0	User saved flag
`parent_seed`	TEXT		Parent keyword if expanded
`created_at`	TEXT		Creation timestamp

DataForSEO Metrics (via metrics JSON or direct columns):

Field	Type	Description
`search_volume`	INTEGER	Monthly search volume
`cpc`	REAL	Cost per click
`keyword_difficulty`	INTEGER	SEO difficulty (0-100)
`competition`	REAL	Competition level (0-1)
`search_intent`	TEXT	DataForSEO intent
`monthly_searches`	JSON	Array of monthly volumes

Classification Columns (prefix: classification_):

Column	Type	Description
`classification_funnel_stage`	TEXT	awareness, consideration, decision
`classification_intent_type`	TEXT	informational, navigational, commercial, transactional
`classification_brand_class`	TEXT	brand_own, brand_competitor, generic
`classification_keyword_pattern`	TEXT	question, comparison, superlative, etc.
`classification_ranking_difficulty`	TEXT	very_easy, easy, medium, hard, very_hard
`classification_commercial_value_index`	REAL	Derived from volume * CPC

Example Data:

INSERT INTO keywords (id, keyword, search_volume, cpc, keyword_difficulty)
VALUES
  ('kw_abc123', 'best crm software', 12000, 45.50, 72),
  ('kw_def456', 'how to start a business', 85000, 8.20, 45),
  ('kw_ghi789', 'shopify vs woocommerce', 8500, 32.00, 58);

referring_domains

Stores referring domain data from DataForSEO backlinks API.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`target_domain`	TEXT	NOT NULL	Domain being analyzed
`referring_domain`	TEXT	NOT NULL	Domain linking to target
`rank`	INTEGER		Domain rank (0-1000 scale)
`backlinks_count`	INTEGER		Number of backlinks from this domain
`spam_score`	INTEGER		Spam score (0-100)

Link Quality Metrics:

Column	Type	Description
`broken_backlinks`	INTEGER	Broken link count
`broken_pages`	INTEGER	Broken page count
`referring_domains_count`	INTEGER	This domain's referring domains
`referring_ips`	INTEGER	Unique IPs
`referring_subnets`	INTEGER	Unique subnets

Distribution Data (JSON):

Column	Type	Description
`links_tld`	TEXT	TLD distribution
`links_types`	TEXT	Link type distribution
`links_attributes`	TEXT	Link attribute distribution
`links_platform_types`	TEXT	Platform type distribution
`links_countries`	TEXT	Country distribution

Example Data:

INSERT INTO referring_domains (target_domain, referring_domain, rank, backlinks_count, spam_score)
VALUES
  ('shopify.com', 'forbes.com', 850, 45, 2),
  ('shopify.com', 'techcrunch.com', 780, 28, 1),
  ('shopify.com', 'spamsite.xyz', 50, 1000, 95);

backlinks

Individual backlink records with source/target URL references and classification.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`source_url_id`	INTEGER	NOT NULL, FK	URL providing the link
`target_url_id`	INTEGER	NOT NULL, FK	URL receiving the link
`referring_domain_id`	INTEGER	FK	Source domain
`target_domain_id`	INTEGER	FK	Target domain
`anchor_text`	TEXT		Link anchor text
`discovered_at`	INTEGER	NOT NULL	First discovery timestamp
`last_seen_at`	INTEGER	NOT NULL	Last seen timestamp
`is_dofollow`	INTEGER	DEFAULT 1	1 if dofollow
`link_strength`	REAL		Calculated link strength

Referring Source Classification (denormalized):

Column	Type	Description
`ref_domain_type`	TEXT	Source domain type
`ref_tier1_type`	TEXT	Source tier1 archetype
`ref_channel_bucket`	TEXT	Source channel
`ref_media_type`	TEXT	Source PESO type
`ref_page_type`	TEXT	Source page type
`ref_tactic_type`	TEXT	Source tactic
`ref_quality_tier`	TEXT	Source quality
`ref_domain_rank`	INTEGER	Source domain rank

Risk Signals:

Column	Type	Description
`backlink_spam_score`	INTEGER	Spam score (0-100)
`is_broken`	INTEGER	1 if broken
`is_new`	INTEGER	1 if newly discovered
`is_lost`	INTEGER	1 if lost
`risk_score`	INTEGER	Computed risk (0-100)
`risk_factors`	TEXT	JSON array of factors
`semantic_location`	TEXT	header, footer, sidebar, body

Classification Tables

classification_values

Single source of truth for all valid classification values (taxonomy).

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`entity_type`	TEXT	NOT NULL	domain, url, keyword, or all
`dimension`	TEXT	NOT NULL	tier1_type, page_type, etc.
`value`	TEXT	NOT NULL	The classification value
`display_name`	TEXT		Human-readable name
`description`	TEXT		Help text
`category`	TEXT		Grouping category
`sort_order`	INTEGER	DEFAULT 0	Display order
`is_active`	INTEGER	DEFAULT 1	1 if active

Key Dimensions:

Domain Dimensions:

tier1_type: publisher, ecommerce, saas, marketplace, service_provider, community, government, educational, nonprofit, social_platform, search_engine, infrastructure
domain_type: news, blog, magazine, review_site, directory, forum, wiki, q_and_a, video_platform, retail, agency, local_business, university, etc.
quality_tier: premium, high, medium, low, spam
ownership_type: owned, earned, shared, paid

URL Dimensions:

structural_type: article, detail, listing, thread, utility, corporate, reference, spam
page_type: blog_post, news, product, category, landing_page, homepage, pricing, documentation, forum_thread, video_page, tool_page, etc.
tactic_type: seo_content, thought_leadership, product_marketing, affiliate, ugc, pr, community, programmatic

Keyword Dimensions:

journey_moment: learning, evaluation, validation, action, post_purchase
intent_type: informational, navigational, commercial, transactional, local
expertise_level: novice, intermediate, advanced, expert
keyword_pattern: question, comparison, superlative, problem, use_case, brand_modifier, local_modifier

keyword_classifications

Normalized keyword classifications (one row per dimension per keyword).

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`keyword_id`	INTEGER	NOT NULL, FK	Reference to keywords
`dimension`	TEXT	NOT NULL	Classification dimension
`value`	TEXT		The classification value
`confidence`	REAL		Confidence (0-100)
`signals`	TEXT		JSON with reasoning/evidence
`source`	TEXT		rules, llm, vectorize, hybrid
`version`	INTEGER	DEFAULT 1	Classifier version
`created_at`	INTEGER		Creation timestamp
`updated_at`	INTEGER		Update timestamp

Unique Constraint: (keyword_id, dimension)

classification_corrections

Tracks user corrections to classifications for learning.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`entity_type`	TEXT	NOT NULL	domain, url, keyword
`entity_id`	INTEGER	NOT NULL	ID in respective table
`entity_value`	TEXT	NOT NULL	Domain/URL/keyword text
`dimension`	TEXT	NOT NULL	Which dimension
`original_value`	TEXT		What classifier said
`original_confidence`	INTEGER		Confidence (0-100)
`corrected_value`	TEXT	NOT NULL	Correct value
`corrected_by`	TEXT		User ID or 'admin'
`reason`	TEXT		Correction reason
`learning_status`	TEXT	DEFAULT 'pending'	pending, learned, skipped

Operational Tables

api_costs

Tracks API costs per request for budget monitoring.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`run_id`	TEXT		FK to crawl_runs
`project_id`	TEXT		FK to projects
`service`	TEXT	NOT NULL	dataforseo, zenrows_basic, etc.
`endpoint`	TEXT		Specific endpoint
`cost_usd`	REAL	NOT NULL, DEFAULT 0	Cost in USD
`request_count`	INTEGER	DEFAULT 1	Number of requests
`success`	INTEGER	DEFAULT 1	1 = success
`items_returned`	INTEGER		Items returned
`created_at`	INTEGER	NOT NULL	Timestamp

Service Types:

dataforseo - SERP, backlinks, keywords APIs
zenrows_basic - Basic proxy (~$1/1000)
zenrows_premium - Residential proxy (~$10/1000)
zenrows_js_render - JS rendering (~$5/1000)
rss_feed - Free RSS feeds

api_provider_usage

Daily API quota and cost tracking across all providers.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`date`	TEXT	NOT NULL	YYYY-MM-DD
`provider`	TEXT	NOT NULL	Provider name
`requests_count`	INTEGER	DEFAULT 0	Request count
`units_used`	INTEGER	DEFAULT 0	API units consumed
`operations`	TEXT		JSON breakdown
`cost_cents`	INTEGER	DEFAULT 0	Cost in cents
`quota_limit`	INTEGER		Daily quota
`quota_remaining`	INTEGER		Remaining quota

pipeline_activity

Stores activity events for bottleneck detection.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`node_id`	TEXT	NOT NULL	DAG node ID
`event_type`	TEXT	NOT NULL	started, completed, failed, queued
`run_id`	TEXT		Workflow run ID
`domain`	TEXT		Domain being processed
`latency_ms`	INTEGER		Time taken
`error_message`	TEXT		Error details
`meta`	TEXT		JSON metadata
`created_at`	INTEGER	NOT NULL	Timestamp

classification_failures

Logs URLs and keywords that failed classification.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`failure_type`	TEXT	NOT NULL	url or keyword
`url_id`	INTEGER		For URL failures
`url`	TEXT		URL text
`domain`	TEXT		Domain
`keyword_id`	INTEGER		For keyword failures
`keyword`	TEXT		Keyword text
`owner_domain_id`	INTEGER	FK	Domain being onboarded
`error_message`	TEXT		Error details
`retry_count`	INTEGER	DEFAULT 0	Retry attempts
`created_at`	INTEGER	NOT NULL	Timestamp

crawl_runs

Tracks crawl job execution status.

Column	Type	Constraints	Description
`id`	TEXT	PK	Run ID
`platform`	TEXT	NOT NULL	apple, google_play
`device`	TEXT		Device type
`started_at`	INTEGER	NOT NULL	Start timestamp
`completed_at`	INTEGER		Completion timestamp
`status`	TEXT	NOT NULL, DEFAULT 'pending'	pending, in_progress, completed, failed
`total_categories`	INTEGER	NOT NULL	Categories to crawl
`categories_completed`	INTEGER	DEFAULT 0	Completed count
`categories_failed`	INTEGER	DEFAULT 0	Failed count
`total_apps_found`	INTEGER	DEFAULT 0	Apps discovered
`total_cost_usd`	REAL	DEFAULT 0	Total cost
`config`	TEXT		JSON configuration

App Store Tables

apps

Canonical app metadata table.

Column	Type	Constraints	Description
`id`	TEXT	PK	App ID
`platform`	TEXT	NOT NULL, DEFAULT 'apple'	apple or google
`bundle_id`	TEXT		Bundle identifier
`track_id`	TEXT		Store track ID
`title`	TEXT		App title
`subtitle`	TEXT		App subtitle
`developer_name`	TEXT		Developer name
`developer_id`	TEXT		Developer ID
`brand_id_int`	INTEGER	FK	Brand reference
`primary_category`	TEXT		Main category
`categories`	JSON		All categories
`rating`	REAL		Average rating
`rating_count`	INTEGER		Number of ratings
`price`	REAL		Price
`is_free`	INTEGER	DEFAULT 0	1 if free
`website_url`	TEXT		Developer website
`website_domain_id`	INTEGER	FK	Domain reference

app_store_categories

Apple and Google Play category definitions.

Column	Type	Constraints	Description
`id`	TEXT	PK	Category ID
`platform`	TEXT	NOT NULL	apple or google
`type`	TEXT	NOT NULL	grouping, room, chart, category
`name`	TEXT	NOT NULL	Category name
`url`	TEXT		Category URL
`section`	TEXT		apps, games, charts
`metadata`	JSON		Additional data

app_category_rankings

App positions within categories over time.

Column	Type	Constraints	Description
`id`	TEXT	PK	Ranking ID
`app_id`	TEXT	NOT NULL	App reference
`category_id`	TEXT	NOT NULL	Category reference
`platform`	TEXT	NOT NULL	apple or google
`year_week`	TEXT		Week identifier (YYYY-WNN)
`app_rank`	INTEGER	NOT NULL	Position in category
`category_position`	INTEGER		Category shelf position
`shelf_name`	TEXT		Shelf/grouping name
`check_ts`	INTEGER	NOT NULL	Check timestamp
`brand_id_int`	INTEGER	FK	Brand reference

social_accounts

Unified account table for all social platforms.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`platform`	TEXT	NOT NULL	youtube, x, instagram, etc.
`account_id`	TEXT	NOT NULL	Platform-specific ID
`handle`	TEXT		@handle
`display_name`	TEXT		Display name
`profile_url`	TEXT		Profile URL
`domain_id`	INTEGER	FK	Owning domain
`brand_id`	INTEGER	FK	Owning brand
`follower_count`	INTEGER		Followers
`content_count`	INTEGER		Content items

Unique Constraint: (platform, account_id)

brand_properties

Official brand presences on external platforms.

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`domain_id`	INTEGER	FK	Domain owner
`brand_id`	TEXT	FK	Brand owner
`platform`	TEXT	NOT NULL	youtube, x, facebook, etc.
`property_url`	TEXT	NOT NULL	Full URL
`property_id`	TEXT		Platform ID (@handle, UCxxx)
`canonical_id`	TEXT		Stable canonical ID
`status`	TEXT	DEFAULT 'pending'	pending, verified, rejected
`confidence`	REAL		LLM confidence
`user_verified`	INTEGER	DEFAULT 0	1 if user confirmed

SERP Tables

serp_positions

SERP ranking positions (denormalized for query performance).

Column	Type	Constraints	Description
`run_id`	TEXT	NOT NULL, PK (with position)	SERP run reference
`query_id`	TEXT	NOT NULL	Query reference
`project_id`	TEXT	NOT NULL	Project reference
`keyword_id`	TEXT	NOT NULL	Keyword reference
`check_ts`	INTEGER	NOT NULL	Check timestamp
`position`	INTEGER	NOT NULL	SERP position
`domain`	TEXT	NOT NULL	Ranking domain
`domain_id`	INTEGER	FK	Domain reference
`url`	TEXT	NOT NULL	Ranking URL
`title`	TEXT		SERP title
`description`	TEXT		SERP description
`result_type`	TEXT		organic, featured_snippet, etc.

domain_keyword_rankings

Tracks which keywords a domain ranks for (from DataForSEO).

Column	Type	Constraints	Description
`id`	INTEGER	PK, AUTOINCREMENT	Internal ID
`domain`	TEXT	NOT NULL	Domain being tracked
`keyword_id`	TEXT	NOT NULL, FK	Keyword reference
`year_week`	TEXT	NOT NULL	Week snapshot (YYYY-WNN)
`check_ts`	INTEGER	NOT NULL	Check timestamp
`rank_absolute`	INTEGER		Position in full SERP
`rank_group`	INTEGER		Position in organic only
`ranking_url`	TEXT		URL that ranks
`etv`	REAL		Estimated traffic value
`is_featured_snippet`	INTEGER	DEFAULT 0	1 if featured
`location_code`	INTEGER	DEFAULT 2840	Location (US)

Unique Constraint: (domain, keyword_id, year_week, location_code)

Index Strategy

Core Lookup Indexes

-- Domain lookups (most frequent)
CREATE INDEX idx_domains_domain ON domains(domain);
CREATE INDEX idx_domains_tier1_type ON domains(tier1_type);
CREATE INDEX idx_domains_quality_tier ON domains(quality_tier);

-- URL lookups
CREATE INDEX idx_urls_domain_id ON urls(domain_id);
CREATE INDEX idx_urls_page_type ON urls(page_type);
CREATE INDEX idx_urls_structural_type ON urls(structural_type);

-- Keyword lookups
CREATE UNIQUE INDEX idx_keywords_keyword_unique ON keywords(keyword);

Classification Indexes

-- Find low-confidence items for review
CREATE INDEX idx_domains_low_confidence ON domains(domain_type_confidence) 
  WHERE domain_type_confidence < 70;

CREATE INDEX idx_urls_low_page_type_confidence ON urls(page_type_confidence) 
  WHERE page_type_confidence < 70;

-- Find unclassified URLs
CREATE INDEX idx_urls_unclassified ON urls(domain_id) 
  WHERE page_type IS NULL;

Time-Series Indexes

-- Recent data queries
CREATE INDEX idx_urls_created_at ON urls(created_at DESC);
CREATE INDEX idx_backlinks_created_at ON backlinks(created_at DESC);
CREATE INDEX idx_serp_check_ts ON serp_positions(check_ts DESC);

-- Domain keyword rankings time series
CREATE INDEX idx_dkr_domain_week ON domain_keyword_rankings(domain, year_week DESC);

Composite Indexes for Common Queries

-- SERP position lookups
CREATE INDEX idx_serp_kw_proj_ts ON serp_positions(keyword_id, project_id, check_ts DESC);

-- Backlink quality filtering
CREATE INDEX idx_backlinks_quality_media ON backlinks(ref_quality_tier, ref_media_type);
CREATE INDEX idx_backlinks_channel_quality ON backlinks(ref_channel_bucket, ref_quality_tier);

-- App rankings
CREATE INDEX idx_acr_year_week_cat ON app_category_rankings(year_week DESC, category_id);

Common Queries

Domain Analysis

-- Get domain with all classification details
SELECT 
  id, domain, tier1_type, domain_type, quality_tier, domain_rank,
  onboard_status, data_tier, backlinks_count, keywords_count
FROM domains
WHERE domain = 'example.com';

-- Find high-quality publisher domains
SELECT domain, domain_rank, backlinks_count
FROM domains
WHERE tier1_type = 'publisher' 
  AND quality_tier IN ('premium', 'high')
  AND domain_rank > 70
ORDER BY domain_rank DESC
LIMIT 100;

-- Domains needing classification review
SELECT domain, tier1_type, tier1_confidence
FROM domains
WHERE tier1_confidence < 70 OR needs_review = 1
ORDER BY tier1_confidence ASC;

URL Classification

-- Get URLs for a domain by structural type
SELECT url, structural_type, page_type, classification_confidence
FROM urls
WHERE domain_id = ?
ORDER BY structural_type, page_type;

-- Count page types for a domain
SELECT page_type, COUNT(*) as count
FROM urls
WHERE domain_id = ?
GROUP BY page_type
ORDER BY count DESC;

-- Find unclassified URLs
SELECT u.url, d.domain
FROM urls u
JOIN domains d ON u.domain_id = d.id
WHERE u.page_type IS NULL
LIMIT 1000;

Backlink Intelligence

-- Get backlink breakdown by quality tier
SELECT ref_quality_tier, COUNT(*) as count
FROM backlinks
WHERE target_domain_id = ?
GROUP BY ref_quality_tier
ORDER BY count DESC;

-- Find high-value earned media backlinks
SELECT b.*, u.url as source_url
FROM backlinks b
JOIN urls u ON b.source_url_id = u.id
WHERE b.target_domain_id = ?
  AND b.ref_media_type = 'earned'
  AND b.ref_quality_tier IN ('premium', 'high')
ORDER BY b.ref_domain_rank DESC
LIMIT 100;

-- Risky backlinks needing review
SELECT source_url_id, anchor_text, risk_score, risk_factors
FROM backlinks
WHERE target_domain_id = ?
  AND (risk_score > 70 OR is_broken = 1)
ORDER BY risk_score DESC;

Keyword Analysis

-- Get keyword with all classifications
SELECT k.*, kc.dimension, kc.value, kc.confidence
FROM keywords k
LEFT JOIN keyword_classifications kc ON k.id = kc.keyword_id
WHERE k.keyword = 'best crm software';

-- High-value keywords by commercial index
SELECT keyword, search_volume, cpc, classification_commercial_value_index
FROM keywords
WHERE classification_commercial_value_index > 1000
ORDER BY classification_commercial_value_index DESC
LIMIT 100;

-- Keywords by journey stage
SELECT kc.value as journey_moment, COUNT(*) as count
FROM keyword_classifications kc
WHERE kc.dimension = 'journey_moment'
GROUP BY kc.value
ORDER BY count DESC;

Cost Tracking

-- Daily cost by service
SELECT 
  date(created_at / 1000, 'unixepoch') as date,
  service,
  SUM(cost_usd) as total_cost,
  COUNT(*) as requests
FROM api_costs
WHERE created_at >= strftime('%s', 'now', '-30 days') * 1000
GROUP BY date, service
ORDER BY date DESC, total_cost DESC;

-- Cost by project
SELECT project_id, SUM(cost_usd) as total_cost
FROM api_costs
WHERE project_id IS NOT NULL
GROUP BY project_id
ORDER BY total_cost DESC;

Classification Taxonomy

-- Get all valid page types
SELECT value, display_name, description
FROM classification_values
WHERE entity_type = 'url' 
  AND dimension = 'page_type'
  AND is_active = 1
ORDER BY sort_order;

-- Get page types for a structural type (cascading filter)
SELECT cv.value, cv.display_name
FROM classification_values cv
WHERE cv.entity_type = 'url'
  AND cv.dimension = 'page_type'
  AND cv.category = 'article'  -- Maps to structural_type
  AND cv.is_active = 1
ORDER BY cv.sort_order;

Migration History

Notable Schema Changes

Migration	Date	Description
0001	Initial	Initial schema with topics and embeddings_meta
0003	Early	Categories and keywords tables
010	Early	Apps table for mobile app metadata
0042	Early	Brands table with SHA-256 hash IDs
0044	Early	Split subscriptions into type-specific tables
0079	Phase 0	Classification columns for OEPS model
0086	Major	Unified domains/urls/brands with INTEGER IDs
0091	Backlinks	Quality tier and domain rank columns
0093	Keywords	Keyword classification dimensions
0097	Tier-1	Simplified 7-archetype system
0102	Confidence	Per-dimension confidence scores
0107	Keywords	Normalized keyword_classifications table
0113	Social	Brand properties for ownership resolution
0116	Social	Unified social_accounts and social_content
0119	Learning	Classification corrections table
0131	Backlinks	referring_domains table
0135	Taxonomy	classification_values master taxonomy
0152-153	Hierarchy	structural_type parent layer for page_type
0148	Performance	Critical missing indexes

Key Architecture Decisions

INTEGER IDs over TEXT hashes (0086): Switched from SHA-256 hashes to auto-increment integers for better join performance and simpler debugging.
Denormalized Classification (0079, 0091): Classification results stored directly on entities for query performance, avoiding expensive joins.
Per-Dimension Confidence (0102): Individual confidence scores per classification dimension enables targeted LLM verification.
Normalized Keyword Classifications (0107): One row per dimension avoids 50+ column explosion on keywords table.
Structural Type Hierarchy (0152): Added parent layer for page_type to enable cascading filters (structural_type -> page_type).
Unified Social Tables (0116): Single schema for all platforms instead of platform-specific tables (youtube_channels, etc.).

Performance Considerations

D1 Limitations

No concurrent writes: D1 uses SQLite, so writes are serialized
Row size limits: Keep JSON columns reasonably sized
Index overhead: Balance query performance vs write cost

Best Practices

Batch inserts: Use INSERT OR IGNORE with multiple values
Partial indexes: Use WHERE clauses on indexes for sparse data
Denormalization: Store frequently-joined data directly on entities
Avoid N+1: Use JOINs instead of multiple queries

Monitoring Queries

-- Table sizes
SELECT name, COUNT(*) as rows 
FROM sqlite_master, (SELECT 1 FROM domains LIMIT 1)
WHERE type = 'table' AND name NOT LIKE 'sqlite_%';

-- Index usage (check with EXPLAIN QUERY PLAN)
EXPLAIN QUERY PLAN
SELECT * FROM urls WHERE domain_id = ? AND page_type = 'blog_post';

Schema Overview​

Table Categories​

ER Diagram​

Core Tables​

domains​

urls​

keywords​

referring_domains​

backlinks​

Classification Tables​

classification_values​

keyword_classifications​

classification_corrections​

Operational Tables​

api_costs​

api_provider_usage​

pipeline_activity​

classification_failures​

crawl_runs​

App Store Tables​

apps​

app_store_categories​

app_category_rankings​

Social & Brand Tables​

social_accounts​

brand_properties​

SERP Tables​

serp_positions​

domain_keyword_rankings​

Index Strategy​

Core Lookup Indexes​

Classification Indexes​

Time-Series Indexes​

Composite Indexes for Common Queries​

Common Queries​

Domain Analysis​

URL Classification​

Backlink Intelligence​

Keyword Analysis​

Cost Tracking​

Classification Taxonomy​

Migration History​

Notable Schema Changes​

Key Architecture Decisions​

Performance Considerations​

D1 Limitations​

Best Practices​

Monitoring Queries​

Related Documentation​

Schema Overview

Table Categories

ER Diagram

Core Tables

domains

urls

keywords

referring_domains

backlinks

Classification Tables

classification_values

keyword_classifications

classification_corrections

Operational Tables

api_costs

api_provider_usage

pipeline_activity

classification_failures

crawl_runs

App Store Tables

apps

app_store_categories

app_category_rankings

Social & Brand Tables

social_accounts

brand_properties

SERP Tables

serp_positions

domain_keyword_rankings

Index Strategy

Core Lookup Indexes

Classification Indexes

Time-Series Indexes

Composite Indexes for Common Queries

Common Queries

Domain Analysis

URL Classification

Backlink Intelligence

Keyword Analysis

Cost Tracking

Classification Taxonomy

Migration History

Notable Schema Changes

Key Architecture Decisions

Performance Considerations

D1 Limitations

Best Practices

Monitoring Queries

Related Documentation