D1 Database Schema Documentation
This document provides comprehensive documentation for the RankDisco D1 (SQLite) database schema, including all tables, relationships, indexes, and common query patterns.
Schema Overview
RankDisco uses Cloudflare D1 (SQLite) as its primary database. The schema is designed around three core concepts:
- Entity Tables - Core data entities (domains, URLs, keywords, apps, brands)
- Classification Tables - Multi-dimensional classification results with confidence scores
- Operational Tables - Workflow tracking, cost monitoring, and activity history
Table Categories
| Category | Tables | Purpose |
|---|---|---|
| Core Entities | domains, urls, keywords, brands, apps | Primary data entities |
| Backlink Intelligence | backlinks, referring_domains, domain_summaries | Link graph data |
| SERP Data | serp_positions, serp_runs, queries, domain_keyword_rankings | Search ranking data |
| Classification | classification_values, keyword_classifications, classification_corrections | Taxonomy and corrections |
| Social/Brand | social_accounts, social_content, brand_properties | Social media ownership |
| App Store | app_store_categories, app_category_rankings | Mobile app intelligence |
| Operational | api_costs, pipeline_activity, workflow_instances, crawl_runs | System operations |
| Subscriptions | keyword_subscriptions, app_subscriptions, domain_subscriptions | Scheduled tracking |
ER Diagram
Core Tables
domains
The central entity table for all tracked domains. Contains classification results, onboarding status, and aggregated metrics.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
domain | TEXT | NOT NULL, UNIQUE | Domain name (e.g., "example.com") |
root_domain | TEXT | Root domain for subdomains | |
tld | TEXT | Top-level domain (.com, .org, etc.) | |
is_developer | INTEGER | DEFAULT 0 | 1 if this is an app developer domain |
first_seen | INTEGER | NOT NULL | Unix timestamp when first discovered |
last_seen | INTEGER | NOT NULL | Unix timestamp when last seen |
created_at | INTEGER | Creation timestamp | |
updated_at | INTEGER | Last update timestamp |
Classification Columns:
| Column | Type | Description |
|---|---|---|
tier1_type | TEXT | Primary archetype: publisher, ecommerce, saas, marketplace, service_provider, community, government, educational, nonprofit, social_platform |
tier1_confidence | REAL | Confidence score (0-100) |
tier1_signals | TEXT | JSON array of signals used |
domain_type | TEXT | Specific domain type (news, blog, magazine, etc.) |
quality_tier | TEXT | premium, high, medium, low, spam |
channel_bucket | TEXT | organic, paid, social, referral, etc. |
media_type | TEXT | owned, earned, shared, paid |
ownership_type | TEXT | owned, earned, shared, paid |
domain_rank | INTEGER | DataForSEO domain rank (0-100) |
is_high_risk | INTEGER | 1 if flagged as risky |
high_risk_reason | TEXT | Reason for high-risk flag |
Onboarding Columns:
| Column | Type | Description |
|---|---|---|
data_tier | TEXT | customer, discovered, global_sample |
discovery_source | TEXT | app_crawl, manual, serp_crawl, backlink_crawl |
onboard_status | TEXT | pending, complete, failed |
onboard_started_at | INTEGER | Timestamp |
onboard_completed_at | INTEGER | Timestamp |
onboard_duration_ms | INTEGER | Total onboard time |
backlinks_fetched_at | INTEGER | When backlinks were last fetched |
keywords_fetched_at | INTEGER | When keywords were last fetched |
backlinks_count | INTEGER | Denormalized count |
keywords_count | INTEGER | Denormalized count |
Example Data:
INSERT INTO domains (id, domain, tier1_type, quality_tier, domain_rank, data_tier, onboard_status)
VALUES
(1, 'techcrunch.com', 'publisher', 'premium', 85, 'global_sample', 'complete'),
(2, 'shopify.com', 'saas', 'premium', 92, 'customer', 'complete'),
(3, 'reddit.com', 'community', 'premium', 95, 'global_sample', 'complete');
urls
Stores individual URLs with page-level classification and R2 content references.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
url | TEXT | NOT NULL, UNIQUE | Full URL |
domain | TEXT | NOT NULL | Domain name (denormalized) |
domain_id | INTEGER | FK | Reference to domains.id |
url_type | TEXT | URL classification type | |
normalized_url | TEXT | Normalized URL for deduplication | |
first_seen | INTEGER | NOT NULL | First discovery timestamp |
last_seen | INTEGER | NOT NULL | Last seen timestamp |
Classification Columns:
| Column | Type | Description |
|---|---|---|
structural_type | TEXT | Top-level: article, detail, listing, thread, utility, corporate, reference, spam |
structural_type_confidence | INTEGER | Confidence (0-100) |
page_type | TEXT | Specific page type (blog_post, product, category, etc.) |
page_type_confidence | REAL | Confidence score |
tactic_type | TEXT | Marketing tactic (seo_content, pr, affiliate, etc.) |
channel_bucket | TEXT | Traffic channel |
media_type | TEXT | PESO classification |
quality_tier | TEXT | Inherited or URL-specific quality |
domain_rank | INTEGER | Denormalized from domain |
Content Flags:
| Column | Type | Description |
|---|---|---|
has_video | INTEGER | 1 if video present |
has_audio | INTEGER | 1 if audio present |
has_gallery | INTEGER | 1 if image gallery |
user_reviews_present | INTEGER | 1 if user reviews |
affiliate_links_present | INTEGER | 1 if affiliate links |
sponsored_disclosure_present | INTEGER | 1 if sponsored |
paywalled | INTEGER | 1 if paywalled |
thin_content | INTEGER | 1 if thin content |
auto_generated | INTEGER | 1 if programmatic |
R2 Content Storage:
| Column | Type | Description |
|---|---|---|
r2_content_key | TEXT | R2 key for extracted JSON content |
r2_html_key | TEXT | R2 key for raw gzipped HTML |
content_crawled_at | INTEGER | When crawled |
content_source | TEXT | native, zenrows, instant_pages |
content_cost | REAL | Cost of crawl in USD |
Example Data:
INSERT INTO urls (id, url, domain, domain_id, structural_type, page_type, quality_tier)
VALUES
(1, 'https://techcrunch.com/2024/01/15/startup-funding/', 'techcrunch.com', 1, 'article', 'news', 'premium'),
(2, 'https://shopify.com/pricing', 'shopify.com', 2, 'corporate', 'pricing', 'premium'),
(3, 'https://reddit.com/r/startups/comments/abc123', 'reddit.com', 3, 'thread', 'forum_thread', 'high');
keywords
Global keyword table with metrics from DataForSEO and classification dimensions.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | TEXT | PK | Keyword ID (hash or auto) |
keyword | TEXT | NOT NULL, UNIQUE | Keyword text |
project_id | TEXT | Legacy project association | |
metrics | JSON | Raw DataForSEO metrics | |
category_id | TEXT | Category association | |
saved | INTEGER | DEFAULT 0 | User saved flag |
parent_seed | TEXT | Parent keyword if expanded | |
created_at | TEXT | Creation timestamp |
DataForSEO Metrics (via metrics JSON or direct columns):
| Field | Type | Description |
|---|---|---|
search_volume | INTEGER | Monthly search volume |
cpc | REAL | Cost per click |
keyword_difficulty | INTEGER | SEO difficulty (0-100) |
competition | REAL | Competition level (0-1) |
search_intent | TEXT | DataForSEO intent |
monthly_searches | JSON | Array of monthly volumes |
Classification Columns (prefix: classification_):
| Column | Type | Description |
|---|---|---|
classification_funnel_stage | TEXT | awareness, consideration, decision |
classification_intent_type | TEXT | informational, navigational, commercial, transactional |
classification_brand_class | TEXT | brand_own, brand_competitor, generic |
classification_keyword_pattern | TEXT | question, comparison, superlative, etc. |
classification_ranking_difficulty | TEXT | very_easy, easy, medium, hard, very_hard |
classification_commercial_value_index | REAL | Derived from volume * CPC |
Example Data:
INSERT INTO keywords (id, keyword, search_volume, cpc, keyword_difficulty)
VALUES
('kw_abc123', 'best crm software', 12000, 45.50, 72),
('kw_def456', 'how to start a business', 85000, 8.20, 45),
('kw_ghi789', 'shopify vs woocommerce', 8500, 32.00, 58);
referring_domains
Stores referring domain data from DataForSEO backlinks API.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
target_domain | TEXT | NOT NULL | Domain being analyzed |
referring_domain | TEXT | NOT NULL | Domain linking to target |
rank | INTEGER | Domain rank (0-1000 scale) | |
backlinks_count | INTEGER | Number of backlinks from this domain | |
spam_score | INTEGER | Spam score (0-100) |
Link Quality Metrics:
| Column | Type | Description |
|---|---|---|
broken_backlinks | INTEGER | Broken link count |
broken_pages | INTEGER | Broken page count |
referring_domains_count | INTEGER | This domain's referring domains |
referring_ips | INTEGER | Unique IPs |
referring_subnets | INTEGER | Unique subnets |
Distribution Data (JSON):
| Column | Type | Description |
|---|---|---|
links_tld | TEXT | TLD distribution |
links_types | TEXT | Link type distribution |
links_attributes | TEXT | Link attribute distribution |
links_platform_types | TEXT | Platform type distribution |
links_countries | TEXT | Country distribution |
Example Data:
INSERT INTO referring_domains (target_domain, referring_domain, rank, backlinks_count, spam_score)
VALUES
('shopify.com', 'forbes.com', 850, 45, 2),
('shopify.com', 'techcrunch.com', 780, 28, 1),
('shopify.com', 'spamsite.xyz', 50, 1000, 95);
backlinks
Individual backlink records with source/target URL references and classification.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
source_url_id | INTEGER | NOT NULL, FK | URL providing the link |
target_url_id | INTEGER | NOT NULL, FK | URL receiving the link |
referring_domain_id | INTEGER | FK | Source domain |
target_domain_id | INTEGER | FK | Target domain |
anchor_text | TEXT | Link anchor text | |
discovered_at | INTEGER | NOT NULL | First discovery timestamp |
last_seen_at | INTEGER | NOT NULL | Last seen timestamp |
is_dofollow | INTEGER | DEFAULT 1 | 1 if dofollow |
link_strength | REAL | Calculated link strength |
Referring Source Classification (denormalized):
| Column | Type | Description |
|---|---|---|
ref_domain_type | TEXT | Source domain type |
ref_tier1_type | TEXT | Source tier1 archetype |
ref_channel_bucket | TEXT | Source channel |
ref_media_type | TEXT | Source PESO type |
ref_page_type | TEXT | Source page type |
ref_tactic_type | TEXT | Source tactic |
ref_quality_tier | TEXT | Source quality |
ref_domain_rank | INTEGER | Source domain rank |
Risk Signals:
| Column | Type | Description |
|---|---|---|
backlink_spam_score | INTEGER | Spam score (0-100) |
is_broken | INTEGER | 1 if broken |
is_new | INTEGER | 1 if newly discovered |
is_lost | INTEGER | 1 if lost |
risk_score | INTEGER | Computed risk (0-100) |
risk_factors | TEXT | JSON array of factors |
semantic_location | TEXT | header, footer, sidebar, body |
Classification Tables
classification_values
Single source of truth for all valid classification values (taxonomy).
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
entity_type | TEXT | NOT NULL | domain, url, keyword, or all |
dimension | TEXT | NOT NULL | tier1_type, page_type, etc. |
value | TEXT | NOT NULL | The classification value |
display_name | TEXT | Human-readable name | |
description | TEXT | Help text | |
category | TEXT | Grouping category | |
sort_order | INTEGER | DEFAULT 0 | Display order |
is_active | INTEGER | DEFAULT 1 | 1 if active |
Key Dimensions:
Domain Dimensions:
tier1_type: publisher, ecommerce, saas, marketplace, service_provider, community, government, educational, nonprofit, social_platform, search_engine, infrastructuredomain_type: news, blog, magazine, review_site, directory, forum, wiki, q_and_a, video_platform, retail, agency, local_business, university, etc.quality_tier: premium, high, medium, low, spamownership_type: owned, earned, shared, paid
URL Dimensions:
structural_type: article, detail, listing, thread, utility, corporate, reference, spampage_type: blog_post, news, product, category, landing_page, homepage, pricing, documentation, forum_thread, video_page, tool_page, etc.tactic_type: seo_content, thought_leadership, product_marketing, affiliate, ugc, pr, community, programmatic
Keyword Dimensions:
journey_moment: learning, evaluation, validation, action, post_purchaseintent_type: informational, navigational, commercial, transactional, localexpertise_level: novice, intermediate, advanced, expertkeyword_pattern: question, comparison, superlative, problem, use_case, brand_modifier, local_modifier
keyword_classifications
Normalized keyword classifications (one row per dimension per keyword).
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
keyword_id | INTEGER | NOT NULL, FK | Reference to keywords |
dimension | TEXT | NOT NULL | Classification dimension |
value | TEXT | The classification value | |
confidence | REAL | Confidence (0-100) | |
signals | TEXT | JSON with reasoning/evidence | |
source | TEXT | rules, llm, vectorize, hybrid | |
version | INTEGER | DEFAULT 1 | Classifier version |
created_at | INTEGER | Creation timestamp | |
updated_at | INTEGER | Update timestamp |
Unique Constraint: (keyword_id, dimension)
classification_corrections
Tracks user corrections to classifications for learning.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
entity_type | TEXT | NOT NULL | domain, url, keyword |
entity_id | INTEGER | NOT NULL | ID in respective table |
entity_value | TEXT | NOT NULL | Domain/URL/keyword text |
dimension | TEXT | NOT NULL | Which dimension |
original_value | TEXT | What classifier said | |
original_confidence | INTEGER | Confidence (0-100) | |
corrected_value | TEXT | NOT NULL | Correct value |
corrected_by | TEXT | User ID or 'admin' | |
reason | TEXT | Correction reason | |
learning_status | TEXT | DEFAULT 'pending' | pending, learned, skipped |
Operational Tables
api_costs
Tracks API costs per request for budget monitoring.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
run_id | TEXT | FK to crawl_runs | |
project_id | TEXT | FK to projects | |
service | TEXT | NOT NULL | dataforseo, zenrows_basic, etc. |
endpoint | TEXT | Specific endpoint | |
cost_usd | REAL | NOT NULL, DEFAULT 0 | Cost in USD |
request_count | INTEGER | DEFAULT 1 | Number of requests |
success | INTEGER | DEFAULT 1 | 1 = success |
items_returned | INTEGER | Items returned | |
created_at | INTEGER | NOT NULL | Timestamp |
Service Types:
dataforseo- SERP, backlinks, keywords APIszenrows_basic- Basic proxy (~$1/1000)zenrows_premium- Residential proxy (~$10/1000)zenrows_js_render- JS rendering (~$5/1000)rss_feed- Free RSS feeds
api_provider_usage
Daily API quota and cost tracking across all providers.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
date | TEXT | NOT NULL | YYYY-MM-DD |
provider | TEXT | NOT NULL | Provider name |
requests_count | INTEGER | DEFAULT 0 | Request count |
units_used | INTEGER | DEFAULT 0 | API units consumed |
operations | TEXT | JSON breakdown | |
cost_cents | INTEGER | DEFAULT 0 | Cost in cents |
quota_limit | INTEGER | Daily quota | |
quota_remaining | INTEGER | Remaining quota |
pipeline_activity
Stores activity events for bottleneck detection.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
node_id | TEXT | NOT NULL | DAG node ID |
event_type | TEXT | NOT NULL | started, completed, failed, queued |
run_id | TEXT | Workflow run ID | |
domain | TEXT | Domain being processed | |
latency_ms | INTEGER | Time taken | |
error_message | TEXT | Error details | |
meta | TEXT | JSON metadata | |
created_at | INTEGER | NOT NULL | Timestamp |
classification_failures
Logs URLs and keywords that failed classification.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
failure_type | TEXT | NOT NULL | url or keyword |
url_id | INTEGER | For URL failures | |
url | TEXT | URL text | |
domain | TEXT | Domain | |
keyword_id | INTEGER | For keyword failures | |
keyword | TEXT | Keyword text | |
owner_domain_id | INTEGER | FK | Domain being onboarded |
error_message | TEXT | Error details | |
retry_count | INTEGER | DEFAULT 0 | Retry attempts |
created_at | INTEGER | NOT NULL | Timestamp |
crawl_runs
Tracks crawl job execution status.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | TEXT | PK | Run ID |
platform | TEXT | NOT NULL | apple, google_play |
device | TEXT | Device type | |
started_at | INTEGER | NOT NULL | Start timestamp |
completed_at | INTEGER | Completion timestamp | |
status | TEXT | NOT NULL, DEFAULT 'pending' | pending, in_progress, completed, failed |
total_categories | INTEGER | NOT NULL | Categories to crawl |
categories_completed | INTEGER | DEFAULT 0 | Completed count |
categories_failed | INTEGER | DEFAULT 0 | Failed count |
total_apps_found | INTEGER | DEFAULT 0 | Apps discovered |
total_cost_usd | REAL | DEFAULT 0 | Total cost |
config | TEXT | JSON configuration |
App Store Tables
apps
Canonical app metadata table.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | TEXT | PK | App ID |
platform | TEXT | NOT NULL, DEFAULT 'apple' | apple or google |
bundle_id | TEXT | Bundle identifier | |
track_id | TEXT | Store track ID | |
title | TEXT | App title | |
subtitle | TEXT | App subtitle | |
developer_name | TEXT | Developer name | |
developer_id | TEXT | Developer ID | |
brand_id_int | INTEGER | FK | Brand reference |
primary_category | TEXT | Main category | |
categories | JSON | All categories | |
rating | REAL | Average rating | |
rating_count | INTEGER | Number of ratings | |
price | REAL | Price | |
is_free | INTEGER | DEFAULT 0 | 1 if free |
website_url | TEXT | Developer website | |
website_domain_id | INTEGER | FK | Domain reference |
app_store_categories
Apple and Google Play category definitions.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | TEXT | PK | Category ID |
platform | TEXT | NOT NULL | apple or google |
type | TEXT | NOT NULL | grouping, room, chart, category |
name | TEXT | NOT NULL | Category name |
url | TEXT | Category URL | |
section | TEXT | apps, games, charts | |
metadata | JSON | Additional data |
app_category_rankings
App positions within categories over time.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | TEXT | PK | Ranking ID |
app_id | TEXT | NOT NULL | App reference |
category_id | TEXT | NOT NULL | Category reference |
platform | TEXT | NOT NULL | apple or google |
year_week | TEXT | Week identifier (YYYY-WNN) | |
app_rank | INTEGER | NOT NULL | Position in category |
category_position | INTEGER | Category shelf position | |
shelf_name | TEXT | Shelf/grouping name | |
check_ts | INTEGER | NOT NULL | Check timestamp |
brand_id_int | INTEGER | FK | Brand reference |
Social & Brand Tables
social_accounts
Unified account table for all social platforms.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
platform | TEXT | NOT NULL | youtube, x, instagram, etc. |
account_id | TEXT | NOT NULL | Platform-specific ID |
handle | TEXT | @handle | |
display_name | TEXT | Display name | |
profile_url | TEXT | Profile URL | |
domain_id | INTEGER | FK | Owning domain |
brand_id | INTEGER | FK | Owning brand |
follower_count | INTEGER | Followers | |
content_count | INTEGER | Content items |
Unique Constraint: (platform, account_id)
brand_properties
Official brand presences on external platforms.
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
domain_id | INTEGER | FK | Domain owner |
brand_id | TEXT | FK | Brand owner |
platform | TEXT | NOT NULL | youtube, x, facebook, etc. |
property_url | TEXT | NOT NULL | Full URL |
property_id | TEXT | Platform ID (@handle, UCxxx) | |
canonical_id | TEXT | Stable canonical ID | |
status | TEXT | DEFAULT 'pending' | pending, verified, rejected |
confidence | REAL | LLM confidence | |
user_verified | INTEGER | DEFAULT 0 | 1 if user confirmed |
SERP Tables
serp_positions
SERP ranking positions (denormalized for query performance).
| Column | Type | Constraints | Description |
|---|---|---|---|
run_id | TEXT | NOT NULL, PK (with position) | SERP run reference |
query_id | TEXT | NOT NULL | Query reference |
project_id | TEXT | NOT NULL | Project reference |
keyword_id | TEXT | NOT NULL | Keyword reference |
check_ts | INTEGER | NOT NULL | Check timestamp |
position | INTEGER | NOT NULL | SERP position |
domain | TEXT | NOT NULL | Ranking domain |
domain_id | INTEGER | FK | Domain reference |
url | TEXT | NOT NULL | Ranking URL |
title | TEXT | SERP title | |
description | TEXT | SERP description | |
result_type | TEXT | organic, featured_snippet, etc. |
domain_keyword_rankings
Tracks which keywords a domain ranks for (from DataForSEO).
| Column | Type | Constraints | Description |
|---|---|---|---|
id | INTEGER | PK, AUTOINCREMENT | Internal ID |
domain | TEXT | NOT NULL | Domain being tracked |
keyword_id | TEXT | NOT NULL, FK | Keyword reference |
year_week | TEXT | NOT NULL | Week snapshot (YYYY-WNN) |
check_ts | INTEGER | NOT NULL | Check timestamp |
rank_absolute | INTEGER | Position in full SERP | |
rank_group | INTEGER | Position in organic only | |
ranking_url | TEXT | URL that ranks | |
etv | REAL | Estimated traffic value | |
is_featured_snippet | INTEGER | DEFAULT 0 | 1 if featured |
location_code | INTEGER | DEFAULT 2840 | Location (US) |
Unique Constraint: (domain, keyword_id, year_week, location_code)
Index Strategy
Core Lookup Indexes
-- Domain lookups (most frequent)
CREATE INDEX idx_domains_domain ON domains(domain);
CREATE INDEX idx_domains_tier1_type ON domains(tier1_type);
CREATE INDEX idx_domains_quality_tier ON domains(quality_tier);
-- URL lookups
CREATE INDEX idx_urls_domain_id ON urls(domain_id);
CREATE INDEX idx_urls_page_type ON urls(page_type);
CREATE INDEX idx_urls_structural_type ON urls(structural_type);
-- Keyword lookups
CREATE UNIQUE INDEX idx_keywords_keyword_unique ON keywords(keyword);
Classification Indexes
-- Find low-confidence items for review
CREATE INDEX idx_domains_low_confidence ON domains(domain_type_confidence)
WHERE domain_type_confidence < 70;
CREATE INDEX idx_urls_low_page_type_confidence ON urls(page_type_confidence)
WHERE page_type_confidence < 70;
-- Find unclassified URLs
CREATE INDEX idx_urls_unclassified ON urls(domain_id)
WHERE page_type IS NULL;
Time-Series Indexes
-- Recent data queries
CREATE INDEX idx_urls_created_at ON urls(created_at DESC);
CREATE INDEX idx_backlinks_created_at ON backlinks(created_at DESC);
CREATE INDEX idx_serp_check_ts ON serp_positions(check_ts DESC);
-- Domain keyword rankings time series
CREATE INDEX idx_dkr_domain_week ON domain_keyword_rankings(domain, year_week DESC);
Composite Indexes for Common Queries
-- SERP position lookups
CREATE INDEX idx_serp_kw_proj_ts ON serp_positions(keyword_id, project_id, check_ts DESC);
-- Backlink quality filtering
CREATE INDEX idx_backlinks_quality_media ON backlinks(ref_quality_tier, ref_media_type);
CREATE INDEX idx_backlinks_channel_quality ON backlinks(ref_channel_bucket, ref_quality_tier);
-- App rankings
CREATE INDEX idx_acr_year_week_cat ON app_category_rankings(year_week DESC, category_id);
Common Queries
Domain Analysis
-- Get domain with all classification details
SELECT
id, domain, tier1_type, domain_type, quality_tier, domain_rank,
onboard_status, data_tier, backlinks_count, keywords_count
FROM domains
WHERE domain = 'example.com';
-- Find high-quality publisher domains
SELECT domain, domain_rank, backlinks_count
FROM domains
WHERE tier1_type = 'publisher'
AND quality_tier IN ('premium', 'high')
AND domain_rank > 70
ORDER BY domain_rank DESC
LIMIT 100;
-- Domains needing classification review
SELECT domain, tier1_type, tier1_confidence
FROM domains
WHERE tier1_confidence < 70 OR needs_review = 1
ORDER BY tier1_confidence ASC;
URL Classification
-- Get URLs for a domain by structural type
SELECT url, structural_type, page_type, classification_confidence
FROM urls
WHERE domain_id = ?
ORDER BY structural_type, page_type;
-- Count page types for a domain
SELECT page_type, COUNT(*) as count
FROM urls
WHERE domain_id = ?
GROUP BY page_type
ORDER BY count DESC;
-- Find unclassified URLs
SELECT u.url, d.domain
FROM urls u
JOIN domains d ON u.domain_id = d.id
WHERE u.page_type IS NULL
LIMIT 1000;
Backlink Intelligence
-- Get backlink breakdown by quality tier
SELECT ref_quality_tier, COUNT(*) as count
FROM backlinks
WHERE target_domain_id = ?
GROUP BY ref_quality_tier
ORDER BY count DESC;
-- Find high-value earned media backlinks
SELECT b.*, u.url as source_url
FROM backlinks b
JOIN urls u ON b.source_url_id = u.id
WHERE b.target_domain_id = ?
AND b.ref_media_type = 'earned'
AND b.ref_quality_tier IN ('premium', 'high')
ORDER BY b.ref_domain_rank DESC
LIMIT 100;
-- Risky backlinks needing review
SELECT source_url_id, anchor_text, risk_score, risk_factors
FROM backlinks
WHERE target_domain_id = ?
AND (risk_score > 70 OR is_broken = 1)
ORDER BY risk_score DESC;
Keyword Analysis
-- Get keyword with all classifications
SELECT k.*, kc.dimension, kc.value, kc.confidence
FROM keywords k
LEFT JOIN keyword_classifications kc ON k.id = kc.keyword_id
WHERE k.keyword = 'best crm software';
-- High-value keywords by commercial index
SELECT keyword, search_volume, cpc, classification_commercial_value_index
FROM keywords
WHERE classification_commercial_value_index > 1000
ORDER BY classification_commercial_value_index DESC
LIMIT 100;
-- Keywords by journey stage
SELECT kc.value as journey_moment, COUNT(*) as count
FROM keyword_classifications kc
WHERE kc.dimension = 'journey_moment'
GROUP BY kc.value
ORDER BY count DESC;
Cost Tracking
-- Daily cost by service
SELECT
date(created_at / 1000, 'unixepoch') as date,
service,
SUM(cost_usd) as total_cost,
COUNT(*) as requests
FROM api_costs
WHERE created_at >= strftime('%s', 'now', '-30 days') * 1000
GROUP BY date, service
ORDER BY date DESC, total_cost DESC;
-- Cost by project
SELECT project_id, SUM(cost_usd) as total_cost
FROM api_costs
WHERE project_id IS NOT NULL
GROUP BY project_id
ORDER BY total_cost DESC;
Classification Taxonomy
-- Get all valid page types
SELECT value, display_name, description
FROM classification_values
WHERE entity_type = 'url'
AND dimension = 'page_type'
AND is_active = 1
ORDER BY sort_order;
-- Get page types for a structural type (cascading filter)
SELECT cv.value, cv.display_name
FROM classification_values cv
WHERE cv.entity_type = 'url'
AND cv.dimension = 'page_type'
AND cv.category = 'article' -- Maps to structural_type
AND cv.is_active = 1
ORDER BY cv.sort_order;
Migration History
Notable Schema Changes
| Migration | Date | Description |
|---|---|---|
| 0001 | Initial | Initial schema with topics and embeddings_meta |
| 0003 | Early | Categories and keywords tables |
| 010 | Early | Apps table for mobile app metadata |
| 0042 | Early | Brands table with SHA-256 hash IDs |
| 0044 | Early | Split subscriptions into type-specific tables |
| 0079 | Phase 0 | Classification columns for OEPS model |
| 0086 | Major | Unified domains/urls/brands with INTEGER IDs |
| 0091 | Backlinks | Quality tier and domain rank columns |
| 0093 | Keywords | Keyword classification dimensions |
| 0097 | Tier-1 | Simplified 7-archetype system |
| 0102 | Confidence | Per-dimension confidence scores |
| 0107 | Keywords | Normalized keyword_classifications table |
| 0113 | Social | Brand properties for ownership resolution |
| 0116 | Social | Unified social_accounts and social_content |
| 0119 | Learning | Classification corrections table |
| 0131 | Backlinks | referring_domains table |
| 0135 | Taxonomy | classification_values master taxonomy |
| 0152-153 | Hierarchy | structural_type parent layer for page_type |
| 0148 | Performance | Critical missing indexes |
Key Architecture Decisions
-
INTEGER IDs over TEXT hashes (0086): Switched from SHA-256 hashes to auto-increment integers for better join performance and simpler debugging.
-
Denormalized Classification (0079, 0091): Classification results stored directly on entities for query performance, avoiding expensive joins.
-
Per-Dimension Confidence (0102): Individual confidence scores per classification dimension enables targeted LLM verification.
-
Normalized Keyword Classifications (0107): One row per dimension avoids 50+ column explosion on keywords table.
-
Structural Type Hierarchy (0152): Added parent layer for page_type to enable cascading filters (structural_type -> page_type).
-
Unified Social Tables (0116): Single schema for all platforms instead of platform-specific tables (youtube_channels, etc.).
Performance Considerations
D1 Limitations
- No concurrent writes: D1 uses SQLite, so writes are serialized
- Row size limits: Keep JSON columns reasonably sized
- Index overhead: Balance query performance vs write cost
Best Practices
- Batch inserts: Use
INSERT OR IGNOREwith multiple values - Partial indexes: Use
WHEREclauses on indexes for sparse data - Denormalization: Store frequently-joined data directly on entities
- Avoid N+1: Use JOINs instead of multiple queries
Monitoring Queries
-- Table sizes
SELECT name, COUNT(*) as rows
FROM sqlite_master, (SELECT 1 FROM domains LIMIT 1)
WHERE type = 'table' AND name NOT LIKE 'sqlite_%';
-- Index usage (check with EXPLAIN QUERY PLAN)
EXPLAIN QUERY PLAN
SELECT * FROM urls WHERE domain_id = ? AND page_type = 'blog_post';