RankDisco Troubleshooting Guide

This guide covers common issues, error messages, and step-by-step resolution procedures for RankDisco operations.

Table of Contents

  1. Common Issues
  2. Classification Failures
  3. Queue Issues
  4. Workflow Failures
  5. API Errors
  6. Performance Issues
  7. Cost Overruns
  8. Debug Endpoints
  9. Logging
  10. Escalation Path

Common Issues

1. Domain Onboarding Stuck in "Pending" State

Symptoms:

  • Domain shows onboard_status = 'pending' for an extended period
  • No backlinks or keywords appearing

Error Messages:

[domain-onboard] DRAIN_MODE active, delaying messages
[domain-onboard] Workflow failed, falling back to queue

Resolution:

  1. Check if DRAIN_MODE is enabled:

    curl https://<worker>/api/admin/queues/status
  2. Verify workflow binding exists in wrangler.toml:

    [[workflows]]
    name = "domain-onboard"
    binding = "DOMAIN_ONBOARD_WORKFLOW"
  3. Force restart the onboarding:

    curl -X POST https://<worker>/api/admin/domains/onboard \
    -H "Content-Type: application/json" \
    -d '{"domain": "example.com"}'
  4. Check workflow status:

    curl https://<worker>/api/admin/workflow/domain-onboard/<workflow_id>

2. URLs Not Classifying

Symptoms:

  • URLs have page_type = NULL
  • classification_source = 'skipped_domain_unavailable'

Error Messages:

[url-classify] URL not found in database: https://... - will retry
[url-classify] Giving up on classify_url after 3 attempts

Resolution:

  1. Check for race conditions (URL not yet in DB when classification queued):

    curl -X POST https://<worker>/api/admin/domains/queue-classification \
    -H "Content-Type: application/json" \
    -d '{"domain": "example.com", "limit": 100}'
  2. Force reclassification:

    curl -X POST https://<worker>/api/admin/domains/reclassify-all \
    -H "Content-Type: application/json" \
    -d '{"domain": "example.com", "page_type": null, "limit": 100}'
  3. Check classification failures:

    curl https://<worker>/api/admin/domains/classification-failures

3. ClickHouse Connection Failures

Symptoms:

  • Harvests complete but data not appearing in analytics
  • Error logs showing connection timeouts

Error Messages:

ClickHouse connection failed: ECONNREFUSED
ClickHouse authentication failed

Resolution:

  1. Verify ClickHouse connectivity:

    curl https://<worker>/test/clickhouse
  2. Check credentials in Wrangler secrets:

    wrangler secret list
    # Verify: CLICKHOUSE_HOST, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD, CLICKHOUSE_DATABASE
  3. Test direct connection:

    curl "https://<CLICKHOUSE_HOST>:8443/?user=<USER>&password=<PASS>&database=<DB>&query=SELECT%201"
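
The `SELECT%201` in the query string above is simply `SELECT 1` URL-encoded. A small illustrative helper (an assumption, not part of the worker; it omits the `user` and `password` parameters shown in the curl) makes the encoding explicit:

```typescript
// Build a ClickHouse HTTP interface URL. The query must be URL-encoded:
// "SELECT 1" becomes "SELECT%201", matching the curl example above.
function clickhouseUrl(host: string, database: string, query: string): string {
  return `https://${host}:8443/?database=${encodeURIComponent(database)}&query=${encodeURIComponent(query)}`;
}
```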

4. DataForSEO Rate Limits

Symptoms:

  • Backlink/keyword fetches failing
  • Costs spiking unexpectedly

Error Messages:

DataForSEO rate limit exceeded
DataForSEO API error: 429

Resolution:

  1. Check current quotas:

    curl https://<worker>/api/admin/costs
  2. Adjust rate limits in Wrangler vars:

    [vars]
    DATAFORSEO_LABS_LIMIT = "100"
    DATAFORSEO_LABS_MAX_REQUESTS = "50"
  3. Reset budget tracking if stuck:

    wrangler kv:key delete --binding=DFS_BUDGETS "daily_budget"

5. Apple iTunes Rate Limiting

Symptoms:

  • App details failing to fetch
  • itms-appss:// URLs appearing in logs

Error Messages:

[app-details] rate limited by iTunes API
iTunes returned 403/429 status

Resolution:

  1. Verify ZenRows API key is set:

    wrangler secret list | grep ZENROWS
  2. Check ZenRows quota:

    curl https://<worker>/api/admin/costs/zenrows
  3. The worker backs off automatically; wait 30-60 minutes for recovery.


6. Cloudflare Images Upload Failures

Symptoms:

  • Icons not uploading for apps/domains
  • Original URLs used as fallback

Error Messages:

CF Images upload failed: 401 Unauthorized
CF Images: missing API token

Resolution:

  1. Verify CF Images secrets:

    wrangler secret list | grep CF_IMAGES
    # Required: CF_IMAGES_ACCOUNT_ID, CF_IMAGES_API_TOKEN, CF_IMAGES_ACCOUNT_HASH
  2. Test the Images API directly:

    curl -X POST "https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/images/v1" \
    -H "Authorization: Bearer <API_TOKEN>" \
    -F "url=https://example.com/icon.png"

7. Vectorize Index Empty or Stale

Symptoms:

  • Classification always falling through to LLM
  • High LLM costs for common domains

Error Messages:

[vectorize] No matches found with score > 0.5
Vectorize index empty or unavailable

Resolution:

  1. Check Vectorize stats:

    curl https://<worker>/api/admin/classifier/stats
  2. Initialize seed examples:

    curl -X POST https://<worker>/api/admin/classifier/init-vectorize
  3. Bulk load from classified domains:

    curl -X POST "https://<worker>/api/admin/classifier/bulk-load-vectorize?min_confidence=0.85&limit=1000"
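
The error messages above mention a 0.5 similarity threshold. As a sketch (the `Match` shape and helper name are assumptions, not the worker's actual code), a Vectorize result is only trusted when the best match clears that score; otherwise classification falls through to the paid stages:

```typescript
// A candidate match returned from a Vectorize query (illustrative shape).
type Match = { pageType: string; score: number };

// Return the highest-scoring match only if it clears the threshold;
// null means "fall through to the next classification stage".
function pickVectorizeMatch(matches: Match[], minScore = 0.5): Match | null {
  const best = matches.reduce<Match | null>(
    (acc, m) => (acc === null || m.score > acc.score ? m : acc),
    null,
  );
  return best && best.score > minScore ? best : null;
}
```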

8. Workflow Timeout Errors

Symptoms:

  • Workflows stuck in "running" state
  • Operations timing out after 2 minutes

Error Messages:

Workflow still running after timeout
Workflow errored: Step execution exceeded timeout

Resolution:

  1. Check workflow status:

    curl https://<worker>/api/admin/workflow/<workflow-type>/<workflow-id>
  2. Workflows have built-in retry logic. Wait for automatic recovery.

  3. For persistent failures, check if external APIs are down:

    • DataForSEO status page
    • Cloudflare status page

9. D1 Database Errors

Symptoms:

  • Insert/update failures
  • "Too many subrequests" errors

Error Messages:

D1_ERROR: too many subrequests
D1_ERROR: UNIQUE constraint failed
D1_ERROR: SQLITE_BUSY

Resolution:

  1. Too many subrequests: D1 batches are limited to roughly 50 statements each. Check batch sizes in code.

  2. UNIQUE constraint: Check for duplicate data:

    SELECT url_hash, COUNT(*) FROM urls GROUP BY url_hash HAVING COUNT(*) > 1;
  3. SQLITE_BUSY: Reduce concurrent writes or implement backoff.
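
The usual fix for "too many subrequests" is to split statement lists into chunks before calling `db.batch()`. A minimal sketch, assuming the ~50-statement ceiling noted above:

```typescript
// Split an array of prepared statements (or anything else) into
// batches of at most `size` so each db.batch() call stays under the limit.
function chunk<T>(items: T[], size = 50): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// usage (sketch): for (const batch of chunk(statements)) await db.batch(batch);
```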


10. Missing Environment Variables

Symptoms:

  • Various "not configured" errors
  • Features silently disabled

Error Messages:

DATAFORSEO_LOGIN not configured
AI binding not configured
VECTORIZE_BACKLINK binding required

Resolution:

  1. Check all required bindings in wrangler.toml:

    [vars]
    # Required vars listed here

    [[kv_namespaces]]
    binding = "LOOKUP_CACHE"

    [[d1_databases]]
    binding = "DB"

    [[queues.producers]]
    binding = "DOMAIN_ONBOARD_QUEUE"

    [[vectorize]]
    binding = "VECTORIZE_BACKLINK"

    [ai]
    binding = "AI"
  2. Verify secrets are set:

    wrangler secret list

Classification Failures

Why URLs Fail to Classify

Classification uses a multi-stage pipeline. Failures can occur at any stage:

| Stage | Cost | Common Failures |
|---|---|---|
| 0. Cache | FREE | Cache miss (expected) |
| 1. Rules | FREE | No matching pattern |
| 2. Vectorize | $0.0001 | Index empty, low similarity |
| 3. Content | $0.001-0.01 | Fetch blocked, timeout |
| 4. LLM | $0.001 | Rate limit, malformed response |
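
The multi-stage pipeline can be sketched as a chain of increasingly expensive classifiers that returns at the first stage to produce a result. The `Stage` and `Result` shapes are assumptions for illustration, not the worker's actual types:

```typescript
// null means "this stage could not classify the URL; try the next one".
type Result = { pageType: string; source: string } | null;
type Stage = { name: string; run: (url: string) => Result };

// Run stages cheapest-first; stop at the first stage that answers.
function classify(url: string, stages: Stage[]): Result {
  for (const stage of stages) {
    const r = stage.run(url);
    if (r) return { ...r, source: stage.name };
  }
  return null; // all stages failed; recorded as a classification failure
}
```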

Debugging Classification

  1. Test single URL classification:

    curl -X POST https://<worker>/api/admin/classifier/classify \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example.com/article", "options": {"force_llm": false}}'
  2. Test rules only (no external calls):

    curl -X POST https://<worker>/api/admin/classifier/classify-rules \
    -H "Content-Type: application/json" \
    -d '{"urls": ["https://example.com/blog/post"]}'
  3. Debug content parsing:

    curl -X POST https://<worker>/api/admin/classifier/debug-content \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example.com"}'
  4. Test LLM classification:

    curl -X POST https://<worker>/api/admin/classifier/classify-llm \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example.com"}'

Classification Source Values

| Source | Meaning |
|---|---|
| rules | Matched by rules engine (FREE) |
| rules_vectorize | Rules + Vectorize similarity |
| llm | LLM was used for classification |
| cache | Retrieved from cache |
| human_feedback | Manually corrected |
| failed | Classification failed permanently |
| skipped_domain_unavailable | Domain fetch failed |

Low Confidence Classifications

Classifications below 65% confidence are flagged for review.

  1. View low confidence URLs:

    curl https://<worker>/api/admin/classifier/url-samples?max_confidence=65
  2. Get classification stats:

    curl https://<worker>/api/admin/classification-stats
  3. Submit correction:

    curl -X POST https://<worker>/api/admin/classifier/corrections \
    -H "Content-Type: application/json" \
    -d '{
      "url_id": 123,
      "is_correct": false,
      "corrected_page_type": "blog_post"
    }'
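
The review threshold and the correction payload above can be sketched together. The `Classification` shape is an assumption based on the endpoints shown; the payload field names follow the curl example:

```typescript
// Illustrative shape of a classified URL record.
type Classification = { urlId: number; pageType: string; confidence: number };

// Flag anything under the 65% review threshold described above.
function needsReview(c: Classification, threshold = 65): boolean {
  return c.confidence < threshold;
}

// Build the body for POST /api/admin/classifier/corrections.
function correctionPayload(c: Classification, correctedType: string) {
  return { url_id: c.urlId, is_correct: false, corrected_page_type: correctedType };
}
```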

Queue Issues

Queue Status Monitoring

curl https://<worker>/api/admin/queues/status

Response interpretation:

{
  "summary": {
    "active_queues": 4,
    "total_messages": 150,
    "processing": 25,
    "failed": 3,
    "rate_per_min": 45
  },
  "queues": [
    {
      "name": "url-classify",
      "messages": 100,
      "status": "processing"
    }
  ]
}
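
A rough reading of the summary fields: pending messages with nothing processing suggests a stuck queue (for example, DRAIN_MODE), while a high failure ratio suggests a poison-message problem. The 10% threshold below is an assumption for illustration:

```typescript
// Subset of the /api/admin/queues/status summary shown above.
type Summary = { total_messages: number; processing: number; failed: number };

function queueHealth(s: Summary): "ok" | "degraded" | "stuck" {
  // Messages waiting but none in flight: processing has stopped.
  if (s.total_messages > 0 && s.processing === 0) return "stuck";
  // More than 10% of messages failing: likely poison messages.
  if (s.total_messages > 0 && s.failed / s.total_messages > 0.1) return "degraded";
  return "ok";
}
```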

Messages Stuck in Queue

Symptoms:

  • Queue message count not decreasing
  • status: "idle" with messages pending

Resolution:

  1. Check if DRAIN_MODE is enabled:

    # Check
    curl https://<worker>/api/admin/config | grep DRAIN_MODE

    # Disable
    wrangler kv:key delete --binding=LOOKUP_CACHE "DRAIN_MODE"
  2. Verify queue consumers are bound:

    [[queues.consumers]]
    queue = "url-classify"
    max_batch_size = 50
    max_batch_timeout = 30
  3. Check for worker errors in logs:

    wrangler tail --format pretty

Dead Letter Queue (DLQ) Overflow

Messages that fail 3 times are moved to DLQ (if configured).

Resolution:

  1. Check DLQ contents via Cloudflare Dashboard

  2. Common causes of repeated failures:

    • Invalid URLs that can't be parsed
    • Domains that permanently block fetches
    • Database constraint violations
  3. Process failed messages manually or purge:

    # Via Cloudflare Dashboard: Queues > [queue-name] > Dead Letter Queue

Retry Exhaustion

The system uses exponential backoff with max 3 retries:

  • Retry 1: 5 seconds delay
  • Retry 2: 10 seconds delay
  • Retry 3: 20 seconds delay
  • After 3 failures: Message ACKed, logged to classification_failures
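
The schedule above is plain exponential backoff: the delay doubles on each attempt from a 5-second base. As a sketch:

```typescript
// Delay before retry `attempt` (1-based): 5s, 10s, 20s, ...
function retryDelaySeconds(attempt: number, baseSeconds = 5): number {
  return baseSeconds * 2 ** (attempt - 1);
}
```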

Check failed classifications:

curl https://<worker>/api/admin/domains/classification-failures

Workflow Failures

Common Workflow Errors

| Error | Cause | Resolution |
|---|---|---|
| Workflow errored: Step execution exceeded timeout | External API slow | Increase timeout or retry |
| Failed to resolve domain_id | Domain not in DB | Ensure domain exists first |
| Workflow still running after timeout | Long-running operation | Wait or check manually |
| DOMAIN_ONBOARD_WORKFLOW not configured | Missing binding | Add to wrangler.toml |

Workflow Recovery

Workflows are durable and automatically resume after failures.

  1. Check workflow status:

    curl https://<worker>/api/admin/workflow/domain-onboard/<id>
  2. View workflow history:

    curl https://<worker>/api/admin/workflow/domain-onboard/<id>/history
  3. Manually trigger retry:

    curl -X POST https://<worker>/api/admin/domains/onboard \
    -d '{"domain": "example.com"}'

DomainOnboardWorkflow Failures

Step failures and remediation:

| Step | Failure | Resolution |
|---|---|---|
| init-state | DB write failed | Check D1 connectivity |
| ensure-domain | Hash collision | Rare; retry usually works |
| parallel-fetch-all | DataForSEO timeout | API may be overloaded |
| store-backlinks | Batch too large | Reduce batch size |
| store-keywords | Constraint violation | Check for duplicates |
| finalize | Status update failed | Retry or manual update |

DomainClassifyWorkflow Failures

Stage progression:

Stage 1 (TLD Rules) -> Stage 2 (Known Domains) -> Stage 3 (Vectorize) -> 
Stage 4 (Homepage) -> Stage 5 (LLM)

If workflow stops early, check which stage failed:

curl https://<worker>/api/admin/classifier/domain?domain=example.com

API Errors

Error Codes Reference

| Status | Error | Meaning | Resolution |
|---|---|---|---|
| 400 | domain is required | Missing required parameter | Include domain in request |
| 400 | AI binding not configured | Missing wrangler.toml config | Add AI binding |
| 400 | VECTORIZE_BACKLINK binding required | Missing Vectorize | Add Vectorize binding |
| 401 | Unauthorized | Missing/invalid auth | Check API keys |
| 404 | Domain not found | Domain not in database | Onboard domain first |
| 404 | URL not found | URL not in database | Check url_id validity |
| 429 | Rate limit exceeded | Too many requests | Implement backoff |
| 500 | D1_ERROR | Database error | Check D1 logs |
| 500 | Workflow failed | Internal workflow error | Check workflow status |

Common API Error Responses

Missing credentials:

{
  "error": "DataForSEO credentials not configured",
  "hint": "Set DATAFORSEO_LOGIN and DATAFORSEO_PASSWORD in wrangler.toml"
}

Invalid URL:

{
  "error": "url is required",
  "status": 400
}

Workflow timeout:

{
  "warning": "Workflow still running after timeout",
  "workflow_id": "abc123",
  "status": "running",
  "status_url": "/api/admin/workflow/domain-classify/abc123"
}

Performance Issues

Slow Classification Responses

Symptoms:

  • Classification taking > 30 seconds
  • Timeouts on batch operations

Diagnosis:

  1. Check which stage is slow:

    curl -X POST https://<worker>/api/admin/classifier/classify \
    -d '{"url": "...", "use_workflow": true}'
    # Response includes stages_run with timing
  2. Check external API latency:

    • DataForSEO Instant Pages: typically 1-3s
    • Workers AI: typically 0.5-2s
    • Vectorize: typically 50-200ms

Resolution:

  1. Skip expensive stages for bulk operations:

    curl -X POST https://<worker>/api/admin/classifier/classify-rules \
    -d '{"urls": [...]}'
  2. Use async mode for batch classification:

    curl -X POST https://<worker>/api/admin/classifier/classify \
    -d '{"url": "...", "async": true}'

High Latency on Queue Processing

Symptoms:

  • Queue backlog growing
  • Processing rate < message arrival rate

Resolution:

  1. Increase batch size (up to 100):

    [[queues.consumers]]
    max_batch_size = 100
  2. Reduce per-message processing time:

    • Enable skip_content: true when SERP data available
    • Use rules-only classification for low-priority URLs
  3. Scale horizontally (multiple consumers):

    • Cloudflare automatically scales queue consumers
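
The symptom "processing rate < message arrival rate" means the backlog only shrinks when consumption outpaces arrivals. A back-of-envelope estimate (illustrative helper, not part of the worker):

```typescript
// Minutes until the backlog clears at the current net rate.
// Infinity means the queue is growing, not draining.
function drainMinutes(backlog: number, processPerMin: number, arrivePerMin: number): number {
  const net = processPerMin - arrivePerMin;
  return net > 0 ? backlog / net : Infinity;
}
```

For example, a 300-message backlog with 60 processed and 40 arriving per minute drains in 15 minutes; flip those rates and it never drains.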

Timeout Handling

Default timeouts:

  • Worker request: 30 seconds
  • Workflow step: 2-10 minutes (configurable)
  • Queue message visibility: 5 minutes

Handling timeouts:

// In workflow
await step.do("my-step", {
  timeout: "5 minutes",
  retries: { limit: 2, delay: "30 seconds" }
}, async () => { ... });

Cost Overruns

Monitoring Costs

Real-time cost tracking:

curl https://<worker>/api/admin/costs

Daily breakdown:

curl https://<worker>/api/admin/costs/daily?days=7

By service:

curl https://<worker>/api/admin/costs/summary

Cost Per Operation

| Service | Operation | Cost |
|---|---|---|
| DataForSEO | Backlinks fetch | ~$0.04/1000 |
| DataForSEO | Keywords fetch | ~$0.03/1000 |
| DataForSEO | Domain summary | ~$0.02/call |
| DataForSEO | Instant Pages | ~$0.000125/page |
| Workers AI | LLM (8B model) | ~$0.0001/call |
| Workers AI | LLM (70B model) | ~$0.0002/call |
| Workers AI | Embeddings | ~$0.00001/call |
| ZenRows | Basic credit | ~$0.000276/credit |
| ZenRows | Premium credit | ~$0.0069/credit |
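
A back-of-envelope model built from the table's figures can help forecast a classification run. This sketch assumes every URL gets an embedding lookup and only some fraction escalates to a content fetch (Instant Pages) plus an 8B LLM call; the breakdown itself is an assumption, not the worker's actual accounting:

```typescript
// Unit costs taken from the table above (USD per call/page).
const UNIT = { instantPage: 0.000125, llm8b: 0.0001, embedding: 0.00001 };

// urls: total URLs; llmShare: fraction (0..1) that needs content + LLM.
function estimateClassificationCost(urls: number, llmShare: number): number {
  return urls * UNIT.embedding + urls * llmShare * (UNIT.instantPage + UNIT.llm8b);
}
```

For example, 1000 URLs with a 20% LLM escalation rate comes to about $0.055.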

Budget Exceeded Handling

Symptoms:

  • Operations silently skipped
  • budget_exceeded: true in logs

Resolution:

  1. Check current budget status:

    curl https://<worker>/api/admin/costs/summary
  2. Reset daily budget (emergency):

    wrangler kv:key delete --binding=DFS_BUDGETS "daily_budget_$(date +%Y-%m-%d)"
  3. Increase budget limits:

    [vars]
    DAILY_COST_LIMIT_USD = "50.00"
    DATAFORSEO_DAILY_BUDGET = "25.00"

Reducing Costs

  1. Use rules-only classification first:

    • Rules engine is FREE
    • Only escalate to LLM when needed
  2. Leverage SERP data for ranking URLs:

    • Avoids $0.001+ content fetch per URL
    • Uses title/description from DataForSEO
  3. Enable Vectorize learning:

    • High-confidence classifications teach the system
    • Reduces LLM calls over time
  4. Set appropriate limits:

    {
      "backlinks_limit": 100,
      "keywords_limit": 100
    }

Debug Endpoints

Classification Debugging

| Endpoint | Method | Purpose |
|---|---|---|
| /api/admin/classifier/classify | POST | Classify single URL (full pipeline) |
| /api/admin/classifier/classify-rules | POST | Rules engine only (FREE) |
| /api/admin/classifier/classify-llm | POST | LLM stage only |
| /api/admin/classifier/debug-content | POST | Debug content fetch |
| /api/admin/classifier/parse-content | POST | Parse page content |
| /api/admin/classifier/test | GET | Test with sample URLs |
| /api/admin/classifier/stats | GET | Index and seed stats |

Domain Debugging

| Endpoint | Method | Purpose |
|---|---|---|
| /api/admin/domains/status?domain=X | GET | Check onboarding status |
| /api/admin/domains/list | GET | List domains with filters |
| /api/admin/domains/classification-failures | GET | View failed classifications |
| /api/admin/domains/recent-onboards | GET | Recent onboarding history |
| /api/admin/domains/properties?domain_id=X | GET | Get brand properties |

Queue Debugging

| Endpoint | Method | Purpose |
|---|---|---|
| /api/admin/queues/status | GET | All queue statuses |
| /api/admin/domains/queue-classification | POST | Queue URLs for classification |
| /api/admin/domains/reclassify-all | POST | Reclassify URLs matching criteria |

Cost Debugging

| Endpoint | Method | Purpose |
|---|---|---|
| /api/admin/costs | GET | Overall cost stats |
| /api/admin/costs/daily | GET | Daily breakdown |
| /api/admin/costs/recent | GET | Recent cost entries |
| /api/admin/costs/zenrows | GET | ZenRows quota status |
| /api/admin/costs/summary | GET | Comprehensive summary |

Workflow Debugging

| Endpoint | Method | Purpose |
|---|---|---|
| /api/admin/workflow/<type>/<id> | GET | Workflow status |
| /api/admin/workflow/<type>/<id>/history | GET | Workflow step history |

Logging

Viewing Logs

Real-time logs:

wrangler tail --format pretty

Filtered by tag:

wrangler tail --format pretty --search "url-classify"

Log Prefixes

| Prefix | Component |
|---|---|
| [url-classify] | URL classification consumer |
| [domain-classify] | Domain classification consumer |
| [domain-onboard] | Domain onboarding consumer |
| [admin-*] | Admin endpoints |
| [workflow] | Workflow execution |
| [cache] | Cache operations |
| [dataforseo] | DataForSEO API calls |

What to Look For

Successful classification:

[url-classify] Classified URL example.com: page_type=blog_post, channel=media (85% confidence, $0.000100)

Failed classification:

[url-classify] Error processing classify_url, retrying (2 left): URL not found in database
[url-classify] Giving up on classify_url after 3 attempts: timeout

Queue processing:

[url-classify] Batch complete: 45/50 succeeded (PARALLEL), 5 target URLs, cost: $0.004500

Workflow progress:

[domain-onboard] Starting onboarding for domain: example.com (id: 123)
[domain-onboard] Fetched 100 backlinks
[domain-onboard] Stored 95 keywords
[domain-onboard] Finalized with status: completed
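
The success log line has a fixed shape, so a small parser can pull confidence and cost out of `wrangler tail` output for spot checks. The regex below matches the example format shown above and is an assumption about the log shape, not a guaranteed contract:

```typescript
// Extract "(NN% confidence, $X.XXXXXX)" from a classification log line.
function parseClassifyLog(line: string): { confidence: number; cost: number } | null {
  const m = line.match(/\((\d+)% confidence, \$([0-9.]+)\)/);
  return m ? { confidence: Number(m[1]), cost: Number(m[2]) } : null;
}
```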

Log Levels

  • info: Normal operations
  • warn: Non-fatal issues (cache misses, retries)
  • error: Failures requiring attention

Escalation Path

Level 1: Self-Service (0-15 minutes)

  1. Check this troubleshooting guide
  2. Verify configuration in wrangler.toml
  3. Check Cloudflare Dashboard for queue/worker status
  4. Review recent logs with wrangler tail

Level 2: Investigation (15-60 minutes)

  1. Use debug endpoints to isolate the issue
  2. Check external service status pages:

    • DataForSEO status page
    • Cloudflare status page
  3. Review recent deployments that may have introduced issues
  4. Check D1/KV/R2 metrics in Cloudflare Dashboard

Level 3: Engineering Escalation

When to escalate:

  • Data loss or corruption suspected
  • Security incident
  • Persistent failures after Level 1-2 resolution attempts
  • Cost overrun > 5x normal daily spend

Information to include:

  1. Timestamp of when issue started
  2. Affected domains/URLs
  3. Error messages from logs
  4. Steps already attempted
  5. Impact assessment (# of affected customers/operations)

Escalation contacts:

  • Primary: On-call engineer (check PagerDuty/Slack)
  • Secondary: Engineering lead
  • Data issues: Data engineering team

Recovery Procedures

Full system restart:

# Redeploy worker
wrangler deploy

# Clear all caches (use with caution)
wrangler kv:bulk delete --binding=LOOKUP_CACHE

# Reset queue (via Cloudflare Dashboard)

Selective recovery:

# Reclassify specific domain's URLs
curl -X POST https://<worker>/api/admin/domains/reclassify-all \
-d '{"domain": "example.com", "limit": 500}'

# Re-onboard specific domain
curl -X POST https://<worker>/api/admin/domains/onboard \
-d '{"domain": "example.com"}'

Quick Reference Card

Health Checks

# Overall system health
curl https://<worker>/test/clickhouse

# Queue health
curl https://<worker>/api/admin/queues/status

# Cost status
curl https://<worker>/api/admin/costs/summary

# Classification stats
curl https://<worker>/api/admin/classification-stats

Emergency Commands

# Enable drain mode (stops all queue processing)
wrangler kv:key put --binding=LOOKUP_CACHE "DRAIN_MODE" "true"

# Disable drain mode
wrangler kv:key delete --binding=LOOKUP_CACHE "DRAIN_MODE"

# Force redeploy
wrangler deploy --force

Useful Queries

-- Find stuck onboardings
SELECT domain, onboard_status, onboard_started_at
FROM domains
WHERE onboard_status = 'pending'
AND onboard_started_at < datetime('now', '-1 hour');

-- Find classification failures
SELECT domain, COUNT(*) as failed_count
FROM classification_failures
GROUP BY domain
ORDER BY failed_count DESC
LIMIT 10;

-- Find unclassified URLs
SELECT domain, COUNT(*) as unclassified
FROM urls
WHERE page_type IS NULL
GROUP BY domain
ORDER BY unclassified DESC
LIMIT 20;