RankDisco Troubleshooting Guide
This guide covers common issues, error messages, and step-by-step resolution procedures for RankDisco operations.
Table of Contents
- Common Issues
- Classification Failures
- Queue Issues
- Workflow Failures
- API Errors
- Performance Issues
- Cost Overruns
- Debug Endpoints
- Logging
- Escalation Path
Common Issues
1. Domain Onboarding Stuck in "Pending" State
Symptoms:
- Domain shows `onboard_status = 'pending'` for an extended period
- No backlinks or keywords appearing
Error Messages:
```
[domain-onboard] DRAIN_MODE active, delaying messages
[domain-onboard] Workflow failed, falling back to queue
```
Resolution:
1. Check whether DRAIN_MODE is enabled:
   ```bash
   curl https://<worker>/api/admin/queues/status
   ```
2. Verify the workflow binding exists in `wrangler.toml`:
   ```toml
   [[workflows]]
   name = "domain-onboard"
   binding = "DOMAIN_ONBOARD_WORKFLOW"
   ```
3. Force-restart the onboarding:
   ```bash
   curl -X POST https://<worker>/api/admin/domains/onboard \
     -H "Content-Type: application/json" \
     -d '{"domain": "example.com"}'
   ```
4. Check the workflow status:
   ```bash
   curl https://<worker>/api/admin/workflow/domain-onboard/<workflow_id>
   ```
2. URLs Not Classifying
Symptoms:
- URLs have `page_type = NULL`
- `classification_source = 'skipped_domain_unavailable'`

Error Messages:
```
[url-classify] URL not found in database: https://... - will retry
[url-classify] Giving up on classify_url after 3 attempts
```
Resolution:
1. Check for race conditions (URL not yet in the DB when classification was queued) by re-queuing:
   ```bash
   curl -X POST https://<worker>/api/admin/domains/queue-classification \
     -H "Content-Type: application/json" \
     -d '{"domain": "example.com", "limit": 100}'
   ```
2. Force reclassification:
   ```bash
   curl -X POST https://<worker>/api/admin/domains/reclassify-all \
     -H "Content-Type: application/json" \
     -d '{"domain": "example.com", "page_type": null, "limit": 100}'
   ```
3. Check classification failures:
   ```bash
   curl https://<worker>/api/admin/domains/classification-failures
   ```
3. ClickHouse Connection Failures
Symptoms:
- Harvests complete but data not appearing in analytics
- Error logs showing connection timeouts
Error Messages:
```
ClickHouse connection failed: ECONNREFUSED
ClickHouse authentication failed
```
Resolution:
1. Verify ClickHouse connectivity:
   ```bash
   curl https://<worker>/test/clickhouse
   ```
2. Check credentials in Wrangler secrets:
   ```bash
   wrangler secret list
   # Verify: CLICKHOUSE_HOST, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD, CLICKHOUSE_DATABASE
   ```
3. Test a direct connection:
   ```bash
   curl "https://<CLICKHOUSE_HOST>:8443/?user=<USER>&password=<PASS>&database=<DB>&query=SELECT%201"
   ```
4. DataForSEO Rate Limits
Symptoms:
- Backlink/keyword fetches failing
- Costs spiking unexpectedly
Error Messages:
```
DataForSEO rate limit exceeded
DataForSEO API error: 429
```
Resolution:
1. Check current quotas:
   ```bash
   curl https://<worker>/api/admin/costs
   ```
2. Adjust rate limits in Wrangler vars:
   ```toml
   [vars]
   DATAFORSEO_LABS_LIMIT = "100"
   DATAFORSEO_LABS_MAX_REQUESTS = "50"
   ```
3. Reset budget tracking if it is stuck:
   ```bash
   wrangler kv:key delete --binding=DFS_BUDGETS "daily_budget"
   ```
5. Apple iTunes Rate Limiting
Symptoms:
- App details failing to fetch
- `itms-appss://` URLs appearing in logs

Error Messages:
```
[app-details] rate limited by iTunes API
iTunes returned 403/429 status
```
Resolution:
1. Verify the ZenRows API key is set:
   ```bash
   wrangler secret list | grep ZENROWS
   ```
2. Check the ZenRows quota:
   ```bash
   curl https://<worker>/api/admin/costs/zenrows
   ```
3. The worker backs off automatically; wait 30-60 minutes for recovery.
6. Cloudflare Images Upload Failures
Symptoms:
- Icons not uploading for apps/domains
- Original URLs used as fallback
Error Messages:
```
CF Images upload failed: 401 Unauthorized
CF Images: missing API token
```
Resolution:
1. Verify the CF Images secrets:
   ```bash
   wrangler secret list | grep CF_IMAGES
   # Required: CF_IMAGES_ACCOUNT_ID, CF_IMAGES_API_TOKEN, CF_IMAGES_ACCOUNT_HASH
   ```
2. Test the Images API directly:
   ```bash
   curl -X POST "https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/images/v1" \
     -H "Authorization: Bearer <API_TOKEN>" \
     -F "url=https://example.com/icon.png"
   ```
7. Vectorize Index Empty or Stale
Symptoms:
- Classification always falling through to LLM
- High LLM costs for common domains
Error Messages:
```
[vectorize] No matches found with score > 0.5
Vectorize index empty or unavailable
```
Resolution:
1. Check Vectorize stats:
   ```bash
   curl https://<worker>/api/admin/classifier/stats
   ```
2. Initialize seed examples:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/init-vectorize
   ```
3. Bulk-load from classified domains (quote the URL so the shell doesn't interpret `&`):
   ```bash
   curl -X POST "https://<worker>/api/admin/classifier/bulk-load-vectorize?min_confidence=0.85&limit=1000"
   ```
8. Workflow Timeout Errors
Symptoms:
- Workflows stuck in "running" state
- Operations timing out after 2 minutes
Error Messages:
```
Workflow still running after timeout
Workflow errored: Step execution exceeded timeout
```
Resolution:
1. Check the workflow status:
   ```bash
   curl https://<worker>/api/admin/workflow/<workflow-type>/<workflow-id>
   ```
2. Workflows have built-in retry logic; wait for automatic recovery.
3. For persistent failures, check whether external APIs are down:
   - DataForSEO status page
   - Cloudflare status page
9. D1 Database Errors
Symptoms:
- Insert/update failures
- "Too many subrequests" errors
Error Messages:
```
D1_ERROR: too many subrequests
D1_ERROR: UNIQUE constraint failed
D1_ERROR: SQLITE_BUSY
```
Resolution:
1. Too many subrequests: batch operations are limited to roughly 50 statements per batch; check batch sizes in code.
2. UNIQUE constraint: check for duplicate data:
   ```sql
   SELECT url_hash, COUNT(*) FROM urls GROUP BY url_hash HAVING COUNT(*) > 1;
   ```
3. SQLITE_BUSY: reduce concurrent writes or implement backoff.
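To stay under the ~50-statement batch limit, large writes can be chunked before being handed to D1. A minimal sketch (the `chunk` helper and the `statements` array in the usage comment are illustrative, not part of the RankDisco codebase):

```typescript
// Split an array of prepared statements into D1-safe batches.
// D1's practical limit is ~50 statements per batch() call,
// so default to 50 and adjust down if statements are large.
function chunk<T>(items: T[], size = 50): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical usage inside a Worker:
//   for (const group of chunk(statements)) {
//     await env.DB.batch(group);
//   }
```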
10. Missing Environment Variables
Symptoms:
- Various "not configured" errors
- Features silently disabled
Error Messages:
```
DATAFORSEO_LOGIN not configured
AI binding not configured
VECTORIZE_BACKLINK binding required
```
Resolution:
1. Check all required bindings in `wrangler.toml`:
   ```toml
   [vars]
   # Required vars listed here

   [[kv_namespaces]]
   binding = "LOOKUP_CACHE"

   [[d1_databases]]
   binding = "DB"

   [[queues.producers]]
   binding = "DOMAIN_ONBOARD_QUEUE"

   [[vectorize]]
   binding = "VECTORIZE_BACKLINK"

   [ai]
   binding = "AI"
   ```
2. Verify secrets are set:
   ```bash
   wrangler secret list
   ```
Classification Failures
Why URLs Fail to Classify
Classification uses a multi-stage pipeline. Failures can occur at any stage:
| Stage | Cost | Common Failures |
|---|---|---|
| 0. Cache | FREE | Cache miss (expected) |
| 1. Rules | FREE | No matching pattern |
| 2. Vectorize | $0.0001 | Index empty, low similarity |
| 3. Content | $0.001-0.01 | Fetch blocked, timeout |
| 4. LLM | $0.001 | Rate limit, malformed response |
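The table's cheapest-first ordering implies a fallthrough: each stage runs only if the previous one failed to produce a confident answer, and cost accrues per stage actually run. A sketch of that control flow (stage names, the result shape, and the confidence cutoff are illustrative):

```typescript
interface StageResult { pageType: string | null; confidence: number }
interface Stage { name: string; cost: number; run: (url: string) => StageResult }

// Try stages cheapest-first; return the first result that clears
// the confidence bar, accumulating cost for every stage executed.
function classifyUrl(url: string, stages: Stage[], minConfidence = 0.65) {
  let cost = 0;
  for (const stage of stages) {
    cost += stage.cost;
    const result = stage.run(url);
    if (result.pageType && result.confidence >= minConfidence) {
      return { ...result, source: stage.name, cost };
    }
  }
  return { pageType: null, confidence: 0, source: "failed", cost };
}
```

The key property to preserve when debugging: a result's `cost` should never include stages that were skipped because an earlier stage answered confidently.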
Debugging Classification
1. Test single-URL classification:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/classify \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com/article", "options": {"force_llm": false}}'
   ```
2. Test rules only (no external calls):
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/classify-rules \
     -H "Content-Type: application/json" \
     -d '{"urls": ["https://example.com/blog/post"]}'
   ```
3. Debug content parsing:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/debug-content \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'
   ```
4. Test LLM classification:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/classify-llm \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'
   ```
Classification Source Values
| Source | Meaning |
|---|---|
| `rules` | Matched by rules engine (FREE) |
| `rules_vectorize` | Rules + Vectorize similarity |
| `llm` | LLM was used for classification |
| `cache` | Retrieved from cache |
| `human_feedback` | Manually corrected |
| `failed` | Classification failed permanently |
| `skipped_domain_unavailable` | Domain fetch failed |
Low Confidence Classifications
Classifications below 65% confidence are flagged for review.
1. View low-confidence URLs:
   ```bash
   curl https://<worker>/api/admin/classifier/url-samples?max_confidence=65
   ```
2. Get classification stats:
   ```bash
   curl https://<worker>/api/admin/classification-stats
   ```
3. Submit a correction:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/corrections \
     -H "Content-Type: application/json" \
     -d '{
       "url_id": 123,
       "is_correct": false,
       "corrected_page_type": "blog_post"
     }'
   ```
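When triaging a batch of results client-side, the same 65% cutoff can be applied to split auto-accepted classifications from those needing human review. A sketch (the record shape is assumed, not the service's actual response schema):

```typescript
interface Classified { url_id: number; page_type: string; confidence: number }

// Partition results into auto-accepted vs needs-human-review,
// mirroring the service's 65% review threshold.
function triage(results: Classified[], threshold = 65) {
  return {
    accepted: results.filter(r => r.confidence >= threshold),
    review: results.filter(r => r.confidence < threshold),
  };
}
```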
Queue Issues
Queue Status Monitoring
```bash
curl https://<worker>/api/admin/queues/status
```
Response interpretation:
```json
{
  "summary": {
    "active_queues": 4,
    "total_messages": 150,
    "processing": 25,
    "failed": 3,
    "rate_per_min": 45
  },
  "queues": [
    {
      "name": "url-classify",
      "messages": 100,
      "status": "processing"
    }
  ]
}
```
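A quick sanity check on that response is to divide the backlog by the processing rate to estimate time-to-drain. A sketch using the field names from the sample above (the helper itself is illustrative):

```typescript
interface QueueSummary { total_messages: number; rate_per_min: number }

// Rough minutes until the backlog clears, assuming the current rate
// holds and no new messages arrive. Infinity means the queue is stalled.
function minutesToDrain(s: QueueSummary): number {
  if (s.rate_per_min <= 0) return Infinity;
  return s.total_messages / s.rate_per_min;
}
```

With the sample numbers (150 messages at 45/min) the backlog should clear in a few minutes; a growing estimate across successive polls indicates a stuck or under-provisioned consumer.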
Messages Stuck in Queue
Symptoms:
- Queue message count not decreasing
- `status: "idle"` with messages pending

Resolution:
1. Check whether DRAIN_MODE is enabled:
   ```bash
   # Check
   curl https://<worker>/api/admin/config | grep DRAIN_MODE
   # Disable
   wrangler kv:key delete --binding=LOOKUP_CACHE "DRAIN_MODE"
   ```
2. Verify queue consumers are bound:
   ```toml
   [[queues.consumers]]
   queue = "url-classify"
   max_batch_size = 50
   max_batch_timeout = 30
   ```
3. Check for worker errors in logs:
   ```bash
   wrangler tail --format pretty
   ```
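A consumer that honors the DRAIN_MODE flag might look roughly like the sketch below. The KV key name matches the commands above and the handler shape follows Cloudflare's queue consumer API; the rest is illustrative, not the actual RankDisco implementation:

```typescript
// KV stores DRAIN_MODE as the string "true" while draining is active.
function drainActive(flag: string | null): boolean {
  return flag === "true";
}

// Sketch of a queue consumer that re-delivers the whole batch
// instead of processing it while drain mode is on:
//
//   export default {
//     async queue(batch: MessageBatch, env: Env) {
//       if (drainActive(await env.LOOKUP_CACHE.get("DRAIN_MODE"))) {
//         batch.retryAll({ delaySeconds: 60 }); // try again later
//         return;
//       }
//       // ...normal processing...
//     },
//   };
```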
Dead Letter Queue (DLQ) Overflow
Messages that fail 3 times are moved to DLQ (if configured).
Resolution:
1. Check DLQ contents via the Cloudflare Dashboard.
2. Common causes of repeated failures:
   - Invalid URLs that can't be parsed
   - Domains that permanently block fetches
   - Database constraint violations
3. Process failed messages manually or purge them via the Cloudflare Dashboard: Queues > [queue-name] > Dead Letter Queue.
Retry Exhaustion
The system uses exponential backoff with a maximum of 3 retries:
- Retry 1: 5 second delay
- Retry 2: 10 second delay
- Retry 3: 20 second delay
- After 3 failures: the message is ACKed and logged to `classification_failures`

Check failed classifications:
```bash
curl https://<worker>/api/admin/domains/classification-failures
```
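The 5s/10s/20s schedule is plain exponential doubling from a 5-second base. As a sketch (the helper name is illustrative):

```typescript
// Delay before retry n (1-based): 5 seconds, doubled per attempt.
// With the 3-retry cap this yields 5s, 10s, 20s.
function retryDelaySeconds(attempt: number): number {
  return 5 * 2 ** (attempt - 1);
}

const schedule = [1, 2, 3].map(retryDelaySeconds); // [5, 10, 20]
```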
Workflow Failures
Common Workflow Errors
| Error | Cause | Resolution |
|---|---|---|
| `Workflow errored: Step execution exceeded timeout` | External API slow | Increase timeout or retry |
| `Failed to resolve domain_id` | Domain not in DB | Ensure the domain exists first |
| `Workflow still running after timeout` | Long-running operation | Wait or check manually |
| `DOMAIN_ONBOARD_WORKFLOW not configured` | Missing binding | Add to `wrangler.toml` |
Workflow Recovery
Workflows are durable and automatically resume after failures.
1. Check the workflow status:
   ```bash
   curl https://<worker>/api/admin/workflow/domain-onboard/<id>
   ```
2. View the workflow history:
   ```bash
   curl https://<worker>/api/admin/workflow/domain-onboard/<id>/history
   ```
3. Manually trigger a retry:
   ```bash
   curl -X POST https://<worker>/api/admin/domains/onboard \
     -d '{"domain": "example.com"}'
   ```
DomainOnboardWorkflow Failures
Step failures and remediation:
| Step | Failure | Resolution |
|---|---|---|
| `init-state` | DB write failed | Check D1 connectivity |
| `ensure-domain` | Hash collision | Rare; a retry usually works |
| `parallel-fetch-all` | DataForSEO timeout | API may be overloaded |
| `store-backlinks` | Batch too large | Reduce the batch size |
| `store-keywords` | Constraint violation | Check for duplicates |
| `finalize` | Status update failed | Retry or update manually |
DomainClassifyWorkflow Failures
Stage progression:
```
Stage 1 (TLD Rules) -> Stage 2 (Known Domains) -> Stage 3 (Vectorize) ->
Stage 4 (Homepage) -> Stage 5 (LLM)
```
If the workflow stops early, check which stage failed:
```bash
curl https://<worker>/api/admin/classifier/domain?domain=example.com
```
API Errors
Error Codes Reference
| Status | Error | Meaning | Resolution |
|---|---|---|---|
| 400 | domain is required | Missing required parameter | Include domain in request |
| 400 | AI binding not configured | Missing wrangler.toml config | Add AI binding |
| 400 | VECTORIZE_BACKLINK binding required | Missing Vectorize | Add Vectorize binding |
| 401 | Unauthorized | Missing/invalid auth | Check API keys |
| 404 | Domain not found | Domain not in database | Onboard domain first |
| 404 | URL not found | URL not in database | Check URL_ID validity |
| 429 | Rate limit exceeded | Too many requests | Implement backoff |
| 500 | D1_ERROR | Database error | Check D1 logs |
| 500 | Workflow failed | Internal workflow error | Check workflow status |
Common API Error Responses
Missing credentials:
```json
{
  "error": "DataForSEO credentials not configured",
  "hint": "Set DATAFORSEO_LOGIN and DATAFORSEO_PASSWORD in wrangler.toml"
}
```
Invalid URL:
```json
{
  "error": "url is required",
  "status": 400
}
```
Workflow timeout:
```json
{
  "warning": "Workflow still running after timeout",
  "workflow_id": "abc123",
  "status": "running",
  "status_url": "/api/admin/workflow/domain-classify/abc123"
}
```
Performance Issues
Slow Classification Responses
Symptoms:
- Classification taking > 30 seconds
- Timeouts on batch operations
Diagnosis:
1. Check which stage is slow:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/classify \
     -d '{"url": "...", "use_workflow": true}'
   # Response includes stages_run with timing
   ```
2. Check external API latency:
   - DataForSEO Instant Pages: typically 1-3s
   - Workers AI: typically 0.5-2s
   - Vectorize: typically 50-200ms

Resolution:
1. Skip expensive stages for bulk operations:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/classify-rules \
     -d '{"urls": [...]}'
   ```
2. Use async mode for batch classification:
   ```bash
   curl -X POST https://<worker>/api/admin/classifier/classify \
     -d '{"url": "...", "async": true}'
   ```
High Latency on Queue Processing
Symptoms:
- Queue backlog growing
- Processing rate < message arrival rate
Resolution:
1. Increase the batch size (up to 100):
   ```toml
   [[queues.consumers]]
   max_batch_size = 100
   ```
2. Reduce per-message processing time:
   - Enable `skip_content: true` when SERP data is available
   - Use rules-only classification for low-priority URLs
3. Scale horizontally (multiple consumers):
   - Cloudflare automatically scales queue consumers
Timeout Handling
Default timeouts:
- Worker request: 30 seconds
- Workflow step: 2-10 minutes (configurable)
- Queue message visibility: 5 minutes
Handling timeouts:
```typescript
// In a workflow
await step.do("my-step", {
  timeout: "5 minutes",
  retries: { limit: 2, delay: "30 seconds" }
}, async () => { /* ... */ });
```
Cost Overruns
Monitoring Costs
Real-time cost tracking:
```bash
curl https://<worker>/api/admin/costs
```
Daily breakdown:
```bash
curl https://<worker>/api/admin/costs/daily?days=7
```
By service:
```bash
curl https://<worker>/api/admin/costs/summary
```
Cost Per Operation
| Service | Operation | Cost |
|---|---|---|
| DataForSEO | Backlinks fetch | ~$0.04/1000 |
| DataForSEO | Keywords fetch | ~$0.03/1000 |
| DataForSEO | Domain summary | ~$0.02/call |
| DataForSEO | Instant Pages | ~$0.000125/page |
| Workers AI | LLM (8B model) | ~$0.0001/call |
| Workers AI | LLM (70B model) | ~$0.0002/call |
| Workers AI | Embeddings | ~$0.00001/call |
| ZenRows | Basic credit | ~$0.000276/credit |
| ZenRows | Premium credit | ~$0.0069/credit |
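Those per-unit rates make back-of-envelope forecasting straightforward. A sketch using the table's DataForSEO prices (the helper and its argument names are illustrative):

```typescript
// Per-unit DataForSEO rates from the table above (USD).
const RATES = {
  backlinksPer1000: 0.04,
  keywordsPer1000: 0.03,
  domainSummaryPerCall: 0.02,
};

// Estimated DataForSEO spend for onboarding one domain:
// backlinks + keywords fetches plus one domain-summary call.
function estimateOnboardCost(backlinks: number, keywords: number): number {
  return (
    (backlinks / 1000) * RATES.backlinksPer1000 +
    (keywords / 1000) * RATES.keywordsPer1000 +
    RATES.domainSummaryPerCall
  );
}
```

For example, onboarding with 1000 backlinks and 1000 keywords comes to roughly $0.09 before any classification costs.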
Budget Exceeded Handling
Symptoms:
- Operations silently skipped
- `budget_exceeded: true` in logs

Resolution:
1. Check the current budget status:
   ```bash
   curl https://<worker>/api/admin/costs/summary
   ```
2. Reset the daily budget (emergency only):
   ```bash
   wrangler kv:key delete --binding=DFS_BUDGETS "daily_budget_$(date +%Y-%m-%d)"
   ```
3. Increase budget limits:
   ```toml
   [vars]
   DAILY_COST_LIMIT_USD = "50.00"
   DATAFORSEO_DAILY_BUDGET = "25.00"
   ```
Reducing Costs
1. Use rules-only classification first:
   - The rules engine is FREE
   - Only escalate to the LLM when needed
2. Leverage SERP data for ranking URLs:
   - Avoids the $0.001+ content fetch per URL
   - Uses title/description from DataForSEO
3. Enable Vectorize learning:
   - High-confidence classifications teach the system
   - Reduces LLM calls over time
4. Set appropriate limits:
   ```json
   {
     "backlinks_limit": 100,
     "keywords_limit": 100
   }
   ```
Debug Endpoints
Classification Debugging
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/classifier/classify` | POST | Classify a single URL (full pipeline) |
| `/api/admin/classifier/classify-rules` | POST | Rules engine only (FREE) |
| `/api/admin/classifier/classify-llm` | POST | LLM stage only |
| `/api/admin/classifier/debug-content` | POST | Debug content fetch |
| `/api/admin/classifier/parse-content` | POST | Parse page content |
| `/api/admin/classifier/test` | GET | Test with sample URLs |
| `/api/admin/classifier/stats` | GET | Index and seed stats |
Domain Debugging
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/domains/status?domain=X` | GET | Check onboarding status |
| `/api/admin/domains/list` | GET | List domains with filters |
| `/api/admin/domains/classification-failures` | GET | View failed classifications |
| `/api/admin/domains/recent-onboards` | GET | Recent onboarding history |
| `/api/admin/domains/properties?domain_id=X` | GET | Get brand properties |
Queue Debugging
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/queues/status` | GET | All queue statuses |
| `/api/admin/domains/queue-classification` | POST | Queue URLs for classification |
| `/api/admin/domains/reclassify-all` | POST | Reclassify URLs matching criteria |
Cost Debugging
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/costs` | GET | Overall cost stats |
| `/api/admin/costs/daily` | GET | Daily breakdown |
| `/api/admin/costs/recent` | GET | Recent cost entries |
| `/api/admin/costs/zenrows` | GET | ZenRows quota status |
| `/api/admin/costs/summary` | GET | Comprehensive summary |
Workflow Debugging
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/workflow/<type>/<id>` | GET | Workflow status |
| `/api/admin/workflow/<type>/<id>/history` | GET | Workflow step history |
Logging
Viewing Logs
Real-time logs:
```bash
wrangler tail --format pretty
```
Filtered by tag:
```bash
wrangler tail --format pretty --search "url-classify"
```
Log Prefixes
| Prefix | Component |
|---|---|
| `[url-classify]` | URL classification consumer |
| `[domain-classify]` | Domain classification consumer |
| `[domain-onboard]` | Domain onboarding consumer |
| `[admin-*]` | Admin endpoints |
| `[workflow]` | Workflow execution |
| `[cache]` | Cache operations |
| `[dataforseo]` | DataForSEO API calls |
What to Look For
Successful classification:
```
[url-classify] Classified URL example.com: page_type=blog_post, channel=media (85% confidence, $0.000100)
```
Failed classification:
```
[url-classify] Error processing classify_url, retrying (2 left): URL not found in database
[url-classify] Giving up on classify_url after 3 attempts: timeout
```
Queue processing:
```
[url-classify] Batch complete: 45/50 succeeded (PARALLEL), 5 target URLs, cost: $0.004500
```
Workflow progress:
```
[domain-onboard] Starting onboarding for domain: example.com (id: 123)
[domain-onboard] Fetched 100 backlinks
[domain-onboard] Stored 95 keywords
[domain-onboard] Finalized with status: completed
```
Log Levels
- `info`: Normal operations
- `warn`: Non-fatal issues (cache misses, retries)
- `error`: Failures requiring attention
Escalation Path
Level 1: Self-Service (0-15 minutes)
- Check this troubleshooting guide
- Verify the configuration in `wrangler.toml`
- Check the Cloudflare Dashboard for queue/worker status
- Review recent logs with `wrangler tail`
Level 2: Investigation (15-60 minutes)
- Use debug endpoints to isolate the issue
- Check external service status pages (DataForSEO, Cloudflare)
- Review recent deployments that may have introduced issues
- Check D1/KV/R2 metrics in the Cloudflare Dashboard
Level 3: Engineering Escalation
When to escalate:
- Data loss or corruption suspected
- Security incident
- Persistent failures after Level 1-2 resolution attempts
- Cost overrun > 5x normal daily spend
Information to include:
- Timestamp of when issue started
- Affected domains/URLs
- Error messages from logs
- Steps already attempted
- Impact assessment (# of affected customers/operations)
Escalation contacts:
- Primary: On-call engineer (check PagerDuty/Slack)
- Secondary: Engineering lead
- Data issues: Data engineering team
Recovery Procedures
Full system restart:
```bash
# Redeploy the worker
wrangler deploy

# Clear all caches (use with caution)
wrangler kv:bulk delete --binding=LOOKUP_CACHE

# Reset queues (via the Cloudflare Dashboard)
```
Selective recovery:
```bash
# Reclassify a specific domain's URLs
curl -X POST https://<worker>/api/admin/domains/reclassify-all \
  -d '{"domain": "example.com", "limit": 500}'

# Re-onboard a specific domain
curl -X POST https://<worker>/api/admin/domains/onboard \
  -d '{"domain": "example.com"}'
```
Quick Reference Card
Health Checks
```bash
# Overall system health
curl https://<worker>/test/clickhouse

# Queue health
curl https://<worker>/api/admin/queues/status

# Cost status
curl https://<worker>/api/admin/costs/summary

# Classification stats
curl https://<worker>/api/admin/classification-stats
```
Emergency Commands
```bash
# Enable drain mode (stops all queue processing)
wrangler kv:key put --binding=LOOKUP_CACHE "DRAIN_MODE" "true"

# Disable drain mode
wrangler kv:key delete --binding=LOOKUP_CACHE "DRAIN_MODE"

# Force redeploy
wrangler deploy --force
```
Useful Queries
```sql
-- Find stuck onboardings
SELECT domain, onboard_status, onboard_started_at
FROM domains
WHERE onboard_status = 'pending'
  AND onboard_started_at < datetime('now', '-1 hour');

-- Find classification failures
SELECT domain, COUNT(*) AS failed_count
FROM classification_failures
GROUP BY domain
ORDER BY failed_count DESC
LIMIT 10;

-- Find unclassified URLs
SELECT domain, COUNT(*) AS unclassified
FROM urls
WHERE page_type IS NULL
GROUP BY domain
ORDER BY unclassified DESC
LIMIT 20;
```