RankDisco Classification Taxonomy
This document is the definitive reference for RankDisco's domain and URL classification system. It covers the complete taxonomy hierarchy, decision rules, mutual exclusivity constraints, and confidence thresholds.
Table of Contents
- Overview
- Tier-1 Archetypes
- Domain Types (Tier-2)
- Structural Types
- Page Types
- Decision Trees
- Mutual Exclusivity Rules
- Confidence Thresholds
- Quality Tiers
- Spam Tiers
Overview
RankDisco uses a hierarchical classification system with two primary axes:
- Domain Classification - Classifies the entire domain/website
  - Tier-1 (Archetype): Universal, immutable category
  - Tier-2 (Domain Type): Specific type within the archetype
- URL Classification - Classifies individual pages
  - Structural Type: What the page DOES functionally
  - Page Type: Specific page category within structural type
Key Principles
- Mutual Exclusivity: Every domain has exactly ONE tier1_type, every page has exactly ONE structural_type
- Exhaustive Coverage: Every domain and page maps to exactly one category at each level
- Orthogonal Dimensions: Quality, trust, freshness, and monetization are separate scoring dimensions, NOT archetypes
Tier-1 Archetypes
Tier-1 answers ONE question: "Where does the user extract primary value?"
These are immutable, mutually exclusive archetypes. A domain has exactly ONE tier1_type.
| Tier-1 | Value Extraction Mode | User Action | Example Domains |
|---|---|---|---|
| platform | USE (interactive tooling) | Log in, do work | Slack, Notion, Zoom, GitHub, NordVPN |
| marketplace | BROWSE + TRANSACT (multi-party) | Search, buy/sell | Amazon, eBay, Yelp, Zillow, Indeed, G2 |
| commerce | TRANSACT (direct purchase) | Add to cart, pay | Nike.com, Apple Store, Warby Parker |
| service | CONTACT/BOOK (conversion) | Get quote, book | Law firms, agencies, dentists, banks |
| information | READ (consumptive) | Read, learn | NYTimes, TechCrunch, Wikipedia, Wirecutter |
| community | PARTICIPATE (UGC-driven) | Post, discuss | Reddit, Discord, Stack Overflow, Quora |
| institutional | TRUST (authority-backed) | Verify, reference | .gov sites, universities, nonprofits, WHO |
| unknown | UNDETERMINED | - | Insufficient signals or confidence < 0.35 |
Tier-1 Decision Rules
Evaluate in priority order. Pick the FIRST match:
- PLATFORM - Users log in and USE a tool/app/dashboard
- MARKETPLACE - Lists items/businesses/people NOT owned by the domain
- COMMERCE - Sells products directly (has cart/checkout)
- SERVICE - Sells services (conversion is contact/quote/booking)
- INFORMATION - Primary value is content/articles/guides (read-only)
- COMMUNITY - User-generated content dominates (participation > consumption)
- INSTITUTIONAL - Government, education, nonprofit, religious authority
- UNKNOWN - Cannot determine (use sparingly; confidence < 0.35, which corresponds to the 0-34 band on the 0-100 scale under Confidence Thresholds)
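The priority-ordered, first-match rule above can be sketched as a small lookup loop. This is a minimal illustration, not RankDisco's implementation: the boolean signal names (`has_login_tool`, `lists_third_party_items`, and so on) are hypothetical placeholders, and how such signals are computed is out of scope here.

```python
from typing import Mapping

# Archetypes in the priority order listed above. The signal names are
# hypothetical; they stand in for whatever detection logic produces them.
TIER1_PRIORITY = [
    ("platform", "has_login_tool"),
    ("marketplace", "lists_third_party_items"),
    ("commerce", "has_cart_checkout"),
    ("service", "sells_services"),
    ("information", "content_is_primary_value"),
    ("community", "ugc_dominates"),
    ("institutional", "is_authority_entity"),
]


def pick_tier1(signals: Mapping[str, bool]) -> str:
    """Return the FIRST archetype whose signal is true, else 'unknown'."""
    for tier1_type, signal_name in TIER1_PRIORITY:
        if signals.get(signal_name, False):
            return tier1_type
    return "unknown"
```

Because evaluation stops at the first match, a domain with both a checkout and a large blog resolves to `commerce`, which keeps Tier-1 mutually exclusive by construction.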
Tier-1 Invariants (Never Violate)
- Tier-1 is MUTUALLY EXCLUSIVE: a domain has exactly ONE tier1_type
- Tier-1 is EXHAUSTIVE: every domain maps to exactly one archetype
- Tier-1 does NOT encode: quality, freshness, trust, monetization, SEO abuse
- Those are ORTHOGONAL AXES, not archetypes
Known Pressure Points
The information archetype is intentionally broad (news, blogs, wikis, affiliate sites). This is correct at Tier-1; differentiation happens at Tier-2. Do NOT split information at Tier-1.
Domain Types (Tier-2)
Tier-2 answers: "What specific kind of [Tier-1] is this?"
Every domain_type has exactly ONE parent tier1_type. A domain_type CANNOT belong to multiple tier1 categories.
Platform Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| saas_product | Business software, productivity | Notion, Slack, Asana |
| code_repository | Source code hosting | GitHub, GitLab, Bitbucket |
| app_platform | App distribution | App Store listing pages |
| documentation_portal | Hosted documentation | GitBook, ReadTheDocs, Docusaurus |
| messaging_platform | Real-time communication | Slack, Discord, Teams |
| social_network | Social networking | LinkedIn, Twitter/X, Facebook |
| audio_platform | Audio streaming | Spotify, Apple Music, SoundCloud |
| video_platform | Video streaming | Netflix, YouTube (as platform) |
Marketplace Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| ecommerce_marketplace | Multi-seller retail | Amazon, eBay, Etsy |
| ticket_marketplace | Event tickets | Ticketmaster, StubHub, SeatGeek |
| real_estate_marketplace | Property listings | Zillow, Realtor, Redfin |
| job_marketplace | Job listings | Indeed, LinkedIn Jobs, Glassdoor |
| service_marketplace | Freelance/gig work | Upwork, Fiverr, TaskRabbit |
| app_marketplace | Software listings | Chrome Web Store, WordPress plugins |
| review_marketplace | Business reviews + leads | G2, Capterra, TrustRadius |
| directory_citation | Business directories | Yellow Pages, Yelp listings |
Commerce Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| ecommerce_store | D2C retail | Nike.com, Warby Parker, Allbirds |
| travel_booking | Travel purchases | Airlines, hotels, car rental |
| subscription_commerce | Recurring product delivery | Dollar Shave Club |
| product_manufacturer | Brand/manufacturer sites | Apple, Samsung, Ford |
Service Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| agency_provider | Marketing, PR, design agencies | Digital agencies, dev shops |
| pr_distribution | Press release wires | PR Newswire, Business Wire |
| professional_service | Professional services | Accounting, consulting firms |
| healthcare_provider | Healthcare services | Clinics, telehealth, dentists |
| financial_service | Financial services | Fintech, insurance brokers |
| legal_service | Legal services | Law firms, legal tech |
Information Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| news_publisher | Journalism organizations | NYTimes, BBC, Reuters |
| magazine_publisher | Magazines/periodicals | Wired, The Atlantic |
| blog_publisher | Blog platforms | Medium, Substack, personal blogs |
| content_publisher | Generic content sites | How-to sites, recipe blogs, guides |
| review_site | Editorial reviews | Wirecutter, CNET reviews |
| affiliate_review_site | Affiliate-driven reviews | NerdWallet, affiliate blogs |
| reference_wiki | Reference/encyclopedia | Wikipedia, Fandom wikis |
Community Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| forum_community | Discussion forums | Reddit, Discourse, phpBB |
| gaming_community | Gaming forums/fan sites | IGN forums, GameFAQs |
| sports_community | Sports fan communities | Team forums, fantasy sports |
| qna_platform | Q&A sites | Stack Overflow, Quora |
| ugc_video | User-generated video | YouTube (as UGC platform) |
Institutional Domain Types
| Domain Type | Description | Examples |
|---|---|---|
| government_site | Government agencies | .gov sites, municipal sites |
| education_academic | Educational institutions | Universities, schools |
| nonprofit_org | Nonprofits/NGOs | Charities, foundations |
| healthcare_institution | Healthcare organizations | Hospital systems, CDC, WHO |
| financial_institution | Financial institutions | Federal Reserve, banks |
| legal_institution | Legal institutions | Courts, bar associations |
| trade_association | Industry associations | IEEE, ACM, trade groups |
Unknown/Risk Domain Types
| Domain Type | Description | Risk Level |
|---|---|---|
| pbn_suspected | Suspected private blog network | High |
| spam_low_quality | Known spam or very low quality | Critical |
| unknown_other | Genuinely cannot classify | N/A |
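The rule that every domain_type has exactly one parent tier1_type can be checked mechanically. The sketch below uses a partial parent map with one or two entries copied from each table above; the function name `validate_domain` and the error messages are illustrative assumptions, not part of RankDisco.

```python
# Partial parent map (domain_type -> tier1_type), copied from the tables above.
# A dict key can hold only one value, so the "exactly ONE parent" rule is
# enforced by the data structure itself.
DOMAIN_TYPE_PARENT = {
    "saas_product": "platform",
    "code_repository": "platform",
    "ecommerce_marketplace": "marketplace",
    "ecommerce_store": "commerce",
    "legal_service": "service",
    "news_publisher": "information",
    "blog_publisher": "information",
    "forum_community": "community",
    "government_site": "institutional",
}


def validate_domain(tier1_type: str, domain_type: str) -> None:
    """Raise ValueError if domain_type is not a valid child of tier1_type."""
    parent = DOMAIN_TYPE_PARENT.get(domain_type)
    if parent is None:
        raise ValueError(f"unrecognized domain_type: {domain_type!r}")
    if parent != tier1_type:
        raise ValueError(
            f"{domain_type!r} belongs to tier1 {parent!r}, not {tier1_type!r}"
        )
```

For example, `validate_domain("information", "news_publisher")` passes silently, while `validate_domain("platform", "news_publisher")` raises.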
Structural Types
Structural type answers: "What is this page DOING functionally?"
This is the parent layer for page_type. Classify structural_type FIRST, then page_type is constrained to valid children.
| Structural Type | Description | Primary Purpose |
|---|---|---|
| article | Long-form written content | Read and learn |
| detail | Shows one specific thing | View product/profile/video/event |
| listing | Shows multiple items | Browse and compare |
| thread | Conversational/sequential UGC | Discuss, Q&A |
| utility | Functional pages | Login, checkout, settings |
| corporate | Company info pages | About, contact, legal, homepage |
| reference | Documentation/help content | Docs, wikis, FAQs, support |
| spam | Low quality/malicious | TERMINAL OVERRIDE - stop classifying |
| unknown | Could not determine | Insufficient signals |
Structural Type Examples
| Structural Type | URL Pattern Examples |
|---|---|
| article | /blog/how-to-..., /news/2024/..., /guide/... |
| detail | /products/widget, /user/john, /watch?v=... |
| listing | /category/electronics, /search?q=..., /r/subreddit |
| thread | /questions/12345, /t/topic/123, /comments/... |
| utility | /login, /checkout, /settings, /download |
| corporate | /about, /contact, /careers, /privacy |
| reference | /docs/api, /wiki/Topic, /faq, /help/article |
| spam | Parked domains, PBN content, malware pages |
Page Types
Page types are organized by category. Each page_type belongs to exactly ONE structural_type parent.
Content Page Types (structural_type: article)
| Page Type | Description | URL Pattern Examples |
|---|---|---|
| article | Generic article | /article/topic-name |
| blog_post | Blog post | /blog/post-title |
| feature_article | In-depth feature | /features/deep-dive |
| guide | Comprehensive guide | /guide/complete-guide-to |
| howto_article | How-to instructions | /how-to/do-something |
| listicle | List-based article | /10-best-tools-for |
| news_article | News coverage | /news/2024/01/story |
| opinion_article | Opinion/editorial | /opinion/my-take-on |
| press_release | Press release | /press-releases/announcement |
| recipe_page | Recipe content | /recipes/chocolate-cake |
| research_article | Research/study | /research/new-findings |
| ugc_article | User-generated article | Medium posts, Substack |
| buying_guide | Purchase guide | /buying-guide/best-laptops |
| review_page | Product/service review | /reviews/product-name |
| case_study_page | Case study | /case-studies/client-success |
Commerce Page Types (mixed structural_types)
| Page Type | Structural Type | Description |
|---|---|---|
| product_page | detail | Individual product |
| app_page | detail | App listing |
| auction_page | detail | Auction listing |
| coupon_page | detail | Coupon/discount |
| deal_page | detail | Deal/offer |
| category_page | listing | Product category |
| comparison_page | listing | Product comparison |
| store_locator | listing | Store finder |
| wishlist_page | listing | Saved items |
| auto_generated_comparison | listing | Programmatic comparison |
| booking_page | utility | Reservation page |
| checkout_page | utility | Cart/checkout |
| landing_page | utility | Marketing landing |
| pricing_page | utility | Pricing plans |
| sales_page | utility | Sales conversion |
Social and Media Page Types (mixed structural_types)
| Page Type | Structural Type | Description |
|---|---|---|
| profile_page | detail | User profile |
| video_page | detail | Video content |
| channel_page | detail | Content channel |
| podcast_episode | detail | Podcast episode |
| event_page | detail | Event details |
| group_page | detail | Group/community |
| live_stream_page | detail | Live streaming |
| gallery_page | listing | Image/media gallery |
| playlist_page | listing | Content playlist |
| events_index | listing | Events list |
| subreddit_index | listing | Subreddit listing |
| forum_thread | thread | Forum discussion |
| qna_page | thread | Q&A thread |
| discussion_thread | thread | General discussion |
| comment_thread | thread | Comment section |
| social_post | thread | Social media post |
| poll_page | utility | Poll/survey |
| invite_page | utility | Invitation page |
Company Page Types (structural_type: corporate)
| Page Type | Description | URL Pattern Examples |
|---|---|---|
| homepage | Site homepage | /, /en/ |
| about_page | About us | /about, /company |
| contact_page | Contact info | /contact, /get-in-touch |
| careers_page | Job listings | /careers, /jobs |
| team_page | Team members | /team, /leadership |
| press_page | Press/media | /press, /newsroom |
| partners_page | Partner info | /partners |
| portfolio_page | Work portfolio | /portfolio, /work |
| brand_page | Brand page | /brand, /brand-assets |
| features_page | Product features | /features |
| service_page | Service details | /services/consulting |
| legal_page | Legal content | /legal |
| legal_privacy_page | Privacy policy | /privacy, /privacy-policy |
| legal_terms_page | Terms of service | /terms, /tos |
| demo_page | Demo request | /demo, /request-demo |
| login_page | User login | /login, /signin |
| signup_page | User registration | /signup, /register |
| settings_page | Account settings | /settings, /account |
Index Page Types (structural_type: listing)
| Page Type | Description | URL Pattern Examples |
|---|---|---|
| category_index_page | Category listing | /blog, /news, /docs |
| archive_page | Content archive | /archive, /tag/topic |
| author_index | Author listing | /authors, /contributors |
| directory_listing | Directory entry | /directory/business |
| search_results_page | Search results | /search?q=term |
| location_page | Location info | /locations/city-name |
| redirect_page | Redirect target | N/A |
Reference Page Types (structural_type: reference)
| Page Type | Description | URL Pattern Examples |
|---|---|---|
| documentation_page | Technical docs | /docs/getting-started |
| api_reference_page | API documentation | /api/endpoints |
| wiki_page | Wiki content | /wiki/Topic |
| faq_page | FAQ content | /faq, /help/faq |
| support_article | Help article | /help/how-to-reset |
| pdf_document | PDF file | *.pdf |
| course_page | Course content | /courses/intro-to-python |
| repository_page | Code repository | /owner/repo |
| tool_page | Interactive tool | /tools/calculator |
| download_page | Download page | /download, /downloads |
Spam/Risk Page Types (structural_type: spam)
| Page Type | Description | Risk Level |
|---|---|---|
| parked_domain_page | Parked domain | High |
| pbn_article_page | PBN content | Critical |
| comment_spam_page | Comment spam | Critical |
| malicious_page | Malware/phishing | Critical |
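Since structural_type is classified first and page_type is then constrained to its valid children, a page_type picker can filter candidates by parent before choosing. The following sketch assumes a hypothetical upstream step that produces a score per candidate page_type; the parent map holds only a handful of entries from the tables above.

```python
from typing import Mapping, Optional

# Partial parent map (page_type -> structural_type), copied from the tables above.
PAGE_TYPE_PARENT = {
    "blog_post": "article",
    "news_article": "article",
    "product_page": "detail",
    "profile_page": "detail",
    "category_page": "listing",
    "search_results_page": "listing",
    "forum_thread": "thread",
    "qna_page": "thread",
    "checkout_page": "utility",
    "about_page": "corporate",
    "documentation_page": "reference",
    "parked_domain_page": "spam",
}


def best_valid_page_type(
    structural_type: str, scored_candidates: Mapping[str, float]
) -> Optional[str]:
    """Pick the highest-scoring page_type whose parent is structural_type.

    The candidate scores are an assumed input; returns None when no
    candidate is a valid child of the given structural_type.
    """
    valid = {
        page_type: score
        for page_type, score in scored_candidates.items()
        if PAGE_TYPE_PARENT.get(page_type) == structural_type
    }
    if not valid:
        return None
    return max(valid, key=lambda page_type: valid[page_type])
```

With `structural_type="thread"`, a higher-scoring `blog_post` candidate is discarded in favor of `qna_page`, because only the latter is a child of `thread`.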
Decision Trees
Domain Classification Decision Tree
START: Evaluate domain
|
v
1. Does user LOG IN to USE a tool/app/dashboard?
YES --> tier1: platform
NO --> continue
|
v
2. Does domain LIST items/businesses/people NOT owned by the domain?
YES --> tier1: marketplace
NO --> continue
|
v
3. Does domain have CART/CHECKOUT (sells products directly)?
YES --> tier1: commerce
NO --> continue
|
v
4. Does domain SELL SERVICES (contact/quote/booking conversion)?
YES --> tier1: service
NO --> continue
|
v
5. Is PRIMARY VALUE content consumption (articles/guides/news)?
YES --> tier1: information
NO --> continue
|
v
6. Does USER-GENERATED CONTENT dominate?
YES --> tier1: community
NO --> continue
|
v
7. Is this a GOVERNMENT, EDUCATION, or NONPROFIT entity?
YES --> tier1: institutional
NO --> continue
|
v
8. Confidence < 0.35 or insufficient signals?
YES --> tier1: unknown
URL Classification Decision Tree
START: Evaluate URL
|
v
1. Check for SPAM signals (PBN, malware, parked)
DETECTED --> structural_type: spam (TERMINAL - stop classifying)
NOT DETECTED --> continue
|
v
2. Is this ROOT PATH (/, /en/, etc.)?
YES --> structural_type: corporate, page_type: homepage
NO --> continue
|
v
3. Match against PLATFORM-SPECIFIC patterns
MATCH --> Use platform pattern (e.g., GitHub repo, YouTube video)
NO MATCH --> continue
|
v
4. Match against GENERIC PATH patterns
MATCH --> Use generic pattern confidence
NO MATCH --> continue
|
v
5. Apply semantic analysis
- Corporate keywords (about, contact, careers) --> corporate
- Article patterns (blog, news, guide) --> article
- Detail patterns (product, user, video) --> detail
- Listing patterns (category, search, archive) --> listing
- Thread patterns (thread, questions, comments) --> thread
- Reference patterns (docs, wiki, faq) --> reference
- Utility patterns (login, checkout, settings) --> utility
|
v
6. Insufficient signals?
YES --> structural_type: unknown
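A condensed sketch of this cascade is shown below. It covers the spam short-circuit (step 1), the root-path check (step 2), the keyword fallback (step 5), and the unknown default (step 6); the regex pattern steps 3 and 4 are left out for brevity. The `spam_detected` flag, the two-letter locale assumption for root paths, and the plural-matching rule are all illustrative choices, not documented behavior.

```python
import re
from urllib.parse import urlsplit

# Keyword groups from step 5, checked in the order listed in the tree.
SEMANTIC_KEYWORDS = [
    ("corporate", ("about", "contact", "careers")),
    ("article", ("blog", "news", "guide")),
    ("detail", ("product", "user", "video")),
    ("listing", ("category", "search", "archive")),
    ("thread", ("thread", "questions", "comments")),
    ("reference", ("docs", "wiki", "faq")),
    ("utility", ("login", "checkout", "settings")),
]

# Root paths such as "/" or "/en/". Treating any two-letter first segment
# as a locale is an assumption made for this sketch.
ROOT_PATH = re.compile(r"^/(?:[a-z]{2}/?)?$")


def classify_structural_type(url: str, spam_detected: bool = False) -> str:
    if spam_detected:
        return "spam"  # terminal: no further classification
    path = urlsplit(url).path or "/"
    if ROOT_PATH.match(path):
        return "corporate"  # page_type would be homepage
    # Steps 3 and 4 (platform-specific and generic patterns) are omitted here.
    segments = [seg for seg in path.lower().split("/") if seg]
    for structural_type, keywords in SEMANTIC_KEYWORDS:
        for keyword in keywords:
            # Match a whole path segment, allowing a simple plural form.
            if keyword in segments or keyword + "s" in segments:
                return structural_type
    return "unknown"
```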
Mutual Exclusivity Rules
Domain Level
| Rule | Description |
|---|---|
| One tier1 per domain | A domain has exactly ONE tier1_type. Never classify as both platform AND information. |
| tier1 constrains domain_type | domain_type must be valid for the assigned tier1_type. A news_publisher cannot have tier1=platform. |
| No cross-tier leakage | Quality/trust signals do NOT determine tier1. Low-quality content is still information, not spam. |
URL Level
| Rule | Description |
|---|---|
| One structural_type per URL | A URL has exactly ONE structural_type. |
| structural_type constrains page_type | page_type must be valid for the assigned structural_type. |
| spam is terminal | Once structural_type: spam is assigned, stop classifying. Structure is irrelevant for spam. |
Prohibited Operations
- Cross-tier leakage: Using quality/trust signals to determine Tier-1
  - WRONG: "This is low-quality content, so it's spam"
  - RIGHT: "This is information/blog_publisher with low quality_score"
- Splitting information at Tier-1: Creating separate Tier-1 for news/blogs
  - WRONG: Adding "media" or "editorial" as Tier-1 types
  - RIGHT: Use Tier-2 domain_types under information
- Industry in Tier-1/2: Using vertical (sports, gaming) as archetype
  - WRONG: "gaming_site" as a Tier-1 or Tier-2 type
  - RIGHT: "community/gaming_community" or "information/blog_publisher" + industry=gaming
- Optional Tier-2: Allowing domains without domain_type
  - WRONG: Storing only tier1_type without domain_type
  - RIGHT: Every classified domain has BOTH tier1_type AND domain_type
Confidence Thresholds
Classification Source Priority
| Source | Priority | Description |
|---|---|---|
| manual | 1 (highest) | Human-curated classification |
| rule | 2 | Rules engine pattern match |
| vector | 3 | Vectorize/embedding similarity |
| llm | 4 | LLM-based classification |
| merged | 5 (lowest) | Merged from multiple sources |
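When several sources have classified the same domain or URL, the priority table suggests a simple selection rule. The sketch below picks the candidate from the highest-priority source; breaking ties within a source by higher confidence is an assumption, as is the candidate record shape (`source`, `label`, `confidence`).

```python
from typing import Mapping, Sequence

# Lower number = higher priority, as in the table above.
SOURCE_PRIORITY = {"manual": 1, "rule": 2, "vector": 3, "llm": 4, "merged": 5}


def pick_classification(candidates: Sequence[Mapping]) -> Mapping:
    """Return the candidate from the highest-priority source.

    Ties within the same source go to the higher confidence
    (an assumed tie-break rule for this sketch).
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    return min(
        candidates,
        key=lambda c: (SOURCE_PRIORITY[c["source"]], -c["confidence"]),
    )
```

Under this rule a `rule` result at confidence 80 wins over an `llm` result at confidence 97, because source priority is compared before confidence.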
Confidence Score Thresholds
| Confidence | Interpretation | Action |
|---|---|---|
| 95-100 | Very High | Use directly |
| 85-94 | High | Use with standard validation |
| 70-84 | Moderate | Use with additional signals |
| 50-69 | Low | Consider LLM fallback |
| 35-49 | Very Low | LLM classification required |
| 0-34 | Insufficient | Classify as unknown |
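The bands above translate directly into a lookup by lower bound. This is a minimal sketch; the function name `confidence_band` is hypothetical, and treating fractional scores (for example 94.5) as belonging to the band whose lower bound they meet is an assumption, since the table lists integer ranges only.

```python
from typing import Tuple

# (lower bound, interpretation, action), highest band first.
CONFIDENCE_BANDS = [
    (95, "Very High", "Use directly"),
    (85, "High", "Use with standard validation"),
    (70, "Moderate", "Use with additional signals"),
    (50, "Low", "Consider LLM fallback"),
    (35, "Very Low", "LLM classification required"),
    (0, "Insufficient", "Classify as unknown"),
]


def confidence_band(score: float) -> Tuple[str, str]:
    """Map a 0-100 confidence score to (interpretation, action)."""
    if not 0 <= score <= 100:
        raise ValueError(f"confidence must be within 0-100, got {score}")
    for lower_bound, interpretation, action in CONFIDENCE_BANDS:
        if score >= lower_bound:
            return interpretation, action
    raise AssertionError("unreachable: the last band starts at 0")
```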
Source-Specific Thresholds
| Source | Trust Threshold | Override Behavior |
|---|---|---|
| Manual curation | Always trusted | Overrides all other sources |
| Known domain database | 95+ confidence | Trusted for domain classification |
| TLD rules (.gov, .edu) | 90+ confidence | Strong domain_type signal |
| Platform URL patterns | 85+ confidence | Trusted for page_type |
| Generic URL patterns | 60-85 confidence | May need LLM confirmation |
| DataForSEO platform type | Variable | Useful for some types, garbage bucket for others |
DataForSEO (DFS) platform types map to domain types as follows:
| DFS Platform Type | Maps To | Notes |
|---|---|---|
| news | news_publisher | Reliable |
| blogs | blog_publisher | Reliable |
| ecommerce | ecommerce_store | Reliable |
| message-boards | forum_community | Reliable |
| wikis | reference_wiki | Reliable |
| social | social_network | Reliable |
| educational | education_academic | Reliable |
| governmental | government_site | Reliable |
| directory | directory_citation | Reliable |
| organization | - | Garbage bucket, needs classifier |
| unknown | - | Garbage bucket, needs classifier |
| cms | - | Garbage bucket, needs classifier |
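The mapping can be expressed as a dictionary lookup that returns nothing for the garbage-bucket types, signalling that another classifier has to decide. The function name and the choice of `None` as the fallback signal are assumptions for this sketch.

```python
from typing import Optional

# Reliable DFS platform types and their domain_type, from the table above.
DFS_TO_DOMAIN_TYPE = {
    "news": "news_publisher",
    "blogs": "blog_publisher",
    "ecommerce": "ecommerce_store",
    "message-boards": "forum_community",
    "wikis": "reference_wiki",
    "social": "social_network",
    "educational": "education_academic",
    "governmental": "government_site",
    "directory": "directory_citation",
}


def map_dfs_platform_type(dfs_type: str) -> Optional[str]:
    """Return the mapped domain_type, or None when a classifier is needed.

    None covers the garbage-bucket types (organization, unknown, cms)
    and any value not listed in the table.
    """
    return DFS_TO_DOMAIN_TYPE.get(dfs_type)
```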
Quality Tiers
Quality tiers measure publisher authority, separate from domain type classification.
| Tier | Domain Rank (DFS) | Description | Examples |
|---|---|---|---|
| tier_1 | 800-1000 | Top 500 DR sites, household names | NYT, Forbes, WSJ |
| tier_2 | 600-799 | DR 60-80, recognized in niche | TechCrunch, Ars Technica |
| tier_3 | 400-599 | DR 40-60, decent but not authoritative | Mid-tier publications |
| tier_4 | 200-399 | DR 20-40, low authority | Small blogs |
| tier_5 | 0-199 | DR < 20, minimal authority | Personal sites |
| unrated | N/A | No domain rank data available | - |
Tier Overrides
Some domains have manual tier overrides regardless of domain rank:
Tier 1 Overrides: nytimes.com, wsj.com, washingtonpost.com, forbes.com, bbc.com, cnn.com, reuters.com, bloomberg.com, theguardian.com, businessinsider.com
Tier 2 Overrides: techcrunch.com, theverge.com, wired.com, arstechnica.com, venturebeat.com, mashable.com, entrepreneur.com, inc.com, fastcompany.com
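Combining the rank ranges with the manual overrides could look like the sketch below, where overrides are checked before the domain rank. The override sets are abbreviated from the lists above, and the function signature is a hypothetical one chosen for illustration.

```python
from typing import Optional

# Abbreviated override sets; the full lists appear above.
TIER_1_OVERRIDES = {"nytimes.com", "wsj.com", "forbes.com", "bbc.com", "reuters.com"}
TIER_2_OVERRIDES = {"techcrunch.com", "theverge.com", "wired.com", "arstechnica.com"}


def quality_tier(domain: str, domain_rank: Optional[int]) -> str:
    """Return the quality tier, letting manual overrides win over rank."""
    if domain in TIER_1_OVERRIDES:
        return "tier_1"
    if domain in TIER_2_OVERRIDES:
        return "tier_2"
    if domain_rank is None:
        return "unrated"  # no DFS domain rank available
    if domain_rank >= 800:
        return "tier_1"
    if domain_rank >= 600:
        return "tier_2"
    if domain_rank >= 400:
        return "tier_3"
    if domain_rank >= 200:
        return "tier_4"
    return "tier_5"
```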
Spam Tiers
Spam tiers are based on backlink_spam_score (0-100) from DataForSEO.
| Tier | Score Range | Description | Recommended Action |
|---|---|---|---|
| toxic | 80-100 | Critical spam signals | Should be disavowed |
| high_risk | 60-79 | Significant spam signals | Likely problematic |
| moderate | 40-59 | Some spam signals | Review recommended |
| low_risk | 20-39 | Minor spam signals | Generally safe |
| clean | 0-19 | No spam signals detected | Safe |
| unknown | N/A | No spam score available | - |
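Because the spam tiers are contiguous ranges, the tier can be found with a binary search over the lower bounds. This sketch uses Python's standard `bisect` module; the function name `spam_tier` and the use of `None` for a missing score are assumptions.

```python
from bisect import bisect_right
from typing import Optional

# Lower bounds of every tier above "clean", in ascending order.
_LOWER_BOUNDS = [20, 40, 60, 80]
_TIERS = ["clean", "low_risk", "moderate", "high_risk", "toxic"]


def spam_tier(backlink_spam_score: Optional[float]) -> str:
    """Map a 0-100 backlink_spam_score to its spam tier."""
    if backlink_spam_score is None:
        return "unknown"  # no spam score available
    # bisect_right counts how many lower bounds the score has reached.
    return _TIERS[bisect_right(_LOWER_BOUNDS, backlink_spam_score)]
```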
URL Pattern Examples
| Pattern Name | Regex | Page Type | Confidence |
|---|---|---|---|
| homepage_root | `^https?://[^/]+/?$` | homepage | 99 |
| pdf_document | `\.pdf(\?.*)?$` | pdf_document | 99 |
| profile_at | `/@[^/]+/?$` | profile_page | 90 |
| blog_post | `/(blog\|blogs)/[^/]+/?$` | blog_post | 75 |
| news_post | `/(news)/[^/]+/?$` | news_article | 75 |
| date_article_full | `/\d{4}/\d{2}/\d{2}/[^/]+` | news_article | 85 |
| product_page | `/(products?)/[^/]+/?$` | product_page | 80 |
| pricing_page | `/(pricing\|plans\|packages)/?$` | pricing_page | 90 |
| about_page | `/(about\|about-us\|company)/?$` | about_page | 85 |
| privacy_policy | `/(privacy\|privacy-policy)` | legal_privacy_page | 95 |
| forum_thread | `/(thread\|topic\|discussion)/[^/]+` | forum_thread | 80 |
| docs_page | `/(docs?\|documentation)/[^/]+` | documentation_page | 80 |
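A few of the generic patterns above can be exercised with Python's `re` module as follows. This is a sketch only: when several patterns match, returning the one with the highest confidence is an assumption about how conflicts are resolved, and only a subset of the table is included.

```python
import re
from typing import Optional, Tuple

# (pattern name, regex, page_type, confidence) rows copied from the table above.
GENERIC_PATTERNS = [
    ("homepage_root", r"^https?://[^/]+/?$", "homepage", 99),
    ("pdf_document", r"\.pdf(\?.*)?$", "pdf_document", 99),
    ("blog_post", r"/(blog|blogs)/[^/]+/?$", "blog_post", 75),
    ("date_article_full", r"/\d{4}/\d{2}/\d{2}/[^/]+", "news_article", 85),
    ("pricing_page", r"/(pricing|plans|packages)/?$", "pricing_page", 90),
]
_COMPILED = [
    (re.compile(regex), page_type, confidence)
    for _name, regex, page_type, confidence in GENERIC_PATTERNS
]


def match_generic(url: str) -> Optional[Tuple[str, int]]:
    """Return (page_type, confidence) of the best matching pattern, or None.

    "Best" here means highest confidence, an assumed conflict rule.
    """
    hits = [
        (confidence, page_type)
        for regex, page_type, confidence in _COMPILED
        if regex.search(url)
    ]
    if not hits:
        return None
    confidence, page_type = max(hits)
    return page_type, confidence
```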
Platform-specific patterns:
| Platform | Domain Match | Pattern | Page Type |
|---|---|---|---|
| Medium | medium.com | /@[^/]+/?$ | profile_page |
| Medium | medium.com | /@[^/]+/[^/]+ | blog_post |
| Substack | *.substack.com | /p/[^/]+ | blog_post |
| GitHub | github.com | /[^/]+/[^/]+/?$ | repository_page |
| GitHub | github.com | /[^/]+/[^/]+/issues/\d+ | forum_thread |
| YouTube | youtube.com | /watch\? | video_page |
| Stack Overflow | stackoverflow.com | /questions/\d+ | qna_page |
| Amazon | amazon.com | /dp/[A-Z0-9]{10} | product_page |
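Platform-specific matching adds a host check in front of the path pattern. The sketch below makes several assumptions that the table does not state: patterns are anchored at the start of the path, a leading `www.` is ignored, `*.substack.com` is treated as a shell-style wildcard, and the more specific GitHub issues pattern is listed before the repository pattern.

```python
import re
from fnmatch import fnmatchcase
from typing import Optional
from urllib.parse import urlsplit

# (domain match, pattern, page_type) rows from the table above.
PLATFORM_PATTERNS = [
    ("medium.com", r"/@[^/]+/?$", "profile_page"),
    ("medium.com", r"/@[^/]+/[^/]+", "blog_post"),
    ("*.substack.com", r"/p/[^/]+", "blog_post"),
    ("github.com", r"/[^/]+/[^/]+/issues/\d+", "forum_thread"),
    ("github.com", r"/[^/]+/[^/]+/?$", "repository_page"),
    ("youtube.com", r"/watch\?", "video_page"),
    ("stackoverflow.com", r"/questions/\d+", "qna_page"),
]


def match_platform(url: str) -> Optional[str]:
    """Return the page_type of the first platform pattern that matches."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]  # assumption: "www." is not significant
    # Re-attach the query so patterns such as /watch\? can see it.
    target = parts.path + ("?" + parts.query if parts.query else "")
    for domain_glob, regex, page_type in PLATFORM_PATTERNS:
        # re.match anchors the pattern at the start of the path.
        if fnmatchcase(host, domain_glob) and re.match(regex, target):
            return page_type
    return None
```

Anchoring at the start of the path matters for GitHub: with an unanchored search, `/owner/repo/issues/12` would also satisfy the repository pattern through its trailing `/issues/12` segments.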
Appendix: Legacy Type Mappings
Some legacy domain types are aliased to current canonical types:
| Legacy Type | Canonical Type |
|---|---|
| ugc_forum_community | forum_community |
| ugc_qna | qna_platform |
| ugc_video_platform | video_platform |
| marketplace_platform | ecommerce_marketplace |
| product_manufacturer_brand | product_manufacturer |
| business_corporate_site | professional_service |
| agency_service_provider | agency_provider |
| service_business | professional_service |
| hospital_system | healthcare_institution |
| guest_post_network | spam_low_quality |
| link_insertion_site | spam_low_quality |
| niche_edit_network | spam_low_quality |
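Alias resolution amounts to a dictionary lookup that falls back to the input when the type is already canonical. A minimal sketch, with the function name `canonical_domain_type` chosen for illustration:

```python
# Legacy -> canonical aliases, copied from the table above.
LEGACY_ALIASES = {
    "ugc_forum_community": "forum_community",
    "ugc_qna": "qna_platform",
    "ugc_video_platform": "video_platform",
    "marketplace_platform": "ecommerce_marketplace",
    "product_manufacturer_brand": "product_manufacturer",
    "business_corporate_site": "professional_service",
    "agency_service_provider": "agency_provider",
    "service_business": "professional_service",
    "hospital_system": "healthcare_institution",
    "guest_post_network": "spam_low_quality",
    "link_insertion_site": "spam_low_quality",
    "niche_edit_network": "spam_low_quality",
}


def canonical_domain_type(domain_type: str) -> str:
    """Resolve a legacy alias; canonical types pass through unchanged."""
    return LEGACY_ALIASES.get(domain_type, domain_type)
```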
Version History
| Version | Date | Changes |
|---|---|---|
| 3.0 | 2024 | Formal tier1/tier2 specification, structural types, spam tiers |
| 2.0 | 2024 | V2 domain database (7,000+ domains), URL pattern engine |
| 1.0 | 2023 | Initial taxonomy with property types |