87,089 real domain sales | January 2023 – April 2026 | 21 enriched fields
This dataset contains verified domain name sales collected from daily NameBio sales reports published on the NamePros forum. Each record represents a unique domain sale with the original transaction data enriched with structural, linguistic, and AI-generated classification fields.
The data was collected in a single scraping session covering the full period from 2023-01-01 to 2026-04-02 across 1,184 daily report files. Duplicate sales of the same domain across multiple reports have been removed — each domain appears only once (the first recorded sale).
Note on coverage: A small number of daily reports may have been missing or skipped on the source forum. Coverage is estimated at 99%+ of published reports for the period.
Data source: NameBio daily sales reports published as user-generated content on the NamePros forum. Sales figures represent publicly reported transaction prices. The collection method is fully legal — the source data consists of factual transaction records in publicly accessible forum posts.

    dataset_domains_v1/
    ├── README.md                  ← this file
    ├── requirements.txt           ← Python dependencies for vector DB
    ├── data/
    │   ├── raw/                   ← 1,184 original scraped JSON files (one per day)
    │   │   ├── 2023-01-01.json
    │   │   ├── 2023-01-02.json
    │   │   └── ...
    │   └── domains_enriched.jsonl ← 87,089 enriched domain records (primary dataset)
    └── vector_db/
        ├── chroma/                ← ChromaDB vector database (ready to use)
        └── search_example.py      ← working semantic search example

Each file in data/raw/ corresponds to one daily NameBio report and contains:
- Report date and source URL
- Summary statistics (total volume, average price, top sale of the day)
- Full list of reported sales with domain, TLD, price, and venue
- Breakdowns by TLD type, individual TLD, venue, and domain category
These files contain the unmodified scraped data and serve as the ground truth source for the enriched dataset.
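The deduplication rule described earlier (only the first recorded sale of each domain is kept) can be sketched as follows. This is a minimal illustration, not the pipeline that produced the dataset: the record shape, a dict with `date` and `domain` keys, is an assumption, since the key names used inside the raw files are not documented here.

```python
from typing import Iterable


def keep_first_sale(sales: Iterable[dict]) -> list[dict]:
    """Keep the earliest record per domain (hypothetical 'date'/'domain' keys)."""
    seen: set[str] = set()
    unique = []
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for sale in sorted(sales, key=lambda s: s["date"]):
        name = sale["domain"].lower()
        if name not in seen:
            seen.add(name)
            unique.append(sale)
    return unique


# Made-up sample rows, used only to show the behavior.
sample = [
    {"date": "2023-01-02", "domain": "example.com", "price": 900},
    {"date": "2023-01-01", "domain": "example.com", "price": 1200},
    {"date": "2023-01-02", "domain": "sample.ai", "price": 500},
]
print([(s["domain"], s["price"]) for s in keep_first_sale(sample)])
```

Sorting before filtering means the input does not need to arrive in report order.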
domains_enriched.jsonl

Each line is a valid JSON object representing one domain sale. Fields are divided into two groups: those generated by AI classification and those computed by deterministic code.

    {
      "domain": "vetec.com",
      "price": 13678,
      "industry_category": "Technology",
      "sentiment": "Neutral",
      "brandability_score": 8,
      "keyword_commercial_value": "Medium",
      "tld_keyword_match": 5,
      "use_case": "SaaS/App",
      "venue": "Sedo",
      "sld": "vetec",
      "tld": "com",
      "tld_tier": 1,
      "char_count": 5,
      "contains_number": false,
      "contains_hyphen": false,
      "is_dictionary_word": false,
      "pattern_type": "LLLLL",
      "tld_is_hack": false,
      "hack_word": null,
      "word_count": 2,
      "word_structure": "brandable",
      "language_detected": "Generic",
      "monetization_potential": "High",
      "pronounceability_score": 10
    }

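Because the file holds one JSON object per line, it can be read with the standard library alone. The sketch below is one possible loader; the function names are illustrative, and the default path assumes the script runs from the dataset root shown in the directory layout.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_dataset(path: str = "data/domains_enriched.jsonl") -> list[dict]:
    """Read the whole file into memory (path is relative to the dataset root)."""
    with open(path, encoding="utf-8") as fh:
        return list(iter_records(fh))


# In-memory demonstration using two fields from the example record.
demo_lines = ['{"domain": "vetec.com", "price": 13678}', ""]
records = list(iter_records(demo_lines))
print(records[0]["domain"], records[0]["price"])
```

`iter_records` accepts any iterable of strings, so it works with an open file handle as well as with a list of lines.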
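| Field | Type | Description |
|---|---|---|
| `domain` | string | Full domain name |
| `price` | integer | Sale price in USD |
| `venue` | string | Marketplace where the sale occurred (GoDaddy, Sedo, Afternic, DropCatch, Namecheap, Dynadot, etc.) |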
These fields were generated using GPT-5 nano via batch API. Accuracy varies by domain complexity: short, ambiguous, or non-English domains may have lower classification confidence. The classification prompt used industry-standard domain analysis criteria. In rare cases, the model may have returned a value outside the documented value set for a field.
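| Field | Type | Values | Description |
|---|---|---|---|
| `industry_category` | string | Technology, Finance, Health, Real Estate, Legal, E-commerce, Travel, Food & Beverage, Education, Entertainment, Sports, Automotive, Business, Adult, Crypto, AI & ML, Marketing, Media, Gaming, Generic | Best-fit industry vertical for the domain |
| `sentiment` | string | Positive, Neutral, Negative | Emotional tone implied by the domain name |
| `brandability_score` | integer | 1–10 | How memorable and brand-suitable the domain is. 10 = short, catchy, easy to recall |
| `keyword_commercial_value` | string | High, Medium, Low | Commercial intent and advertiser value of the keywords in the domain |
| `tld_keyword_match` | integer | 1–10 | How well the TLD complements the SLD semantically. 5 = neutral standard TLDs (.com/.net/.org always score 5). 7 = .ai when SLD relates to AI. 9–10 = perfect semantic match |
| `use_case` | string | Business Website, E-commerce, Blog/Media, SaaS/App, Personal Brand, Generic | Most likely intended use of the domain |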
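Since the model may occasionally return a value outside the documented sets, a consumer may want to flag such records before using them. The following is a minimal sketch of that check, not part of the dataset tooling; the allowed values and the 1–10 score range are taken from this README, while the function name is illustrative.

```python
ALLOWED_VALUES = {
    "industry_category": {
        "Technology", "Finance", "Health", "Real Estate", "Legal", "E-commerce",
        "Travel", "Food & Beverage", "Education", "Entertainment", "Sports",
        "Automotive", "Business", "Adult", "Crypto", "AI & ML", "Marketing",
        "Media", "Gaming", "Generic",
    },
    "sentiment": {"Positive", "Neutral", "Negative"},
    "keyword_commercial_value": {"High", "Medium", "Low"},
    "use_case": {
        "Business Website", "E-commerce", "Blog/Media", "SaaS/App",
        "Personal Brand", "Generic",
    },
}
SCORE_FIELDS = ("brandability_score", "tld_keyword_match")  # documented range: 1-10


def invalid_fields(record: dict) -> list[str]:
    """Return names of AI-generated fields whose value is outside the documented set."""
    bad = [name for name, allowed in ALLOWED_VALUES.items()
           if record.get(name) not in allowed]
    for name in SCORE_FIELDS:
        value = record.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not (is_int and 1 <= value <= 10):
            bad.append(name)
    return bad


record = {
    "industry_category": "Technology", "sentiment": "Neutral",
    "brandability_score": 8, "keyword_commercial_value": "Medium",
    "tld_keyword_match": 5, "use_case": "SaaS/App",
}
print(invalid_fields(record))                            # []
print(invalid_fields({**record, "sentiment": "Mixed"}))  # ['sentiment']
```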
These fields are determined algorithmically with high consistency. Accuracy depends on the complexity of the domain pattern.
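| Field | Type | Description |
|---|---|---|
| `sld` | string | Second-level domain: the part before the TLD |
| `tld` | string | Top-level domain without the dot (e.g. com, ai, io) |
| `tld_tier` | integer | TLD tier classification (see TLD Tiers section below) |
| `char_count` | integer | Character count of the SLD only (hyphens included) |
| `contains_number` | boolean | Whether the SLD contains any digit |
| `contains_hyphen` | boolean | Whether the SLD contains a hyphen |
| `is_dictionary_word` | boolean | Whether the full SLD (stripped of hyphens) is a known dictionary word in English, Spanish, German, or French |
| `pattern_type` | string or null | Character pattern using L (letter) and N (digit) notation for SLDs up to 6 characters |
| `tld_is_hack` | boolean | Whether the SLD + TLD together form a real word (domain hack) |
| `hack_word` | string or null | The full word formed if `tld_is_hack` is true |
| `word_count` | integer | Estimated number of words in the SLD, detected via wordninja splitting |
| `word_structure` | string | Structural classification of the SLD (see Word Structure section below) |
| `language_detected` | string | Detected language of the SLD (see Language Detection section below) |
| `monetization_potential` | string | High / Medium / Low, derived from price: High ≥ $5,000, Medium ≥ $500, Low < $500 |
| `pronounceability_score` | integer | 1–10. Algorithmic score based on vowel ratio, consonant cluster length, and total character count. 10 = easy to pronounce, 1 = very hard |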
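Several of the computed fields follow directly from their descriptions. The sketch below derives a few of them for illustration; it is not the code that produced the dataset. Two choices are assumptions: the domain is split at the first dot, and any character in a short SLD that is neither a letter nor a digit is kept unchanged in `pattern_type`.

```python
def pattern_type(sld: str):
    """L/N pattern for SLDs of up to 6 characters, otherwise None."""
    if len(sld) > 6:
        return None
    # Assumption: characters that are neither letters nor digits are kept as-is.
    return "".join("L" if ch.isalpha() else "N" if ch.isdigit() else ch for ch in sld)


def monetization_potential(price: int) -> str:
    """Price thresholds as documented: High >= 5000, Medium >= 500, else Low."""
    if price >= 5000:
        return "High"
    if price >= 500:
        return "Medium"
    return "Low"


def structural_fields(domain: str, price: int) -> dict:
    # Assumption: everything before the first dot is the SLD.
    sld, _, tld = domain.lower().partition(".")
    return {
        "sld": sld,
        "tld": tld,
        "char_count": len(sld),
        "contains_number": any(ch.isdigit() for ch in sld),
        "contains_hyphen": "-" in sld,
        "pattern_type": pattern_type(sld),
        "monetization_potential": monetization_potential(price),
    }


print(structural_fields("vetec.com", 13678))
```

For `vetec.com` at $13,678 the output matches the corresponding values in the example record (`char_count` 5, `pattern_type` "LLLLL", `monetization_potential` "High").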
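| Tier | TLDs | Description |
|---|---|---|
| 1 | | The gold standard |
| 2 | co | Strong established extensions |
| 3 | | Mainstream niche extensions |
| 4 | | Major country code TLDs |
| 5 | Everything else | All other TLDs |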
The word_structure field classifies the structural composition of the SLD:
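| Value | Description | Examples |
|---|---|---|
| `single_word` | A single real dictionary word | |
| `compound_words` | Two or more real words combined or hyphenated | dog-week.com |
| | Known startup prefix + real word | |
| | Real word + morphological suffix | |
| | Real word followed by digits | |
| | Digits followed by a real word | |
| `pure_numeric` | Only digits | |
| | Short (≤5 chars), no digits, not a word, consonant-heavy | |
| | Mixed letters and digits, short pattern | |
| `brandable` | Invented, pronounceable, does not fit other categories | |
| | Unpronounceable, no vowels or near-zero vowel ratio | |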
The language_detected field uses the Lingua library with a minimum confidence threshold. Short SLDs (under 7 characters after stripping digits and hyphens) default to Generic as reliable detection is not possible for short strings.
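| Value | Meaning |
|---|---|
| | SLD detected as English or composed of English words |
| | SLD detected as Spanish |
| | SLD detected as German |
| | SLD detected as French |
| | SLD detected as Chinese (pinyin romanization) |
| | Portuguese, Russian, Japanese, or other detected language |
| `Generic` | Too short to classify, or no language signal detected |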
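The short-SLD rule can be expressed independently of the detector itself. In the sketch below the Lingua call is deliberately left out: `detect` is a hypothetical callable standing in for it, and the label it returns in the usage lines is a placeholder, not a confirmed dataset value.

```python
import re

MIN_LETTERS = 7  # threshold stated above: shorter SLDs default to Generic


def language_or_generic(sld: str, detect=None) -> str:
    """Apply the short-SLD rule, then defer to a detector (hypothetical callable)."""
    letters = re.sub(r"[\d-]", "", sld)  # strip digits and hyphens first
    if len(letters) < MIN_LETTERS or detect is None:
        return "Generic"
    # A detector returning None (no confident match) also falls back to Generic.
    return detect(letters) or "Generic"


print(language_or_generic("vetec"))
print(language_or_generic("web-2024-shop"))
print(language_or_generic("wonderful", detect=lambda text: "English"))
```

The second call returns Generic because only seven characters would be needed and `webshop` has exactly seven letters but no detector is supplied; with a detector, the decision would pass to it.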
The vector_db/chroma/ folder contains a ready-to-use ChromaDB vector database with semantic embeddings for all 87,089 domains.
Embedding model: sentence-transformers/all-MiniLM-L6-v2
Embedding dimensions: 384
Similarity metric: Cosine
Database size: ~387 MB
Each embedding was generated from a concatenated text representation of the domain's semantic fields:
{domain} {industry_category} {use_case} {word_structure} {word_structure_hint} {sentiment} {keyword_commercial_value} {language_detected}
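A text representation following this template could be assembled as shown below. This is a sketch under two assumptions: fields are joined with single spaces, and a field that is missing or empty is skipped. `word_structure_hint` appears in the template but not in the example record, so it is treated as optional here.

```python
EMBEDDING_FIELDS = (
    "domain", "industry_category", "use_case", "word_structure",
    "word_structure_hint", "sentiment", "keyword_commercial_value",
    "language_detected",
)


def embedding_text(record: dict) -> str:
    """Join the semantic fields in template order, skipping missing or empty ones."""
    parts = (str(record.get(name) or "") for name in EMBEDDING_FIELDS)
    return " ".join(part for part in parts if part)


record = {
    "domain": "vetec.com", "industry_category": "Technology",
    "use_case": "SaaS/App", "word_structure": "brandable",
    "sentiment": "Neutral", "keyword_commercial_value": "Medium",
    "language_detected": "Generic",
}
print(embedding_text(record))
# vetec.com Technology SaaS/App brandable Neutral Medium Generic
```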
All metadata fields are stored alongside each vector in ChromaDB and are available for filtering without any additional files.
The vector database is optimized for semantic queries — searching by meaning, industry, use case, and brand feel.
✅ Good semantic queries:

    "AI startup innovative platform"
    "medical health clinic"
    "luxury travel premium brand"
    "finance investment tool"

⚠️ Structural properties (word structure, TLD, numeric patterns) do not perform well as query text. Use where filters instead:

    # Find numeric domains on .com
    search("investment finance", filters={
        "$and": [
            {"word_structure": {"$eq": "pure_numeric"}},
            {"tld": {"$eq": "com"}}
        ]
    })

pip install -r requirements.txt
requirements.txt:

    chromadb
    sentence-transformers


    import chromadb
    from sentence_transformers import SentenceTransformer

    client = chromadb.PersistentClient(path="./vector_db/chroma")
    collection = client.get_collection("domains")
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def search(query, filters=None, top_k=10):
        # Embed the query with the same model used to build the database
        vec = model.encode([query], normalize_embeddings=True).tolist()
        res = collection.query(
            query_embeddings=vec,
            n_results=top_k,
            where=filters,
            include=["metadatas", "distances"]
        )
        # Convert cosine distance to a similarity score (1 - distance)
        return [{**meta, "score": round(1 - dist, 3)}
                for meta, dist in zip(res["metadatas"][0], res["distances"][0])]

    # Semantic search
    results = search("AI startup brandable short domain")

    # With filters (multiple conditions require $and)
    results = search("technology platform", filters={
        "$and": [
            {"tld": {"$eq": "com"}},
            {"price": {"$lte": 5000}}
        ]
    })

Available filter operators: $eq, $ne, $gt, $gte, $lt, $lte
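Because multiple conditions must be wrapped in `$and`, a small helper can build the `where` argument so that single and combined conditions are handled uniformly. `build_where` is a hypothetical convenience function, not part of ChromaDB or of the included example script; it only produces the dictionary shapes already shown above.

```python
OPERATORS = {"$eq", "$ne", "$gt", "$gte", "$lt", "$lte"}


def build_where(**conditions):
    """Build a `where` filter from field=(operator, value) pairs."""
    clauses = []
    for field, (op, value) in conditions.items():
        if op not in OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append({field: {op: value}})
    if not clauses:
        return None          # no filtering
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no wrapper
    return {"$and": clauses}


print(build_where(tld=("$eq", "com")))
print(build_where(tld=("$eq", "com"), price=("$lte", 5000)))
```

The result can be passed as the `filters` argument of the `search` function defined above.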
Running the included vector_db/search_example.py produces results like these:

    === AI startup domains ===
    1. autonomously.ai score=0.603 price=$1000 structure=brandable tld=ai
    2. aiagent.bot score=0.577 price=$1282 structure=brandable tld=bot
    3. nanobot.ai score=0.566 price=$50000 structure=brandable tld=ai
    4. virally.ai score=0.562 price=$1590 structure=brandable tld=ai
    5. reinvention.ai score=0.560 price=$50000 structure=brandable tld=ai

    === Brandable .com under $3000 ===
    1. mementovivere.com score=0.694 price=$1460 structure=brandable tld=com
    2. endsars.com score=0.667 price=$1525 structure=brandable tld=com
    3. isawearthlings.com score=0.667 price=$2090 structure=brandable tld=com

    === Numeric domains .com ===
    1. 1789-1815.com score=0.019 price=$3300 structure=pure_numeric tld=com
    2. 668899.com score=-0.001 price=$1103 structure=pure_numeric tld=com
    3. 23999.com score=-0.002 price=$10500 structure=pure_numeric tld=com

    === High brandability score, .ai tld ===
    1. rankable.ai score=0.373 price=$1324 structure=compound_words tld=ai
    2. trainable.ai score=0.349 price=$1002 structure=brandable tld=ai
    3. nanobot.ai score=0.345 price=$50000 structure=brandable tld=ai

    === Single word domains under $2000 ===
    1. wellness.life score=0.589 price=$1390 structure=single_word tld=life
    2. health.guru score=0.509 price=$1114 structure=single_word tld=guru
    3. workout.me score=0.471 price=$1025 structure=single_word tld=me

Note on numeric domain scores: Near-zero or negative cosine scores for pure_numeric domains are expected, because numeric strings carry no semantic meaning for the embedding model. Always use the word_structure filter to retrieve numeric domains, as shown above.
- A small number of records (under 100) were excluded due to AI batch processing errors that produced malformed JSON output.
- Duplicate sales (same domain appearing in multiple daily reports) have been deduplicated; only the first recorded sale is kept.
- AI-classified fields (industry_category, use_case, sentiment, etc.) were generated using a cost-optimized model. Accuracy is generally good for clear English-language domains and may be lower for short, ambiguous, non-English, or highly niche domains.
- language_detected returns Generic for any SLD shorter than 7 alphabetic characters, as short strings do not provide sufficient signal for reliable language detection.
- A small number of daily reports may be absent if they were not published or were inaccessible at collection time.
You may use this dataset for any purpose — personal, research, or commercial — including using it to train models, build appraisal tools, power search products, or analyze domain market trends.
You may not:
- Resell or redistribute the original files from this archive (the .jsonl, raw JSON files, or the vector database as-is)
- Publish this dataset publicly for free download
You may:
- Build and sell products or services based on this data
- Use the vector database in a commercial production application
- Create and sell derivative datasets (e.g. re-enriched with your own models)
This is v1.0 covering 2023-01-01 through 2026-04-02.
Future updates may include: extended date coverage, re-enrichment with higher-accuracy models, and additional computed fields. Updates are not guaranteed and depend on demand. If an update is released, buyers will receive it free of charge upon request. Notification will be sent where the selling platform allows it.
If you have specific field requests or find systematic classification errors, feel free to reach out.
Dataset compiled and enriched independently. Not affiliated with NameBio or NamePros.