Domain Sales Dataset: 87K Real Transactions + AI Enrichment + Vector Search (2023–2026)
Domain Sales Dataset v1.0
87,089 real domain sales | January 2023 – April 2026 | 21 enriched fields
Overview
This dataset contains verified domain name sales collected from daily NameBio sales reports published on the NamePros forum. Each record represents a unique domain sale with the original transaction data enriched with structural, linguistic, and AI-generated classification fields.
The data was collected in a single scraping session covering the full period from 2023-01-01 to 2026-04-02 across 1,184 daily report files. Duplicate sales of the same domain across multiple reports have been removed — each domain appears only once (the first recorded sale).
Note on coverage: A small number of daily reports may have been missing or skipped on the source forum. Coverage is estimated at 99%+ of published reports for the period.
Data source: NameBio daily sales reports published as user-generated content on the NamePros forum. Sales figures represent publicly reported transaction prices. The collection method is fully legal — the source data consists of factual transaction records in publicly accessible forum posts.
Archive Contents
dataset_domains_v1/
├── README.md                      ← this file
├── requirements.txt               ← Python dependencies for vector DB
├── data/
│   ├── raw/                       ← 1,184 original scraped JSON files (one per day)
│   │   ├── 2023-01-01.json
│   │   ├── 2023-01-02.json
│   │   └── ...
│   └── domains_enriched.jsonl     ← 87,089 enriched domain records (primary dataset)
└── vector_db/
    ├── chroma/                    ← ChromaDB vector database (ready to use)
    └── search_example.py          ← working semantic search example
Raw JSON files
Each file in data/raw/ corresponds to one daily NameBio report and contains:
Report date and source URL
Summary statistics (total volume, average price, top sale of the day)
Full list of reported sales with domain, TLD, price, and venue
Breakdowns by TLD type, individual TLD, venue, and domain category
These files contain the unmodified scraped data and serve as the ground truth source for the enriched dataset.
Primary Dataset: domains_enriched.jsonl
Each line is a valid JSON object representing one domain sale. Fields are divided into two groups: those generated by AI classification and those computed by deterministic code.
Example record
{
  "domain": "vetec.com",
  "price": 13678,
  "industry_category": "Technology",
  "sentiment": "Neutral",
  "brandability_score": 8,
  "keyword_commercial_value": "Medium",
  "tld_keyword_match": 5,
  "use_case": "SaaS/App",
  "venue": "Sedo",
  "sld": "vetec",
  "tld": "com",
  "tld_tier": 1,
  "char_count": 5,
  "contains_number": false,
  "contains_hyphen": false,
  "is_dictionary_word": false,
  "pattern_type": "LLLLL",
  "tld_is_hack": false,
  "hack_word": null,
  "word_count": 2,
  "word_structure": "brandable",
  "language_detected": "Generic",
  "monetization_potential": "High",
  "pronounceability_score": 10
}
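Because every line is a standalone JSON object, the file can be read with the standard library alone. The following is a minimal sketch rather than part of the archive; the helper names `iter_records` and `load_records` are illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def load_records(path: str) -> list:
    """Read the whole file into memory as a list of dicts."""
    return list(iter_records(path))
```

Typical use would be `records = load_records("data/domains_enriched.jsonl")`. The generator form is handy when only a single pass over the data is needed.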
Field Reference
Core fields (from raw data)
| Field | Type | Description |
|---|---|---|
| domain | string | Full domain name (e.g. vetec.com) |
| price | integer | Sale price in USD |
| venue | string | Marketplace where the sale occurred (GoDaddy, Sedo, Afternic, DropCatch, Namecheap, Dynadot, etc.) |
AI-generated fields
These fields were generated using GPT-5 nano via batch API. Accuracy varies by domain complexity: short, ambiguous, or non-English domains may have lower classification confidence. The classification prompt used industry-standard domain analysis criteria. In rare cases, the model may have returned a value outside the documented set of values for a field.
| Field | Type | Values | Description |
|---|---|---|---|
| industry_category | string | Technology, Finance, Health, Real Estate, Legal, E-commerce, Travel, Food & Beverage, Education, Entertainment, Sports, Automotive, Business, Adult, Crypto, AI & ML, Marketing, Media, Gaming, Generic | Best-fit industry vertical for the domain |
| sentiment | string | Positive, Neutral, Negative | Emotional tone implied by the domain name |
| brandability_score | integer | 1–10 | How memorable and brand-suitable the domain is. 10 = short, catchy, easy to recall (e.g. novo). 1 = long, awkward, hard to brand |
| keyword_commercial_value | string | High, Medium, Low | Commercial intent and advertiser value of the keywords in the domain |
| tld_keyword_match | integer | 1–10 | How well the TLD complements the SLD semantically. 5 = neutral standard TLDs (.com/.net/.org always score 5). 7 = .ai when SLD relates to AI. 9–10 = perfect semantic match (e.g. estate.today). Short brandable names on .ai score 5–6, not higher |
| use_case | string | Business Website, E-commerce, Blog/Media, SaaS/App, Personal Brand, Generic | Most likely intended use of the domain |
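Since the model occasionally returns a value outside the documented sets, it may be worth screening records before relying on these fields. The sketch below checks a subset of the AI-generated fields against the values listed above; the function name `invalid_ai_fields` is an illustrative assumption, and the mapping can be extended to the remaining fields in the same way.

```python
ALLOWED_VALUES = {
    "industry_category": {
        "Technology", "Finance", "Health", "Real Estate", "Legal",
        "E-commerce", "Travel", "Food & Beverage", "Education",
        "Entertainment", "Sports", "Automotive", "Business", "Adult",
        "Crypto", "AI & ML", "Marketing", "Media", "Gaming", "Generic",
    },
    "sentiment": {"Positive", "Neutral", "Negative"},
    "use_case": {
        "Business Website", "E-commerce", "Blog/Media", "SaaS/App",
        "Personal Brand", "Generic",
    },
}
SCORE_FIELDS = ("brandability_score",)  # documented as integers 1-10


def invalid_ai_fields(record: dict) -> list:
    """Return names of checked fields whose value is missing or undocumented."""
    bad = []
    for name, allowed in ALLOWED_VALUES.items():
        value = record.get(name)
        if not isinstance(value, str) or value not in allowed:
            bad.append(name)
    for name in SCORE_FIELDS:
        value = record.get(name)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            bad.append(name)
    return bad
```

A record that passes returns an empty list, so filtering is a one-liner such as `clean = [r for r in records if not invalid_ai_fields(r)]`.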
Code-computed fields
These fields are determined algorithmically with high consistency. Accuracy depends on the complexity of the domain pattern.
| Field | Type | Description |
|---|---|---|
| sld | string | Second-level domain, the part before the TLD (e.g. vetec from vetec.com) |
| tld | string | Top-level domain without the dot (e.g. com, ai, io) |
| tld_tier | integer | TLD tier classification (see TLD Tiers section below) |
| char_count | integer | Character count of the SLD only (hyphens included) |
| contains_number | boolean | Whether the SLD contains any digit |
| contains_hyphen | boolean | Whether the SLD contains a hyphen |
| is_dictionary_word | boolean | Whether the full SLD (stripped of hyphens) is a known dictionary word in English, Spanish, German, or French |
| pattern_type | string or null | Character pattern using L (letter) and N (digit) notation for SLDs up to 6 characters. null for longer domains. Examples: LLL, LLLL, LLNN, LLLNNN |
| tld_is_hack | boolean | Whether the SLD + TLD together form a real word (domain hack). Example: del.icio.us |
| hack_word | string or null | The full word formed if tld_is_hack is true, otherwise null |
| word_count | integer | Estimated number of words in the SLD, detected via wordninja splitting |
| word_structure | string | Structural classification of the SLD (see Word Structure section below) |
| language_detected | string | Detected language of the SLD (see Language Detection section below) |
| monetization_potential | string | High / Medium / Low, derived from price: High ≥ $5,000, Medium ≥ $500, Low < $500 |
| pronounceability_score | integer | 1–10. Algorithmic score based on vowel ratio, consonant cluster length, and total character count. 10 = easy to pronounce, 1 = very hard |
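Several of the simpler fields above can be recomputed directly from the SLD and price, which is useful for sanity-checking records or enriching new domains the same way. This is a sketch based only on the descriptions in the table, not the original enrichment code; in particular, how pattern_type treats hyphens is not documented, so returning None for them is an assumption.

```python
from typing import Optional


def pattern_type(sld: str) -> Optional[str]:
    """L/N pattern for SLDs of up to 6 characters, otherwise None."""
    if not sld or len(sld) > 6:
        return None
    # Assumption: any character other than an ASCII letter or digit
    # (e.g. a hyphen) yields None; the table does not cover this case.
    if not all(c.isascii() and c.isalnum() for c in sld):
        return None
    return "".join("N" if c.isdigit() else "L" for c in sld)


def monetization_potential(price: int) -> str:
    """Price buckets as documented: High >= 5000, Medium >= 500, else Low."""
    if price >= 5000:
        return "High"
    if price >= 500:
        return "Medium"
    return "Low"


def structural_fields(sld: str, price: int) -> dict:
    """Recompute a handful of the code-computed fields for one record."""
    return {
        "char_count": len(sld),  # hyphens included
        "contains_number": any(c.isdigit() for c in sld),
        "contains_hyphen": "-" in sld,
        "pattern_type": pattern_type(sld),
        "monetization_potential": monetization_potential(price),
    }
```

For the example record, `structural_fields("vetec", 13678)` reproduces the stored char_count, pattern_type, and monetization_potential values.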
TLD Tiers
| Tier | TLDs | Description |
|---|---|---|
| 1 | com | The gold standard |
| 2 | net, org, io, ai, co | Strong established extensions |
| 3 | app, dev, me, us, tv, biz, info, mobi | Mainstream niche extensions |
| 4 | de, fr, es, uk, ru, br, cn, jp, nl, pl, it, ca, au, mx, ar, ch, at, be, se, no, dk, fi, nz, za, in, pt, cz | Major country code TLDs |
| 5 | Everything else | All other TLDs |
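The tier table maps directly onto a dictionary lookup with a default of 5. The sketch below mirrors the table; tolerating a leading dot and mixed case in the input is an added assumption, not documented behavior.

```python
TIER_BY_TLD = {}
for tier, tlds in {
    1: "com",
    2: "net org io ai co",
    3: "app dev me us tv biz info mobi",
    4: ("de fr es uk ru br cn jp nl pl it ca au mx ar ch at be se no dk "
        "fi nz za in pt cz"),
}.items():
    for tld in tlds.split():
        TIER_BY_TLD[tld] = tier


def tld_tier(tld: str) -> int:
    """Return the documented tier for a TLD; anything unlisted is tier 5."""
    # Assumption: input may carry a leading dot or mixed case
    return TIER_BY_TLD.get(tld.lower().lstrip("."), 5)
```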
Word Structure Values
The word_structure field classifies the structural composition of the SLD:
| Value | Description | Examples |
|---|---|---|
| single_word | A single real dictionary word | dogs.com, cloud.io |
| compound_words | Two or more real words combined or hyphenated | moonpool.com, dog-week.com |
| prefix+word | Known startup prefix + real word | getpaid.com, myteam.io, gofast.co |
| word+suffix | Real word + morphological suffix | cloudify.com, talkable.io |
| word+number | Real word followed by digits | trade365.com |
| number+word | Digits followed by a real word | 99designs.com |
| pure_numeric | Only digits | 8888.com, 12345.net |
| acronym | Short (≤5 chars), no digits, not a word, consonant-heavy | wcld.com, mrzx.io |
| short_code | Mixed letters and digits, short pattern | b2b.io, qp99.com, d444.com |
| brandable | Invented, pronounceable, does not fit other categories | plext.com, haikujam.com |
| random | Unpronounceable, no vowels or near-zero vowel ratio | xzpqt.com |
Language Detection
The language_detected field uses the Lingua library with a minimum confidence threshold. Short SLDs (under 7 characters after stripping digits and hyphens) default to Generic as reliable detection is not possible for short strings.
| Value | Meaning |
|---|---|
| English | SLD detected as English or composed of English words |
| Spanish | SLD detected as Spanish |
| German | SLD detected as German |
| French | SLD detected as French |
| Chinese | SLD detected as Chinese (pinyin romanization) |
| Other | Portuguese, Russian, Japanese, or other detected language |
| Generic | Too short to classify, or no language signal detected |
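The short-string rule is easy to reproduce when checking why a record ended up as Generic. The sketch below implements only that rule; the Lingua-based step is not reproduced, and `detect` is a hypothetical callable standing in for it.

```python
MIN_ALPHA_CHARS = 7  # threshold stated in this section


def needs_language_detection(sld: str) -> bool:
    """False when the SLD is too short and should default to Generic."""
    # Strip digits and hyphens before measuring length, as described above
    stripped = "".join(c for c in sld if not c.isdigit() and c != "-")
    return len(stripped) >= MIN_ALPHA_CHARS


def language_or_generic(sld: str, detect) -> str:
    """Apply the short-string rule, then defer to a detector callable.

    `detect` is a hypothetical function (sld -> label or None) standing in
    for the Lingua-based detection with its confidence threshold.
    """
    if not needs_language_detection(sld):
        return "Generic"
    return detect(sld) or "Generic"
```

Under this rule `trade365` counts as five characters and stays Generic regardless of what a detector would say.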
Vector Database
The vector_db/chroma/ folder contains a ready-to-use ChromaDB vector database with semantic embeddings for all 87,089 domains.
Embedding model: sentence-transformers/all-MiniLM-L6-v2
Embedding dimensions: 384
Similarity metric: Cosine
Database size: ~387 MB
Each embedding was generated from a concatenated text representation of the domain's semantic fields:
{domain} {industry_category} {use_case} {word_structure} {word_structure_hint} {sentiment} {keyword_commercial_value} {language_detected}
All metadata fields are stored alongside each vector in ChromaDB and are available for filtering without any additional files.
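To embed new domains consistently with the stored vectors, the same text representation would need to be rebuilt from each record. The sketch below follows the template above, assuming the keys use underscore spellings (for example keyword_commercial_value). Note that word_structure_hint is not among the documented record fields, and how the original pipeline handled missing values is unknown, so skipping them is an assumption.

```python
EMBED_FIELDS = (
    "domain", "industry_category", "use_case", "word_structure",
    "word_structure_hint", "sentiment", "keyword_commercial_value",
    "language_detected",
)


def embedding_text(record: dict) -> str:
    """Join the semantic fields with single spaces, in template order."""
    # Assumption: missing or null fields are skipped rather than rendered
    parts = (record.get(name) for name in EMBED_FIELDS)
    return " ".join(str(p) for p in parts if p not in (None, ""))
```

The resulting string can be passed to the same all-MiniLM-L6-v2 model used for the bundled database.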
Important: how to search effectively
The vector database is optimized for semantic queries — searching by meaning, industry, use case, and brand feel.
✅ Good semantic queries:
"AI startup innovative platform"
"medical health clinic"
"luxury travel premium brand"
"finance investment tool"
⚠️ Structural properties (word structure, TLD, numeric patterns) do not perform well as query text. Use where filters instead:
# Find numeric domains on .com
search("investment finance", filters={
    "$and": [
        {"word_structure": {"$eq": "pure_numeric"}},
        {"tld": {"$eq": "com"}}
    ]
})
Setup
pip install -r requirements.txt
requirements.txt:
chromadb
sentence-transformers
Basic usage
import chromadb
from sentence_transformers import SentenceTransformer

# Open the bundled database and load the model the embeddings were built with
client = chromadb.PersistentClient(path="./vector_db/chroma")
collection = client.get_collection("domains")
model = SentenceTransformer("all-MiniLM-L6-v2")

def search(query, filters=None, top_k=10):
    # Embed the query text as a normalized vector
    vec = model.encode([query], normalize_embeddings=True).tolist()
    res = collection.query(
        query_embeddings=vec,
        n_results=top_k,
        where=filters,
        include=["metadatas", "distances"]
    )
    # Convert cosine distance to a similarity score (1 - distance)
    return [{**meta, "score": round(1 - dist, 3)}
            for meta, dist in zip(res["metadatas"][0], res["distances"][0])]

# Semantic search
results = search("AI startup brandable short domain")

# With filters (multiple conditions require $and)
results = search("technology platform", filters={
    "$and": [
        {"tld": {"$eq": "com"}},
        {"price": {"$lte": 5000}}
    ]
})
Available filter operators: $eq, $ne, $gt, $gte, $lt, $lte
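Because multiple conditions must be wrapped in $and while a single condition is passed on its own, a small helper can keep filter construction tidy. This is a sketch only; `build_where` and `price_between` are illustrative names, not part of the archive or of ChromaDB.

```python
from typing import Optional


def build_where(*conditions: dict) -> Optional[dict]:
    """Combine single-field conditions into one `where` dict."""
    conds = [c for c in conditions if c]
    if not conds:
        return None          # no filtering
    if len(conds) == 1:
        return conds[0]      # a single condition needs no wrapper
    return {"$and": conds}   # multiple conditions sit under $and


def price_between(low: int, high: int) -> list:
    """Two conditions expressing low <= price <= high."""
    return [{"price": {"$gte": low}}, {"price": {"$lte": high}}]
```

For example, `search("technology platform", filters=build_where({"tld": {"$eq": "com"}}, *price_between(1000, 3000)))` restricts results to .com sales between $1,000 and $3,000.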
Example output (search_example.py)
Running the included vector_db/search_example.py produces results like these:
=== AI startup domains ===
autonomously.ai score=0.603 price=$1000 structure=brandable tld=ai
aiagent.bot score=0.577 price=$1282 structure=brandable tld=bot
nanobot.ai score=0.566 price=$50000 structure=brandable tld=ai
virally.ai score=0.562 price=$1590 structure=brandable tld=ai
reinvention.ai score=0.560 price=$50000 structure=brandable tld=ai
=== Brandable .com under $3000 ===
mementovivere.com score=0.694 price=$1460 structure=brandable tld=com
endsars.com score=0.667 price=$1525 structure=brandable tld=com
isawearthlings.com score=0.667 price=$2090 structure=brandable tld=com
=== Numeric domains .com ===
1789-1815.com score=0.019 price=$3300 structure=pure_numeric tld=com
668899.com score=-0.001 price=$1103 structure=pure_numeric tld=com
23999.com score=-0.002 price=$10500 structure=pure_numeric tld=com
=== High brandability score, .ai tld ===
rankable.ai score=0.373 price=$1324 structure=compound_words tld=ai
trainable.ai score=0.349 price=$1002 structure=brandable tld=ai
nanobot.ai score=0.345 price=$50000 structure=brandable tld=ai
=== Single word domains under $2000 ===
wellness.life score=0.589 price=$1390 structure=single_word tld=life
health.guru score=0.509 price=$1114 structure=single_word tld=guru
workout.me score=0.471 price=$1025 structure=single_word tld=me
Note on numeric domain scores: Near-zero or negative cosine scores for pure_numeric domains are expected, because numeric strings carry no semantic meaning for the embedding model. Always use a word_structure filter to retrieve numeric domains, as shown above.
Data Quality Notes
A small number of records (under 100) were excluded due to AI batch processing errors that produced malformed JSON output
Duplicate sales (same domain appearing in multiple daily reports) have been deduplicated — only the first recorded sale is kept
AI-classified fields (industry_category, use_case, sentiment, etc.) were generated using a cost-optimized model. Accuracy is generally good for clear English-language domains and may be lower for short, ambiguous, non-English, or highly niche domains
language_detected returns Generic for any SLD shorter than 7 alphabetic characters, as short strings do not provide sufficient signal for reliable language detection
A small number of daily reports may be absent if they were not published or were inaccessible at collection time
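The deduplication rule noted above (keep only the first recorded sale of each domain) can be sketched as follows for anyone rebuilding the dataset from the raw files. The input shape is an assumption: each sale is a dict with hypothetical keys "date" (ISO string) and "domain", which may not match the raw JSON layout.

```python
def first_sales(sales: list) -> list:
    """Keep the earliest sale per domain.

    Each item is assumed to be a dict with hypothetical keys "date"
    (ISO string such as "2023-01-01") and "domain".
    """
    seen = {}
    # ISO dates sort chronologically as plain strings; sorted() is stable
    for sale in sorted(sales, key=lambda s: s["date"]):
        # Assumption: domains are compared case-insensitively
        seen.setdefault(sale["domain"].lower(), sale)
    return list(seen.values())
```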
License
You may use this dataset for any purpose — personal, research, or commercial — including using it to train models, build appraisal tools, power search products, or analyze domain market trends.
You may not:
Resell or redistribute the original files from this archive (the .jsonl, raw JSON files, or the vector database as-is)
Publish this dataset publicly for free download
You may:
Build and sell products or services based on this data
Use the vector database in a commercial production application
Create and sell derivative datasets (e.g. re-enriched with your own models)
Updates
This is v1.0 covering 2023-01-01 through 2026-04-02.
Future updates may include: extended date coverage, re-enrichment with higher-accuracy models, and additional computed fields. Updates are not guaranteed and depend on demand. If an update is released, buyers will receive it free of charge upon request. Notification will be sent where the selling platform allows it.
If you have specific field requests or find systematic classification errors, feel free to reach out.
Dataset compiled and enriched independently. Not affiliated with NameBio or NamePros.