@domains · Ouro

Domain Sales Dataset: 87K Real Transactions + AI Enrichment + Vector Search (2023–2026)
.zip
# Domain Sales Dataset v1.0

87,089 real domain sales | January 2023 – April 2026 | 21 enriched fields

## Overview

This dataset contains verified domain name sales collected from daily NameBio sales reports published on the NamePros forum. Each record represents a unique domain sale, with the original transaction data enriched with structural, linguistic, and AI-generated classification fields.

The data was collected in a single scraping session covering 2023-01-01 to 2026-04-02 across 1,184 daily report files. Duplicate sales of the same domain across multiple reports have been removed: each domain appears only once (the first recorded sale).

**Note on coverage:** A small number of daily reports may have been missing or skipped on the source forum. Coverage is estimated at 99%+ of published reports for the period.

**Data source:** NameBio daily sales reports published as user-generated content on the NamePros forum. Sales figures represent publicly reported transaction prices. The collection method is fully legal: the source data consists of factual transaction records in publicly accessible forum posts.

## Archive Contents

```
datasetdomainsv1/
├── README.md                    ← this file
├── requirements.txt             ← Python dependencies for vector DB
├── data/
│   ├── raw/                     ← 1,184 original scraped JSON files (one per day)
│   │   ├── 2023-01-01.json
│   │   ├── 2023-01-02.json
│   │   └── ...
│   └── domains_enriched.jsonl   ← 87,089 enriched domain records (primary dataset)
└── vector_db/
    ├── chroma/                  ← ChromaDB vector database (ready to use)
    └── search_example.py        ← working semantic search example
```

## Raw JSON files

Each file in `data/raw/` corresponds to one daily NameBio report and contains:

- Report date and source URL
- Summary statistics (total volume, average price, top sale of the day)
- Full list of reported sales with domain, TLD, price, and venue
- Breakdowns by TLD type, individual TLD, venue, and domain category

These files contain the unmodified scraped data and serve as the ground truth source for the enriched dataset.

## Primary Dataset: domains_enriched.jsonl

Each line is a valid JSON object representing one domain sale. Fields are divided into two groups: those generated by AI classification and those computed by deterministic code.

### Example record

```json
{
  "domain": "vetec.com",
  "price": 13678,
  "industry_category": "Technology",
  "sentiment": "Neutral",
  "brandability_score": 8,
  "keyword_commercial_value": "Medium",
  "tld_keyword_match": 5,
  "use_case": "SaaS/App",
  "venue": "Sedo",
  "sld": "vetec",
  "tld": "com",
  "tld_tier": 1,
  "char_count": 5,
  "contains_number": false,
  "contains_hyphen": false,
  "is_dictionary_word": false,
  "pattern_type": "LLLLL",
  "tld_is_hack": false,
  "hack_word": null,
  "word_count": 2,
  "word_structure": "brandable",
  "language_detected": "Generic",
  "monetization_potential": "High",
  "pronounceability_score": 10
}
```

## Field Reference

### Core fields (from raw data)

| Field | Type | Description |
|---|---|---|
| `domain` | string | Full domain name (e.g. `vetec.com`) |
| `price` | integer | Sale price in USD |
| `venue` | string | Marketplace where the sale occurred (GoDaddy, Sedo, Afternic, DropCatch, Namecheap, Dynadot, etc.) |

### AI-generated fields

These fields were generated using GPT-5 nano via batch API. Accuracy varies by domain complexity: short, ambiguous, or non-English domains may have lower classification confidence. The classification prompt used industry-standard domain analysis criteria. In rare cases, the model may have returned a value that is not in the lists below.

| Field | Type | Values | Description |
|---|---|---|---|
| `industry_category` | string | Technology, Finance, Health, Real Estate, Legal, E-commerce, Travel, Food & Beverage, Education, Entertainment, Sports, Automotive, Business, Adult, Crypto, AI & ML, Marketing, Media, Gaming, Generic | Best-fit industry vertical for the domain |
| `sentiment` | string | Positive, Neutral, Negative | Emotional tone implied by the domain name |
| `brandability_score` | integer | 1–10 | How memorable and brand-suitable the domain is. 10 = short, catchy, easy to recall (e.g. novo). 1 = long, awkward, hard to brand |
| `keyword_commercial_value` | string | High, Medium, Low | Commercial intent and advertiser value of the keywords in the domain |
| `tld_keyword_match` | integer | 1–10 | How well the TLD complements the SLD semantically. 5 = neutral standard TLDs (.com/.net/.org always score 5). 7 = .ai when the SLD relates to AI. 9–10 = perfect semantic match (e.g. estate.today). Short brandable names on .ai score 5–6, not higher |
| `use_case` | string | Business Website, E-commerce, Blog/Media, SaaS/App, Personal Brand, Generic | Most likely intended use of the domain |

### Code-computed fields

These fields are determined algorithmically with high consistency. Accuracy depends on the complexity of the domain pattern.

| Field | Type | Description |
|---|---|---|
| `sld` | string | Second-level domain: the part before the TLD (e.g. `vetec` from `vetec.com`) |
| `tld` | string | Top-level domain without the dot (e.g. `com`, `ai`, `io`) |
| `tld_tier` | integer | TLD tier classification (see TLD Tiers section below) |
| `char_count` | integer | Character count of the SLD only (hyphens included) |
| `contains_number` | boolean | Whether the SLD contains any digit |
| `contains_hyphen` | boolean | Whether the SLD contains a hyphen |
| `is_dictionary_word` | boolean | Whether the full SLD (stripped of hyphens) is a known dictionary word in English, Spanish, German, or French |
| `pattern_type` | string or null | Character pattern using L (letter) and N (digit) notation for SLDs up to 6 characters. `null` for longer domains. Examples: LLL, LLLL, LLNN, LLLNNN |
| `tld_is_hack` | boolean | Whether the SLD + TLD together form a real word (domain hack). Example: del.icio.us |
| `hack_word` | string or null | The full word formed if `tld_is_hack` is true, otherwise `null` |
| `word_count` | integer | Estimated number of words in the SLD, detected via wordninja splitting |
| `word_structure` | string | Structural classification of the SLD (see Word Structure section below) |
| `language_detected` | string | Detected language of the SLD (see Language Detection section below) |
| `monetization_potential` | string | High / Medium / Low, derived from price: High ≥ $5,000, Medium ≥ $500, Low < $500 |
| `pronounceability_score` | integer | 1–10. Algorithmic score based on vowel ratio, consonant cluster length, and total character count. 10 = easy to pronounce, 1 = very hard |

## TLD Tiers

| Tier | TLDs | Description |
|---|---|---|
| 1 | com | The gold standard |
| 2 | net, org, io, ai, co | Strong established extensions |
| 3 | app, dev, me, us, tv, biz, info, mobi | Mainstream niche extensions |
| 4 | de, fr, es, uk, ru, br, cn, jp, nl, pl, it, ca, au, mx, ar, ch, at, be, se, no, dk, fi, nz, za, in, pt, cz | Major country code TLDs |
| 5 | Everything else | All other TLDs |

## Word Structure Values

The `word_structure` field classifies the structural composition of the SLD:

| Value | Description | Examples |
|---|---|---|
| `single_word` | A single real dictionary word | dogs.com, cloud.io |
| `compound_words` | Two or more real words combined or hyphenated | moonpool.com, dog-week.com |
| `prefix+word` | Known startup prefix + real word | getpaid.com, myteam.io, gofast.co |
| `word+suffix` | Real word + morphological suffix | cloudify.com, talkable.io |
| `word+number` | Real word followed by digits | trade365.com |
| `number+word` | Digits followed by a real word | 99designs.com |
| `pure_numeric` | Only digits | 8888.com, 12345.net |
| `acronym` | Short (≤5 chars), no digits, not a word, consonant-heavy | wcld.com, mrzx.io |
| `short_code` | Mixed letters and digits, short pattern | b2b.io, qp99.com, d444.com |
| `brandable` | Invented, pronounceable, does not fit other categories | plext.com, haikujam.com |
| `random` | Unpronounceable, no vowels or near-zero vowel ratio | xzpqt.com |

## Language Detection

The `language_detected` field uses the Lingua library with a minimum confidence threshold. Short SLDs (under 7 characters after stripping digits and hyphens) default to Generic, because reliable detection is not possible for short strings.

| Value | Meaning |
|---|---|
| English | SLD detected as English or composed of English words |
| Spanish | SLD detected as Spanish |
| German | SLD detected as German |
| French | SLD detected as French |
| Chinese | SLD detected as Chinese (pinyin romanization) |
| Other | Portuguese, Russian, Japanese, or other detected language |
| Generic | Too short to classify, or no language signal detected |

## Vector Database

The `vector_db/chroma/` folder contains a ready-to-use ChromaDB vector database with semantic embeddings for all 87,089 domains.

- Embedding model: sentence-transformers/all-MiniLM-L6-v2
- Embedding dimensions: 384
- Similarity metric: Cosine
- Database size: ~387 MB

Each embedding was generated from a concatenated text representation of the domain's semantic fields:

```
{domain} {industry_category} {use_case} {word_structure} {word_structure_hint} {sentiment} {keyword_commercial_value} {language_detected}
```

All metadata fields are stored alongside each vector in ChromaDB and are available for filtering without any additional files.

### Important: how to search effectively

The vector database is optimized for semantic queries: searching by meaning, industry, use case, and brand feel.

✅ Good semantic queries:

- "AI startup innovative platform"
- "medical health clinic"
- "luxury travel premium brand"
- "finance investment tool"

⚠️ Structural properties (word structure, TLD, numeric patterns) do not perform well as query text.
Use `where` filters instead:

```python
# Find numeric domains on .com
search("investment finance", filters={
    "$and": [
        {"word_structure": {"$eq": "pure_numeric"}},
        {"tld": {"$eq": "com"}}
    ]
})
```

### Setup

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```
chromadb
sentence-transformers
```

### Basic usage

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Open the bundled database and load the same model used to build the embeddings
client = chromadb.PersistentClient(path="./vector_db/chroma")
collection = client.get_collection("domains")
model = SentenceTransformer("all-MiniLM-L6-v2")

def search(query, filters=None, top_k=10):
    # Encode the query as a normalized vector, then ask ChromaDB for the nearest domains
    vec = model.encode([query], normalize_embeddings=True).tolist()
    res = collection.query(
        query_embeddings=vec,
        n_results=top_k,
        where=filters,
        include=["metadatas", "distances"]
    )
    # Convert cosine distance to a similarity score and merge it into each metadata dict
    return [{**meta, "score": round(1 - dist, 3)}
            for meta, dist in zip(res["metadatas"][0], res["distances"][0])]

# Semantic search
results = search("AI startup brandable short domain")

# With filters (multiple conditions require $and)
results = search("technology platform", filters={
    "$and": [
        {"tld": {"$eq": "com"}},
        {"price": {"$lte": 5000}}
    ]
})
```

Available filter operators: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`

### Example output (search_example.py)

Running the included `vector_db/search_example.py` produces results like these:

```
=== AI startup domains ===
autonomously.ai      score=0.603   price=$1000    structure=brandable       tld=ai
aiagent.bot          score=0.577   price=$1282    structure=brandable       tld=bot
nanobot.ai           score=0.566   price=$50000   structure=brandable       tld=ai
virally.ai           score=0.562   price=$1590    structure=brandable       tld=ai
reinvention.ai       score=0.560   price=$50000   structure=brandable       tld=ai

=== Brandable .com under $3000 ===
mementovivere.com    score=0.694   price=$1460    structure=brandable       tld=com
endsars.com          score=0.667   price=$1525    structure=brandable       tld=com
isawearthlings.com   score=0.667   price=$2090    structure=brandable       tld=com

=== Numeric domains .com ===
1789-1815.com        score=0.019   price=$3300    structure=pure_numeric    tld=com
668899.com           score=-0.001  price=$1103    structure=pure_numeric    tld=com
23999.com            score=-0.002  price=$10500   structure=pure_numeric    tld=com

=== High brandability score, .ai tld ===
rankable.ai          score=0.373   price=$1324    structure=compound_words  tld=ai
trainable.ai         score=0.349   price=$1002    structure=brandable       tld=ai
nanobot.ai           score=0.345   price=$50000   structure=brandable       tld=ai

=== Single word domains under $2000 ===
wellness.life        score=0.589   price=$1390    structure=single_word     tld=life
health.guru          score=0.509   price=$1114    structure=single_word     tld=guru
workout.me           score=0.471   price=$1025    structure=single_word     tld=me
```

**Note on numeric domain scores:** Near-zero or negative cosine scores for `pure_numeric` domains are expected, because numeric strings carry no semantic meaning for the embedding model. Always use a `word_structure` filter to retrieve numeric domains, as shown above.

## Data Quality Notes

- A small number of records (under 100) were excluded due to AI batch processing errors that produced malformed JSON output.
- Duplicate sales (same domain appearing in multiple daily reports) have been deduplicated; only the first recorded sale is kept.
- AI-classified fields (`industry_category`, `use_case`, `sentiment`, etc.) were generated using a cost-optimized model. Accuracy is generally good for clear English-language domains and may be lower for short, ambiguous, non-English, or highly niche domains.
- `language_detected` returns Generic for any SLD shorter than 7 alphabetic characters, as short strings do not provide sufficient signal for reliable language detection.
- A small number of daily reports may be absent if they were not published or were inaccessible at collection time.

## License

You may use this dataset for any purpose (personal, research, or commercial), including using it to train models, build appraisal tools, power search products, or analyze domain market trends.
**You may not:**

- Resell or redistribute the original files from this archive (the .jsonl, raw JSON files, or the vector database as-is)
- Publish this dataset publicly for free download

**You may:**

- Build and sell products or services based on this data
- Use the vector database in a commercial production application
- Create and sell derivative datasets (e.g. re-enriched with your own models)

## Updates

This is v1.0 covering 2023-01-01 through 2026-04-02. Future updates may include extended date coverage, re-enrichment with higher-accuracy models, and additional computed fields. Updates are not guaranteed and depend on demand.

If an update is released, buyers will receive it free of charge upon request. Notification will be sent where the selling platform allows it.

If you have specific field requests or find systematic classification errors, feel free to reach out.

Dataset compiled and enriched independently. Not affiliated with NameBio or NamePros.
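
## Appendix: spot-checking code-computed fields

Several code-computed fields follow rules stated directly in the Field Reference (TLD tiers, price thresholds for `monetization_potential`, the L/N notation for `pattern_type`). The following is a minimal sketch of how a buyer might re-derive them to spot-check records. It is not the script used to build the dataset: the helper names (`tld_tier`, `monetization_potential`, `pattern_type`, `check_record`) are hypothetical, and returning `None` for SLDs that contain hyphens or other symbols is an assumption, since the README does not say how such SLDs are encoded.

```python
import json

# TLD tiers copied from the "TLD Tiers" table; any TLD not listed is tier 5.
TLD_TIERS = {
    1: {"com"},
    2: {"net", "org", "io", "ai", "co"},
    3: {"app", "dev", "me", "us", "tv", "biz", "info", "mobi"},
    4: set("de fr es uk ru br cn jp nl pl it ca au mx ar ch at be "
           "se no dk fi nz za in pt cz".split()),
}


def tld_tier(tld: str) -> int:
    """Look up the tier of a TLD given without the leading dot."""
    tld = tld.lower().lstrip(".")
    for tier, tlds in TLD_TIERS.items():
        if tld in tlds:
            return tier
    return 5


def monetization_potential(price: int) -> str:
    """High >= $5,000, Medium >= $500, Low < $500 (thresholds from the README)."""
    if price >= 5000:
        return "High"
    if price >= 500:
        return "Medium"
    return "Low"


def pattern_type(sld: str):
    """L/N pattern for SLDs of up to 6 characters, otherwise None.

    Assumption: SLDs containing hyphens or other symbols also return None.
    """
    if len(sld) > 6:
        return None
    pattern = []
    for ch in sld:
        if ch.isdigit():
            pattern.append("N")
        elif ch.isalpha():
            pattern.append("L")
        else:
            return None
    return "".join(pattern)


def check_record(record: dict) -> list:
    """Return the names of fields that disagree with the rules above."""
    sld = record["sld"]
    expected = {
        "tld_tier": tld_tier(record["tld"]),
        "char_count": len(sld),
        "contains_number": any(c.isdigit() for c in sld),
        "contains_hyphen": "-" in sld,
        "monetization_potential": monetization_potential(record["price"]),
        "pattern_type": pattern_type(sld),
    }
    return [name for name, value in expected.items() if record.get(name) != value]


# A subset of the example record shown in the README.
example = json.loads(
    '{"domain": "vetec.com", "price": 13678, "sld": "vetec", "tld": "com", '
    '"tld_tier": 1, "char_count": 5, "contains_number": false, '
    '"contains_hyphen": false, "pattern_type": "LLLLL", '
    '"monetization_potential": "High"}'
)
print(check_record(example))  # prints [] because every checked field matches
```

To scan the full dataset, the same `check_record` call could be applied to each line of `data/domains_enriched.jsonl` after `json.loads`. Any mismatches it reports may reflect the hyphen assumption above rather than errors in the data.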