
Domain Sales Dataset v1.0

Name: Domain Sales Dataset: 87K Real Transactions + AI Enrichment + Vector Search (2023–2026)
Creator: domains
License: https://en.wikipedia.org/wiki/All_rights_reserved

87,089 real domain sales | January 2023 – April 2026 | 21 enriched fields

Overview

This dataset contains verified domain name sales collected from daily NameBio sales reports published on the NamePros forum. Each record represents a unique domain sale with the original transaction data enriched with structural, linguistic, and AI-generated classification fields.

The data was collected in a single scraping session covering the full period from 2023-01-01 to 2026-04-02 across 1,184 daily report files. Duplicate sales of the same domain across multiple reports have been removed — each domain appears only once (the first recorded sale).

Note on coverage: A small number of daily reports may have been missing or skipped on the source forum. Coverage is estimated at 99%+ of published reports for the period.

Data source: NameBio daily sales reports published as user-generated content on the NamePros forum. Sales figures represent publicly reported transaction prices. The collection method is fully legal — the source data consists of factual transaction records in publicly accessible forum posts.

Archive Contents

plaintext

dataset_domains_v1/
├── README.md                    ← this file
├── requirements.txt             ← Python dependencies for vector DB
├── data/
│   ├── raw/                     ← 1,184 original scraped JSON files (one per day)
│   │   ├── 2023-01-01.json
│   │   ├── 2023-01-02.json
│   │   └── ...
│   └── domains_enriched.jsonl   ← 87,089 enriched domain records (primary dataset)
└── vector_db/
    ├── chroma/                  ← ChromaDB vector database (ready to use)
    └── search_example.py        ← working semantic search example

Raw JSON files

Each file in data/raw/ corresponds to one daily NameBio report and contains:

Report date and source URL
Summary statistics (total volume, average price, top sale of the day)
Full list of reported sales with domain, TLD, price, and venue
Breakdowns by TLD type, individual TLD, venue, and domain category

These files contain the unmodified scraped data and serve as the ground truth source for the enriched dataset.
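Because each raw file is named after its report date, gaps in coverage can be checked directly from the file names. The sketch below is one way to do that; the function name `missing_report_dates` is hypothetical and it assumes every file follows the `YYYY-MM-DD.json` pattern shown in the archive tree. The stated range spans 1,188 calendar days, so 1,184 files would leave about four days unaccounted for if one report was published per day.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_report_dates(raw_dir, start, end):
    """Return the dates in [start, end] that have no <YYYY-MM-DD>.json file."""
    present = set()
    for path in Path(raw_dir).glob("*.json"):
        try:
            present.add(date.fromisoformat(path.stem))
        except ValueError:
            continue  # ignore files whose name is not an ISO date

    missing = []
    day = start
    while day <= end:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

Typical use would be `missing_report_dates("data/raw", date(2023, 1, 1), date(2026, 4, 2))`.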

Primary Dataset: `domains_enriched.jsonl`

Each line is a valid JSON object representing one domain sale. Fields are divided into two groups: those generated by AI classification and those computed by deterministic code.

Example record

json

{
  "domain": "vetec.com",
  "price": 13678,
  "industry_category": "Technology",
  "sentiment": "Neutral",
  "brandability_score": 8,
  "keyword_commercial_value": "Medium",
  "tld_keyword_match": 5,
  "use_case": "SaaS/App",
  "venue": "Sedo",
  "sld": "vetec",
  "tld": "com",
  "tld_tier": 1,
  "char_count": 5,
  "contains_number": false,
  "contains_hyphen": false,
  "is_dictionary_word": false,
  "pattern_type": "LLLLL",
  "tld_is_hack": false,
  "hack_word": null,
  "word_count": 2,
  "word_structure": "brandable",
  "language_detected": "Generic",
  "monetization_potential": "High",
  "pronounceability_score": 10
}
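Since the file holds one JSON object per line, it can be streamed with the standard library alone. This is a minimal sketch; `iter_records` is a hypothetical helper name, and the path in the usage note assumes the archive layout shown earlier.

```python
import json


def iter_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no} is not valid JSON") from exc
```

For example, `[r["price"] for r in iter_records("data/domains_enriched.jsonl") if r["tld"] == "com"]` would collect all `.com` prices without loading anything beyond the standard library.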

Field Reference

Core fields (from raw data)

Field	Type	Description
`domain`	string	Full domain name (e.g. `vetec.com`)
`price`	integer

AI-generated fields

These fields were generated using GPT-5 nano via batch API. Accuracy varies by domain complexity: short, ambiguous, or non-English domains may have lower classification confidence. The classification prompt used industry-standard domain analysis criteria. In rare cases, the model may have returned a value that is not in the documented set for a field.

Field	Type	Values	Description
`industry_category`	string	Technology, Finance, Health, Real Estate, Legal, E-commerce, Travel, Food & Beverage, Education, Entertainment, Sports, Automotive, Business, Adult, Crypto, AI & ML, Marketing, Media, Gaming, Generic	Best-fit industry vertical for the domain
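Because a model-generated field can occasionally hold a value outside its documented set, it may be worth counting such values before relying on a field for grouping or filtering. The sketch below does this for `industry_category` using the values listed above; the helper name `unexpected_values` is hypothetical.

```python
from collections import Counter

# Documented values for industry_category, copied from the table above.
INDUSTRY_CATEGORIES = {
    "Technology", "Finance", "Health", "Real Estate", "Legal", "E-commerce",
    "Travel", "Food & Beverage", "Education", "Entertainment", "Sports",
    "Automotive", "Business", "Adult", "Crypto", "AI & ML", "Marketing",
    "Media", "Gaming", "Generic",
}


def unexpected_values(records, field, allowed):
    """Count values of `field` (including missing ones, as None) not in `allowed`."""
    return Counter(
        r.get(field) for r in records if r.get(field) not in allowed
    )
```

An empty counter means every record uses a documented value for that field.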

Code-computed fields

These fields are determined algorithmically with high consistency. Accuracy depends on the complexity of the domain pattern.

Field	Type	Description
`sld`	string	Second-level domain — the part before the TLD (e.g. `vetec` from `vetec.com`)
`tld`
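As an illustration of how a few of the simpler structural fields could be derived, here is a minimal sketch. It is not the code used to build the dataset: the function name `structural_fields` is hypothetical, and splitting at the first dot is an assumption that may not match how multi-label TLDs were handled.

```python
def structural_fields(domain):
    """Derive a handful of structural fields from a full domain name."""
    # Assumption: the SLD is everything before the first dot.
    sld, _, tld = domain.lower().partition(".")
    return {
        "sld": sld,
        "tld": tld,
        "char_count": len(sld),
        "contains_number": any(ch.isdigit() for ch in sld),
        "contains_hyphen": "-" in sld,
    }
```

For `vetec.com` this reproduces the `sld`, `tld`, `char_count`, `contains_number`, and `contains_hyphen` values of the example record.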

TLD Tiers

Tier	TLDs	Description
1	`com`	The gold standard
2	`net`, `org`,

Word Structure Values

The word_structure field classifies the structural composition of the SLD:

Value	Description	Examples
`single_word`	A single real dictionary word	`dogs.com`, `cloud.io`
`compound_words`	Two or more real words combined or hyphenated

Language Detection

The language_detected field uses the Lingua library with a minimum confidence threshold. Short SLDs (under 7 characters after stripping digits and hyphens) default to Generic as reliable detection is not possible for short strings.

Value	Meaning
`English`	SLD detected as English or composed of English words
`Spanish`	SLD detected as Spanish
`German`	SLD detected as German
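The short-string rule described above can be sketched as a thin wrapper around whatever detector is in use. The wrapper below is an illustration only: `detect_language` and its `detector` argument are hypothetical, the detector is passed in as a plain callable rather than a real Lingua object, and falling back to `Generic` when the detector returns nothing is an assumption about how low-confidence results were handled.

```python
import re

MIN_ALPHA_CHARS = 7  # threshold stated in the section above


def detect_language(sld, detector):
    """Return a language label, or "Generic" for short or undetected SLDs.

    `detector` is any callable that takes a string and returns a language
    name, or None when it is not confident enough.
    """
    letters = re.sub(r"[\d-]", "", sld)  # strip digits and hyphens
    if len(letters) < MIN_ALPHA_CHARS:
        return "Generic"
    return detector(letters) or "Generic"
```

With this rule, `vetec` (5 letters) is labelled `Generic` regardless of what the detector would say, which matches the example record.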

Vector Database

The vector_db/chroma/ folder contains a ready-to-use ChromaDB vector database with semantic embeddings for all 87,089 domains.

Embedding model: sentence-transformers/all-MiniLM-L6-v2
Embedding dimensions: 384
Similarity metric: Cosine
Database size: ~387 MB

Each embedding was generated from a concatenated text representation of the domain's semantic fields:

plaintext

{domain} {industry_category} {use_case} {word_structure} {word_structure_hint} {sentiment} {keyword_commercial_value} {language_detected}

All metadata fields are stored alongside each vector in ChromaDB and are available for filtering without any additional files.
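To embed new domains in a way that is comparable with the stored vectors, the input text should follow the same template. The sketch below rebuilds that text from a record; `embedding_text` is a hypothetical helper, and skipping empty or missing fields (for instance `word_structure_hint`, which is absent from the example record) is an assumption about the original formatting.

```python
# Field order taken from the template above.
EMBEDDING_FIELDS = (
    "domain",
    "industry_category",
    "use_case",
    "word_structure",
    "word_structure_hint",
    "sentiment",
    "keyword_commercial_value",
    "language_detected",
)


def embedding_text(record):
    """Join the semantic fields of a record into one space-separated string."""
    parts = (str(record.get(name) or "") for name in EMBEDDING_FIELDS)
    return " ".join(p for p in parts if p)  # assumption: drop empty fields
```

The resulting string can be passed to the same `all-MiniLM-L6-v2` model used for the database.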

Important: how to search effectively

The vector database is optimized for semantic queries — searching by meaning, industry, use case, and brand feel.

✅ Good semantic queries:

plaintext

"AI startup innovative platform"
"medical health clinic"
"luxury travel premium brand"
"finance investment tool"

⚠️ Structural properties (word structure, TLD, numeric patterns) do not perform well as query text. Use where filters instead:

python

# Find numeric domains on .com
search("investment finance", filters={
    "$and": [
        {"word_structure": {"$eq": "pure_numeric"}},
        {"tld": {"$eq": "com"}}
    ]
})

Setup

bash

pip install -r requirements.txt

requirements.txt:

plaintext

chromadb
sentence-transformers

Basic usage

python

import chromadb
from sentence_transformers import SentenceTransformer

client     = chromadb.PersistentClient(path="./vector_db/chroma")
collection = client.get_collection("domains")
model      = SentenceTransformer("all-MiniLM-L6-v2")

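def search(query, filters=None, top_k=10):
    """Return the top_k closest domains as metadata dicts with a similarity score."""
    # Embed the query with the same model used to build the database
    vec = model.encode([query], normalize_embeddings=True).tolist()
    res = collection.query(
        query_embeddings=vec,
        n_results=top_k,
        where=filters,  # optional metadata filter, e.g. {"tld": {"$eq": "com"}}
        include=["metadatas", "distances"]
    )
    # The collection uses cosine distance, so similarity = 1 - distance
    return [{**meta, "score": round(1 - dist, 3)}
            for meta, dist in zip(res["metadatas"][0], res["distances"][0])]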

# Semantic search
results = search("AI startup brandable short domain")

# With filters (multiple conditions require $and)
results = search("technology platform", filters={
    "$and": [
        {"tld":   {"$eq": "com"}},
        {"price": {"$lte": 5000}}
    ]
})

Available comparison operators: $eq, $ne, $gt, $gte, $lt, $lte. Combine several conditions with $and (all must match) or $or (any may match).
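Filters are often assembled from optional user input, so the number of conditions is not known in advance. The helper below is a small sketch for that case; `build_where` is a hypothetical name, and it assumes that Chroma rejects an `$and` wrapping fewer than two conditions, which is why a single condition is returned unwrapped.

```python
def build_where(*conditions):
    """Combine zero or more filter conditions into a single `where` value."""
    conditions = [c for c in conditions if c]  # drop None and empty dicts
    if not conditions:
        return None  # no filtering
    if len(conditions) == 1:
        return conditions[0]  # a lone condition is passed as-is
    return {"$and": conditions}
```

It can be used with the `search` function above, for example `search("technology platform", filters=build_where({"tld": {"$eq": "com"}}, {"price": {"$lte": 5000}}))`.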

Example output (`search_example.py`)

Running the included vector_db/search_example.py produces results like these:

plaintext

=== AI startup domains ===
   1. autonomously.ai    score=0.603  price=$1000   structure=brandable  tld=ai
   2. aiagent.bot        score=0.577  price=$1282   structure=brandable  tld=bot
   3. nanobot.ai         score=0.566  price=$50000  structure=brandable  tld=ai
   4. virally.ai         score=0.562  price=$1590   structure=brandable  tld=ai
   5. reinvention.ai     score=0.560  price=$50000  structure=brandable  tld=ai

=== Brandable .com under $3000 ===
   1. mementovivere.com  score=0.694  price=$1460   structure=brandable  tld=com
   2. endsars.com        score=0.667  price=$1525   structure=brandable  tld=com
   3. isawearthlings.com score=0.667  price=$2090   structure=brandable  tld=com

=== Numeric domains .com ===
   1. 1789-1815.com      score=0.019  price=$3300   structure=pure_numeric  tld=com
   2. 668899.com         score=-0.001 price=$1103   structure=pure_numeric  tld=com
   3. 23999.com          score=-0.002 price=$10500  structure=pure_numeric  tld=com

=== High brandability score, .ai tld ===
   1. rankable.ai        score=0.373  price=$1324   structure=compound_words  tld=ai
   2. trainable.ai       score=0.349  price=$1002   structure=brandable       tld=ai
   3. nanobot.ai         score=0.345  price=$50000  structure=brandable       tld=ai

=== Single word domains under $2000 ===
   1. wellness.life      score=0.589  price=$1390   structure=single_word  tld=life
   2. health.guru        score=0.509  price=$1114   structure=single_word  tld=guru
   3. workout.me         score=0.471  price=$1025   structure=single_word  tld=me

Note on numeric domain scores: Near-zero or negative cosine scores for pure_numeric domains are expected, because numeric strings carry no semantic meaning for the embedding model. Always use a word_structure filter to retrieve numeric domains, as shown above.

Data Quality Notes

A small number of records (under 100) were excluded due to AI batch processing errors that produced malformed JSON output
Duplicate sales (same domain appearing in multiple daily reports) have been deduplicated — only the first recorded sale is kept
AI-classified fields (industry_category, use_case, sentiment, etc.) were generated using a cost-optimized model. Accuracy is generally good for clear English-language domains and may be lower for short, ambiguous, non-English, or highly niche domains
language_detected returns Generic for any SLD shorter than 7 alphabetic characters, as short strings do not provide sufficient signal for reliable language detection
A small number of daily reports may be absent if they were not published or were inaccessible at collection time
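The deduplication rule (keep only the first recorded sale of each domain) can be sketched as follows. This is an illustration rather than the original pipeline: `keep_first_sale` is a hypothetical name, and the `date` key is an assumed field carrying the report date as an ISO string, which is not part of the enriched records.

```python
def keep_first_sale(sales):
    """Keep the earliest sale per domain.

    `sales` is an iterable of dicts with at least a 'domain' key and an
    ISO-formatted 'date' key (assumed here; ISO dates sort correctly as text).
    """
    first = {}
    for sale in sorted(sales, key=lambda s: s["date"]):
        # setdefault only stores the first sale seen for each domain
        first.setdefault(sale["domain"].lower(), sale)
    return list(first.values())
```

One consequence of this rule is that a domain resold at a higher price later in the period is represented only by its earlier price.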

License

You may use this dataset for any purpose — personal, research, or commercial — including using it to train models, build appraisal tools, power search products, or analyze domain market trends.

You may not:

Resell or redistribute the original files from this archive (the .jsonl, raw JSON files, or the vector database as-is)
Publish this dataset publicly for free download

You may:

Build and sell products or services based on this data
Use the vector database in a commercial production application
Create and sell derivative datasets (e.g. re-enriched with your own models)

Updates

This is v1.0 covering 2023-01-01 through 2026-04-02.

Future updates may include: extended date coverage, re-enrichment with higher-accuracy models, and additional computed fields. Updates are not guaranteed and depend on demand. If an update is released, buyers will receive it free of charge upon request. Notification will be sent where the selling platform allows it.

If you have specific field requests or find systematic classification errors, feel free to reach out.

Dataset compiled and enriched independently. Not affiliated with NameBio or NamePros.
