Deliverable 2 of 3 — Part of the BEE-NET independent verification validation bundle.
The 5K superconductivity classification threshold is the single most consequential design choice in BEE-NET's evaluation. It determines what counts as a "positive," directly controls class balance, and, through the resulting imbalance, inflates specificity-based metrics. This analysis quantifies how the dataset's class structure and estimated model performance change at alternative thresholds.
Data source: Superconducting chemical families — 1,887 composition families from the 3DSC database with per-family Tc statistics (mean, min, max, std, count).
Baseline confusion matrix (5K): Verified from Nascimento et al. (2026): TP=12,405; FN=11,871; FP=2,595; TN=1,268,070; total=1,294,941; prevalence=1.875%; imbalance=52.3:1.
Approach: Because BEE-NET probability scores are not publicly available, we cannot re-evaluate the model at each threshold. Instead, we:
1. Estimate how many 3DSC compounds qualify as positive at each threshold, using the known Tc distribution shape from the SuperCon database literature.
2. Hold the model's predicted positive count fixed at 15,000 (TP+FP = 12,405 + 2,595 at 5K), reflecting a fixed decision boundary.
3. Estimate TP/FP/FN/TN at each threshold and compute the derived metrics.
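The steps above can be sketched in a few lines. This is our reconstruction of the estimation procedure, not BEE-NET code; the fixed predicted-positive count (15,000) and fixed 51.1% recall for thresholds at or above the 5K training boundary are the stated assumptions.

```python
# Sketch of the threshold-sensitivity estimate (thresholds >= 5 K).
# Assumptions: predicted-positive count fixed at 15,000 (TP+FP at 5 K)
# and recall fixed at the verified 51.1%.

TOTAL = 1_294_941            # total compounds in the verified 5 K evaluation
PRED_POS = 15_000            # TP + FP at 5 K, held fixed
RECALL_5K = 12_405 / 24_276  # verified TP / (TP + FN) = 51.1%

def estimate_confusion(actual_positives: int) -> dict:
    """Estimate TP/FP/FN/TN at a threshold >= 5 K."""
    tp = round(RECALL_5K * actual_positives)
    fp = PRED_POS - tp                 # everything else flagged is a false positive
    fn = actual_positives - tp         # missed positives
    tn = TOTAL - tp - fp - fn          # remainder
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Sanity check: at the 5 K baseline (24,276 actual positives) this
# reproduces the verified confusion matrix exactly.
cm = estimate_confusion(24_276)
```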
| Threshold (K) | Actual Positives (TP+FN) | Prevalence | Class Imbalance |
|---|---|---|---|
| 1 | ~27,917 | 2.16% | 45:1 |
| 3 | ~25,490 | 1.97% | 50:1 |
| 5 (baseline) | 24,276 | 1.88% | 52:1 |
| 10 | ~19,906 | 1.54% | 64:1 |
| 20 | ~15,779 | 1.22% | 81:1 |
| 30 | ~12,624 | 0.98% | 102:1 |
| 40 | ~9,710 | 0.75% | 132:1 |
| 77 | ~2,913 | 0.23% | 444:1 |
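The prevalence and imbalance columns follow directly from the positive counts and the fixed dataset size; a minimal check, assuming the verified total of 1,294,941 compounds:

```python
# Derive prevalence and class imbalance from a positive-class count,
# assuming the verified evaluation-set size of 1,294,941 compounds.
TOTAL = 1_294_941

def prevalence_and_imbalance(positives: int) -> tuple:
    """Return (prevalence, negatives-per-positive ratio)."""
    negatives = TOTAL - positives
    return positives / TOTAL, negatives / positives

prev_5k, imb_5k = prevalence_and_imbalance(24_276)   # 5 K baseline row
prev_77k, imb_77k = prevalence_and_imbalance(2_913)  # 77 K row
```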
The 5K threshold sits near a natural inflection in the Tc distribution. Below 5K, the positive class grows slowly (most low-Tc compounds are clustered near 1–3K). Above 5K, each increment removes a substantial fraction of positives — the long tail of BCS superconductors between 5–20K.
| Threshold | Recall | Precision | F1 | MCC | Bal. Accuracy |
|---|---|---|---|---|---|
| 1K | 53.7% | ~100% | 0.699 | 0.729 | 76.9% |
| 3K | 53.7% | 91.2% | 0.676 | 0.695 | 76.8% |
| 5K | 51.1% | 82.7% | 0.632 | 0.645 | 75.4% |
| 10K | 51.1% | 67.8% | 0.583 | 0.583 | 75.4% |
| 20K | 51.1% | 53.8% | 0.524 | 0.518 | 75.3% |
| 30K | 51.1% | 43.0% | 0.467 | 0.463 | 75.2% |
| 40K | 51.1% | 33.1% | 0.402 | 0.406 | 75.2% |
| 77K | 51.1% | 9.9% | 0.166 | 0.222 | 75.0% |
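The derived metrics come from standard definitions applied to each estimated confusion matrix. A self-contained sketch, checked against the verified 5K baseline row:

```python
import math

def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Precision, recall, F1, MCC, and balanced accuracy from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    specificity = tn / (tn + fp)
    bal_acc = (recall + specificity) / 2
    return precision, recall, f1, mcc, bal_acc

# Verified 5 K confusion matrix from Nascimento et al.:
row_5k = metrics(12_405, 2_595, 11_871, 1_268_070)
```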
Notes on interpretation:
Recall is held constant at 51.1% for thresholds ≥5K because the model's decision boundary is trained at 5K and the predicted positive count is fixed. At 1K and 3K, recall increases slightly because additional true positives exist in the 1–5K range that the model also flags.
Precision degrades sharply above 10K because the predicted positive count stays fixed while the true positive count shrinks.
Balanced accuracy is nearly invariant (~75%) because specificity dominates at all thresholds (class imbalance is severe everywhere).
1. The 5K threshold is near-optimal for F1 among practical choices. At 1K, F1 is marginally higher (0.699 vs 0.632) because precision approaches 100%, but the 1K threshold includes many questionable low-Tc "superconductors" and the marginal gain is small.
2. Precision collapses above 10K. At 20K, precision drops to 54%, barely better than a coin flip among predicted positives. At 77K (the practical high-Tc discovery target), precision falls to 9.9%. Any BEE-NET screening for high-Tc candidates would flag ~15,000 compounds, but only ~1,490 of them would actually have Tc ≥ 77K.
3. The 77K target creates a 444:1 imbalance. This is 8.5× worse than the 5K baseline. At this imbalance, even a perfect classifier on the positive class would have ROC-AUC dominated by the trivially high specificity. PR-AUC is the only meaningful metric for this regime.
4. Recall at 51.1% is the binding constraint at all thresholds. The model misses nearly half of superconductors regardless of threshold. This is a fundamental performance limitation, not a threshold artifact.
5. The inflection point is between 5–10K. Below 5K, precision is high and class imbalance is manageable. Above 10K, precision degrades rapidly and the task becomes progressively harder. The 5K threshold represents a reasonable engineering tradeoff.
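Finding 3's point about metric choice can be made concrete: a no-skill classifier scores ROC-AUC = 0.5 at any imbalance, but its expected precision (the PR-curve floor) equals the class prevalence, so the PR baseline exposes the rarity that ROC hides. A minimal illustration, assuming the verified dataset size:

```python
# Chance-level PR baseline at the two imbalance regimes discussed above.
# A random ranker's expected precision equals the class prevalence,
# while its ROC-AUC stays at 0.5 regardless of imbalance.
TOTAL = 1_294_941

def pr_baseline(actual_positives: int) -> float:
    """Expected precision of a no-skill classifier = prevalence."""
    return actual_positives / TOTAL

NO_SKILL_ROC_AUC = 0.5            # independent of class balance
baseline_5k = pr_baseline(24_276)   # ~1.9% floor at 52:1
baseline_77k = pr_baseline(2_913)   # ~0.2% floor at 444:1
```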
If BEE-NET (or any classifier) is to be used for screening candidates with Tc ≥ 77K, the current architecture faces a fundamental challenge: the positive class is too rare (0.23% prevalence), and the model's ~49% miss rate means it would overlook nearly half of viable candidates.
Recommended adaptations:
Tiered screening: Use BEE-NET at 5K for broad filtering, then apply a physics-informed second stage (e.g., cuprate structural motifs, doping levels) for high-Tc triage.
Recall-priority retraining: If the goal is high-Tc discovery, retrain with a lower classification threshold and accept higher false positive rates — FPs are cheap in computational screening.
PR-AUC as primary metric: For any threshold ≥10K, report PR-AUC instead of ROC-AUC. The 52:1→444:1 imbalance progression makes ROC increasingly misleading.
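The tiered-screening recommendation can be sized with back-of-envelope arithmetic. The stage-2 recall and false-positive retention rates below are hypothetical placeholders, not measured values; the starting counts come from the estimates above (~15,000 flagged, of which ~1,489 have Tc ≥ 77K).

```python
# Back-of-envelope yield for a two-stage screen.
# Stage 1: BEE-NET at 5 K flags ~15,000 compounds.
# Stage 2: hypothetical physics-informed filter with assumed rates.

FLAGGED_STAGE1 = 15_000   # predicted positives at 5 K (TP + FP)
TRUE_HIGH_TC = 1_489      # est. flagged compounds with Tc >= 77 K (51.1% of 2,913)

def stage2_yield(stage2_recall: float, stage2_fp_rate: float) -> tuple:
    """Counts surviving a hypothetical second-stage filter, plus its precision."""
    kept_true = stage2_recall * TRUE_HIGH_TC
    kept_false = stage2_fp_rate * (FLAGGED_STAGE1 - TRUE_HIGH_TC)
    precision = kept_true / (kept_true + kept_false)
    return round(kept_true), round(kept_false), precision

# e.g. a stage-2 filter keeping 80% of true positives and 10% of false ones
# lifts precision from ~9.9% to roughly 47%:
kept_true, kept_false, precision = stage2_yield(0.8, 0.1)
```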
Tc distribution fractions estimated from SuperCon database composition counts and the 3DSC family statistics. The key assumption is that BEE-NET's training set Tc distribution is proportional to the 3DSC distribution.
Model predicted positive count held fixed at 15,000 (the verified 5K sum TP+FP = 12,405 + 2,595). In practice, retraining at a different threshold would shift this count, but the direction of the effect is captured here.
All estimates are approximate. The true threshold sensitivity requires BEE-NET probability score outputs, which are not yet available.
This is deliverable 2 of 3 in the BEE-NET validation bundle. Remaining: (1) Direct experimental Tc benchmark, (3) PR curve reconstruction. Data source: 3DSC chemical families dataset. Verification baseline: BEE-NET independent verification.