UniFFBench (Mannan et al., arXiv:2508.05762) made a splash last August by benchmarking six universal machine learning force fields (CHGNet, M3GNet, MACE, MatterSim, SevenNet, Orb) against ~1,500 experimentally characterized mineral structures. Their headline finding: models trained to near-DFT energy accuracy still fail to reproduce experimental properties. Orb v2 achieved 100% MD simulation completion but its elastic tensor predictions were catastrophically wrong (C66 MAPE 100%, R² = -0.898).
We ran six of their benchmark minerals through Ouro's Orb v3 relaxation and ALIGNN prediction routes to see whether the next model version and a different ML architecture fare any better. The results extend UniFFBench's training-evaluation circularity thesis in a direction they didn't test.
Six experimental mineral structures from the MinX benchmark, generated from crystallographic parameters:
| Mineral | Formula | Space group | Exp. density (g/cm³) |
|---|---|---|---|
| Calcite | CaCO₃ | R-3c (167) | 2.71 |
| α-Quartz | SiO₂ | P3₂21 (154) | 2.65 |
| Galena | PbS | Fm-3m (225) | 7.60 |
| Halite | NaCl | Fm-3m (225) | 2.16 |
| Fluorite | CaF₂ | Fm-3m (225) | 3.18 |
| Corundum | Al₂O₃ | R-3c (167) | 4.00 |
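A quick sanity check on structures generated from crystallographic parameters is to recompute the density from the cell and compare it to the experimental value. The sketch below does this for the three cubic minerals using ρ = Z·M / (N_A·a³). The lattice parameters, Z values, and molar masses are assumed textbook values, not numbers read from the CIFs used in this post, and `cubic_density` is a hypothetical helper name.

```python
AVOGADRO = 6.02214076e23  # particles per mole


def cubic_density(a_angstrom: float, z: int, molar_mass: float) -> float:
    """Density in g/cm^3 of a cubic cell: rho = Z * M / (N_A * a^3)."""
    volume_cm3 = (a_angstrom * 1e-8) ** 3  # 1 angstrom = 1e-8 cm
    return z * molar_mass / (AVOGADRO * volume_cm3)


# ASSUMPTION: textbook lattice parameters (angstrom), formula units per cell,
# and molar masses (g/mol). These are not taken from the post's CIF files.
CUBIC_MINERALS = {
    "Halite": {"a": 5.640, "z": 4, "molar_mass": 58.44, "exp_density": 2.16},
    "Fluorite": {"a": 5.463, "z": 4, "molar_mass": 78.07, "exp_density": 3.18},
    "Galena": {"a": 5.936, "z": 4, "molar_mass": 239.27, "exp_density": 7.60},
}

for name, m in CUBIC_MINERALS.items():
    rho = cubic_density(m["a"], m["z"], m["molar_mass"])
    rel_err = abs(rho - m["exp_density"]) / m["exp_density"]
    print(f"{name}: computed {rho:.2f} g/cm^3, relative error {rel_err:.1%}")
```

A large mismatch at this stage would point to a wrong cell setting or a missing formula unit before any ML model is involved. The trigonal cells need the hexagonal volume formula (a²·c·sin 120°) instead of a³.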
Each structure went through two pipelines:
1. Orb v3 relaxation (conservative inf MPA, cell+ionic, fmax=0.03 eV/Å): checking symmetry preservation and volume change
2. ALIGNN predictions: formation energy (MP dataset) and energy above convex hull
The ALIGNN hull energy predictions are the most immediately striking result. Four of these six minerals, all of which exist as the thermodynamic ground state of their composition, are predicted to be unstable:
| Mineral | ALIGNN hull (eV/atom) | Reality |
|---|---|---|
| Calcite | 2.246 | Stable (most common CaCO₃ polymorph) |
| Quartz | 1.623 | Stable (most common SiO₂ polymorph) |
| Corundum | 1.576 | Stable (thermodynamic ground state of Al₂O₃) |
| Galena | 0.398 | Stable (only known PbS polymorph) |
| Halite | 0.014 | Stable ✓ |
| Fluorite | 0.024 | Stable ✓ |
The two it gets right, halite and fluorite, are both simple Fm-3m ionic structures with high symmetry and purely ionic bonding. The four it fails on all have covalent bonding character (Si-O, C-O, Al-O) or heavier elements (Pb). This extends the ALIGNN systematic bias we previously documented for magnetic intermetallics (~1.6 eV/atom formation energy overestimate) into the oxide and chalcogenide mineral space. The failure pattern is consistent: ALIGNN, trained on JARVIS-DFT data, appears to systematically overestimate the energy above the convex hull for structures with mixed ionic-covalent bonding.
UniFFBench's "training-evaluation circularity" finding predicted exactly this. They showed that MatBench Discovery formation energy R² correlates poorly with experimental property R² across all models. Our result makes the consequence concrete: some of the most abundant minerals in the Earth's crust get flagged as thermodynamically unstable by a model that performs well on computational benchmarks.
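To make the screening consequence concrete, here is a minimal sketch of what an automated stability filter would do with the hull energies reported above. The 0.05 eV/atom cutoff and the `rejected_by_filter` helper are assumptions for illustration, not a threshold used by ALIGNN or by any specific pipeline.

```python
# ALIGNN energy above hull (eV/atom), copied from the table above.
ALIGNN_HULL_EV_PER_ATOM = {
    "Calcite": 2.246,
    "Quartz": 1.623,
    "Corundum": 1.576,
    "Galena": 0.398,
    "Halite": 0.014,
    "Fluorite": 0.024,
}


def rejected_by_filter(hull_energies, threshold=0.05):
    """Return the sorted names whose hull energy exceeds the cutoff.

    ASSUMPTION: 0.05 eV/atom is an illustrative cutoff chosen for this
    sketch; real pipelines pick their own value.
    """
    return sorted(name for name, e in hull_energies.items() if e > threshold)


rejected = rejected_by_filter(ALIGNN_HULL_EV_PER_ATOM)
print(f"Rejected {len(rejected)} of {len(ALIGNN_HULL_EV_PER_ATOM)}: {rejected}")
```

With this cutoff the filter discards calcite, corundum, galena, and quartz. Galena, the smallest of the four misses at 0.398 eV/atom, is still nearly eight times the cutoff, so no commonly used threshold would rescue any of them.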
Orb v3 preserved the input space group for 5 of 6 minerals:
| Mineral | SG in → out | Steps | ΔE (eV) | Verdict |
|---|---|---|---|---|
| Calcite | R-3c → R-3c | 5 | -0.24 | ✓ |
| Quartz | P3₂21 → P1 | 294 | -31.33 | ✗ collapse |
| Galena | Fm-3m → Fm-3m | 2 | -0.05 | ✓ |
| Halite | Fm-3m → Fm-3m | 2 | -0.01 | ✓ |
| Fluorite | Fm-3m → Fm-3m | 2 | -0.03 | ✓ |
| Corundum | R-3c → R-3c | 6 | -0.21 | ✓ |
The cubic Fm-3m structures (NaCl, PbS, CaF₂) are trivially easy: 2 steps, minimal energy change, symmetry preserved. The R-3c structures (calcite, corundum) survive intact with small energy adjustments. This is consistent with what we've seen on the magnetic materials side: high-symmetry cubic structures are robust, lower-symmetry trigonal/hexagonal structures are at risk.
The quartz collapse is a new finding. P3₂21 → P1 with a 31.3 eV energy drop over 294 steps is not a subtle symmetry erosion. The structure fundamentally rearranged. α-quartz is the most common SiO₂ polymorph and one of the best-characterized crystal structures in existence. If Orb v3 cannot relax it without destroying its symmetry, that has implications for any screening pipeline that uses Orb v3 as a relaxation step before property prediction.
UniFFBench tested Orb v2 in molecular dynamics (50 ps NPT simulations), not structural relaxation. They reported 100% MD completion for Orb v2. Our test uses Orb v3 in energy minimization (FrechetCellFilter with BFGS). The quartz collapse suggests that Orb v3's energy landscape for SiO₂ has a spurious low-energy P1 basin that the minimizer falls into, even though an MD trajectory (which explores the landscape at finite temperature rather than following a downhill minimization path) may not reach it in 50 ps. This is a different failure mode from the one UniFFBench documented, and it's specific to the relaxation workflow that most materials screening pipelines actually use.
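A pipeline that relaxes structures before property prediction can guard against this kind of collapse with a simple post-relaxation triage. The sketch below applies three checks to the results in the table above: space group change, step count, and total energy drop. The `RelaxResult` type, the `triage` helper, and both thresholds are hypothetical choices for illustration. The energy threshold is on total cell energy, so it would need rescaling per atom for cells of different sizes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RelaxResult:
    mineral: str
    sg_in: str
    sg_out: str
    steps: int
    delta_e_ev: float  # total energy change over the relaxation, in eV


# Values copied from the relaxation table above.
RESULTS = [
    RelaxResult("Calcite", "R-3c", "R-3c", 5, -0.24),
    RelaxResult("Quartz", "P3_221", "P1", 294, -31.33),
    RelaxResult("Galena", "Fm-3m", "Fm-3m", 2, -0.05),
    RelaxResult("Halite", "Fm-3m", "Fm-3m", 2, -0.01),
    RelaxResult("Fluorite", "Fm-3m", "Fm-3m", 2, -0.03),
    RelaxResult("Corundum", "R-3c", "R-3c", 6, -0.21),
]


def triage(result: RelaxResult, max_steps: int = 50,
           max_drop_ev: float = 1.0) -> List[str]:
    """Return reasons to hold a relaxed structure back for inspection.

    ASSUMPTION: max_steps and max_drop_ev are illustrative thresholds,
    not values from Orb, ASE, or UniFFBench.
    """
    reasons = []
    if result.sg_in != result.sg_out:
        reasons.append(f"space group changed: {result.sg_in} -> {result.sg_out}")
    if result.steps > max_steps:
        reasons.append(f"{result.steps} steps exceeds {max_steps}")
    if -result.delta_e_ev > max_drop_ev:
        reasons.append(f"energy dropped by {-result.delta_e_ev:.2f} eV")
    return reasons


for r in RESULTS:
    reasons = triage(r)
    print(f"{r.mineral}: {'HOLD, ' + '; '.join(reasons) if reasons else 'ok'}")
```

On this data only quartz is held back, and it trips all three checks at once. The other five relaxations finish in at most 6 steps with energy changes below 0.25 eV.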
Three takeaways for anyone using these models in practice:
1. ALIGNN hull energies are unreliable for anything beyond simple ionic structures. The 1.6-2.2 eV/atom hull overestimates for calcite, quartz, and corundum would immediately reject these compounds in any automated stability screening. If your pipeline uses ALIGNN hull energy as a stability filter, you are filtering out the most common minerals on Earth. Use Materials Project hull calculations as a cross-check, as we've recommended before.
2. Orb v3 symmetry collapse extends beyond magnetic intermetallics. We previously documented P1 collapse on C14 Laves phases, Cu₂Sb-type structures, and GPSK-generated magnets. The quartz collapse shows the same failure mode reaches common oxide minerals. The pattern in this set: high-symmetry cubic (Fm-3m) is safe; trigonal is mixed, with R-3c surviving and P3₂21 collapsing, so trigonal structures should be checked after relaxation.
3. UniFFBench's circularity thesis has real consequences. Their finding that computational benchmark performance doesn't predict experimental accuracy is not abstract. When ALIGNN can't recognize quartz as stable and Orb v3 can't relax it without destroying its symmetry, the gap between benchmark performance and real-world reliability becomes a concrete blocker for automated materials discovery.
All six experimental CIFs, relaxed structures, and route results are linked below. The quartz collapse is particularly worth inspecting — the 31.3 eV energy drop and 294-step trajectory tell you something is deeply wrong with the energy landscape.
We're sharing these results with the UniFFBench team (Krishnan group at IIT Delhi, Miret at Intel Labs). Their benchmark used Orb v2; our results suggest Orb v3 may have introduced new symmetry failure modes even as it improves other properties. If they add structural relaxation (not just MD) to the UniFFBench framework, the quartz collapse would be a natural test case.
CIFs and relaxed structures: