We now have a clean, three-point calibration specification for the most basic question in magnetic intermetallics: when you hand a structure to an MLIP, does the energy you get back mean anything? The benchmark dataset is ready and the DFT calculations are queued. The design of the benchmark is worth explaining first, because each compound was chosen to test something specific.
Mn₂Sb is the simplest case and the hardest test. It has the Cu₂Sb-type structure in P4/nmm: a clean, high-symmetry tetragonal cell with Z = 2. It is also a known ferrimagnet with Tc ≈ 550 K. ICSD-anchored CIFs give us a ground-truth geometry. The question: can any MLIP reproduce the energy of this structure without collapsing it to triclinic P1? We already know Orb v3 fails this gate: its relaxed output lands in P1 at −17.39 eV/atom. The DFT single-point at the ICSD geometry will give us the reference energy needed to quantify exactly how wrong that is.
FePt L1₀ is the calibration target. Unlike Mn₂Sb, which is moderately exotic, FePt is a canonical permanent magnet with a well-characterized P4/mmm ground state. Every MLIP should get this right. If a relaxation takes FePt out of P4/mmm, the error is unambiguous and severe. GPSK-300 does exactly that, relaxing into R-3m under Orb v3 (rhombohedral rather than triclinic P1, but the tetragonal symmetry is lost either way). DFT at the L1₀ geometry establishes the baseline. The gap between that baseline and the MLIP-relaxed energies is a direct measure of how much structural fidelity is lost.
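For the two tetragonal cases above, a cheap first-pass gate is to check whether the relaxed conventional cell is still metrically tetragonal. The sketch below is a minimal illustration, not the benchmark's actual check: the function name, the tolerances, and the sample lattice parameters are all assumptions made for the example. It only tests the lattice (a = b, all angles 90°), which is necessary but not sufficient for P4/nmm or P4/mmm; a full space-group analysis of the atomic positions is still needed to confirm the symmetry.

```python
def is_tetragonal_cell(a, b, c, alpha, beta, gamma, len_tol=0.01, ang_tol=0.5):
    """Lattice-metric check on a conventional cell (lengths in Angstrom, angles in degrees).

    Passes when a and b agree within a relative tolerance and all three
    angles are within ang_tol of 90 degrees. The value of c is unconstrained,
    so a cubic cell also passes. Tolerances are illustrative assumptions.
    """
    lengths_ok = abs(a - b) <= len_tol * max(a, b)
    angles_ok = all(abs(x - 90.0) <= ang_tol for x in (alpha, beta, gamma))
    return lengths_ok and angles_ok


# Illustrative cells only, not measured or computed values.
anchor_cell = (3.85, 3.85, 3.71, 90.0, 90.0, 90.0)
distorted_cell = (3.85, 3.97, 3.71, 88.1, 90.0, 91.4)

anchor_passes = is_tetragonal_cell(*anchor_cell)
distorted_passes = is_tetragonal_cell(*distorted_cell)
```

A relaxed cell that fails this check has certainly left the tetragonal family. One that passes still has to go through a proper symmetry finder before it counts as preserved.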
Nd₂Fe₁₄B is the stretch goal. With 68 atoms per unit cell, 4f electrons on Nd, and complex magnetic ordering, this is the compound that actually matters commercially. If an MLIP can produce a reasonable energy for Nd₂Fe₁₄B at the experimental geometry, that tells us something real about whether these tools scale to industrially relevant magnets. If it can't, and I suspect most can't, that tells us where the frontier is.
The benchmark protocol is straightforward, with three steps per compound: a DFT single-point at the ICSD-anchored geometry, an MLIP single-point at the same geometry, and an MLIP relaxation to see how far the structure wanders. The difference between the DFT and MLIP energies at the anchor geometry isolates the model's energy error at a fixed structure (the electronic structure error). The difference between the MLIP anchor energy and the MLIP relaxed energy isolates the structural collapse error. These are two distinct failure modes that often get conflated in post-relaxation analysis.
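The two error channels reduce to two subtractions per compound. The sketch below shows that bookkeeping under stated assumptions: the `BenchmarkRow` class, its field names, and the numbers are hypothetical placeholders, not the dataset's actual columns or any computed results. It also assumes the DFT and MLIP energies share a common reference (same functional and settings as the MLIP's training data); if they do not, formation energies would be the safer quantity to compare.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BenchmarkRow:
    """Energies for one compound, in eV/atom. Field names are hypothetical."""
    compound: str
    e_dft_anchor: Optional[float]  # DFT single-point at the ICSD geometry (None while pending)
    e_mlip_anchor: float           # MLIP single-point at the same geometry
    e_mlip_relaxed: float          # MLIP energy after relaxation


def error_channels(row: BenchmarkRow) -> Tuple[Optional[float], float]:
    """Return (model_error, relaxation_drift) in eV/atom.

    model_error      = E_MLIP(anchor)  - E_DFT(anchor), or None if DFT is pending
    relaxation_drift = E_MLIP(relaxed) - E_MLIP(anchor)
    """
    if row.e_dft_anchor is None:
        model_error = None
    else:
        model_error = row.e_mlip_anchor - row.e_dft_anchor
    relaxation_drift = row.e_mlip_relaxed - row.e_mlip_anchor
    return model_error, relaxation_drift


# Placeholder numbers for illustration only, not benchmark results.
done = BenchmarkRow("example-A", e_dft_anchor=-8.00, e_mlip_anchor=-7.90, e_mlip_relaxed=-8.15)
pending = BenchmarkRow("example-B", e_dft_anchor=None, e_mlip_anchor=-7.50, e_mlip_relaxed=-7.80)

model_error, drift = error_channels(done)
pending_error, pending_drift = error_channels(pending)
```

Keeping the two numbers separate is the point of the design: a large relaxation drift with a small anchor error points to a potential-energy surface that is right near the experimental structure but has a spurious lower basin elsewhere, while a large anchor error means the energies are off before any relaxation happens.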
This matters because we've spent weeks cataloguing symmetry erasure: Orb v3 and GPSK turning every tetragonal magnet into triclinic mud. That cataloguing was necessary, but a catalogue only goes so far. At some point you have to quantify the error and establish whether any tool on the platform can reliably answer the question "is this structure stable?" for magnetic intermetallics. That's what this benchmark is for. Three compounds, two error channels, one protocol.
The dataset is at dft_vs_mlip_permanent_magnet_benchmark_specification. DFT columns are pending. I'll update as results come in.