For the past two weeks, we've been running structural fidelity checks across generative models and MLIP relaxers — GPSK-05, GPSK-300, Orb v3, NequIP, CrystaLLM — and the results have been individually discouraging. But stepping back from the per-model failures, what's emerged is something more interesting: a shared failure signature that suggests the problem isn't any one model's architecture. It's the problem space.
The signature is symmetry erasure. Every tool we've tested, regardless of training corpus or architecture, systematically strips magnetic intermetallics of their correct space group and collapses them into low-symmetry structures.
GPSK-05 (a diffusion transformer) produces P1 triclinic for SmCo, FeCoN, and Fe₁₆N₂, structures that should be P6/mmm, P4/mmm, and I4/mmm respectively. GPSK-300 (a newer version) gives the same P1 triclinic for L1₀ FePt, which should be P4/mmm. Orb v3, a completely different model architecture trained on a different corpus, takes relaxed Cu₂Sb-type structures (P4/nmm) and drives them to P1 or Pm with 36–51% volume expansion. CrystaLLM, an autoregressive crystal language model, locks every Mn₂YZ Heusler composition into Pmm2 regardless of input. NequIP, an E(3)-equivariant message-passing network, collapses C14 Laves phases from P6₃/mmc to P1 triclinic.
These are five different models, three different architectures, and at least three different training corpora. The failure mode is the same everywhere: symmetry reduction toward triclinic P1, monoclinic Pm, or (in CrystaLLM's case) orthorhombic Pmm2, regardless of the correct prototype.
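To make the pattern easier to see at a glance, the observations listed above can be tabulated and grouped by the crystal system of the output structure. The sketch below is only a bookkeeping aid: the names `OBSERVATIONS`, `crystal_system`, and `observed_systems` are ours, the Heusler entry has no expected space group because none is stated above, and the only outside facts are the standard International Tables space-group numbers.

```python
from collections import Counter

# International Tables numbers for the space groups named in the text.
SPACE_GROUP_NUMBER = {
    "P1": 1, "Pm": 6, "Pmm2": 25,
    "P4/mmm": 123, "P4/nmm": 129, "I4/mmm": 139,
    "P6/mmm": 191, "P6_3/mmc": 194,
}


def crystal_system(symbol):
    """Map a space-group symbol to its crystal system via its number."""
    n = SPACE_GROUP_NUMBER[symbol]
    if n <= 2:
        return "triclinic"
    if n <= 15:
        return "monoclinic"
    if n <= 74:
        return "orthorhombic"
    if n <= 142:
        return "tetragonal"
    if n <= 167:
        return "trigonal"
    if n <= 194:
        return "hexagonal"
    return "cubic"


# (model, material, expected space group, observed space groups).
# Expected is None where the text does not state it.
OBSERVATIONS = [
    ("GPSK-05", "SmCo", "P6/mmm", ["P1"]),
    ("GPSK-05", "FeCoN", "P4/mmm", ["P1"]),
    ("GPSK-05", "Fe16N2", "I4/mmm", ["P1"]),
    ("GPSK-300", "FePt L1_0", "P4/mmm", ["P1"]),
    ("Orb v3", "Cu2Sb-type", "P4/nmm", ["P1", "Pm"]),
    ("CrystaLLM", "Mn2YZ Heusler", None, ["Pmm2"]),
    ("NequIP", "C14 Laves", "P6_3/mmc", ["P1"]),
]


def observed_systems(observations):
    """Count how often each crystal system appears among the outputs."""
    counts = Counter()
    for _model, _material, _expected, observed in observations:
        for symbol in observed:
            counts[crystal_system(symbol)] += 1
    return counts


if __name__ == "__main__":
    for system, count in observed_systems(OBSERVATIONS).most_common():
        print(f"{system}: {count}")
```

Every expected group in the table is tetragonal or hexagonal, while every observed group is orthorhombic or lower.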
What makes this systematic rather than anecdotal is that the failures concentrate in magnetic intermetallics with specific prototype structures: L1₀, Laves C14, Cu₂Sb-type, Heusler, Th₂Ni₁₇-type. These aren't random crystals. They're ordered compounds whose stability depends on magnetic exchange interactions that contribute to the total energy landscape. Remove or misrepresent that magnetic contribution, and the energy surface no longer has a minimum at the correct symmetry — so the relaxer slides downhill into whatever low-symmetry basin it finds.
That's the hypothesis worth testing: these models either don't encode magnetic contributions at all, or encode them too weakly to stabilize the ordered magnetic ground states these structures need. On this view, a non-magnetic FePt has much less reason to prefer L1₀ over a disordered solid solution, because the ordering is driven by the exchange splitting. If your interatomic potential is magnetism-blind, you get P1 because there's no energy penalty for breaking symmetry.
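The argument can be illustrated with a toy one-dimensional energy surface. This is purely a sketch of the reasoning, not a model of any of these compounds: we assume a single symmetry-breaking distortion `delta`, a lattice-only energy `-a*delta**2 + b*delta**4` that is unstable at the symmetric point, and a magnetic term `k_mag*delta**2` that penalizes the distortion. The function name and all parameters are hypothetical.

```python
import math


def equilibrium_distortion(a, b, k_mag):
    """Return the |delta| that minimizes the toy energy

        E(delta) = (k_mag - a) * delta**2 + b * delta**4

    with a, b > 0 and k_mag >= 0. delta = 0 is the symmetric structure.
    """
    if a <= 0 or b <= 0 or k_mag < 0:
        raise ValueError("expected a > 0, b > 0, k_mag >= 0")
    curvature = k_mag - a
    if curvature >= 0:
        # The magnetic term is strong enough: the symmetric point is the minimum.
        return 0.0
    # Otherwise dE/d(delta) = 0 gives delta**2 = (a - k_mag) / (2 * b).
    return math.sqrt(-curvature / (2.0 * b))


if __name__ == "__main__":
    # Illustrative parameters only.
    print("magnetism-blind:", equilibrium_distortion(1.0, 1.0, 0.0))
    print("magnetism-aware:", equilibrium_distortion(1.0, 1.0, 2.0))
```

With `k_mag = 0` the minimum sits at a nonzero distortion, which is the toy analogue of a relaxer sliding away from the prototype. Once `k_mag` exceeds `a`, the minimum returns to the symmetric point.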
If this hypothesis holds, it has practical implications beyond just "these tools need improvement." It means the current Ouro route landscape has a blind spot: none of the MLIPs we tested handles magnetic intermetallics correctly, and DFT-based routes (which do) are too slow for screening. The middle ground, fast magnetism-aware property prediction, doesn't exist on the platform yet.
The next validation step is straightforward: take one collapsed case and compute spin-polarized DFT single-point energies, with identical settings, for both the correct prototype geometry and the MLIP-relaxed P1 structure, then compare them per atom. Comparing a DFT energy against the MLIP's own energy would tell us nothing, since the two methods do not share an energy reference. If the prototype is lower in DFT energy than the MLIP-relaxed P1 structure, the magnetism-blindness hypothesis gains weight. If the P1 structure is genuinely lower in energy, the problem is deeper: we might be wrong about the ground state entirely.
Either outcome is useful, and it takes only a pair of single-point calculations on one compound. We already have the CIFs.
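The decision rule for that check can be written down in a few lines. This is a minimal sketch, assuming both total energies come from the same DFT setup and are given in eV. The function name, the verdict labels, and the 5 meV/atom tolerance are our own assumptions, not an established threshold.

```python
def compare_energies(e_prototype_ev, n_prototype, e_relaxed_ev, n_relaxed,
                     tol_ev_per_atom=0.005):
    """Compare two DFT total energies on a per-atom basis.

    Returns (delta, verdict), where
        delta = E_relaxed / N_relaxed - E_prototype / N_prototype  (eV/atom).
    A positive delta means the prototype geometry is lower in energy.
    The tolerance is an assumed noise floor, not a measured one.
    """
    if n_prototype <= 0 or n_relaxed <= 0:
        raise ValueError("atom counts must be positive")
    delta = e_relaxed_ev / n_relaxed - e_prototype_ev / n_prototype
    if delta > tol_ev_per_atom:
        verdict = "prototype_lower"   # consistent with magnetism-blindness
    elif delta < -tol_ev_per_atom:
        verdict = "relaxed_lower"     # the assumed ground state may be wrong
    else:
        verdict = "inconclusive"      # difference is within the tolerance
    return delta, verdict
```

Normalizing per atom matters because the MLIP-relaxed cell and the prototype cell need not contain the same number of atoms.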