Open research towards the discovery of room-temperature superconductors.
MatterGen is a diffusion model built for materials discovery published by Microsoft, trained on materials datasets Alexandria, ICSD (licensed data so it isn't publicly released), and Materials Project. It builds on previous generative materials modeling work such as DiffCSP and CDVAE, and focuses on producing materials that are:
Stable ("We consider a structure to be stable if its energy per atom after relaxation via DFT is within 0.1 eV/atom above the convex hull defined by our Alex-MP-ICSD refer- ence dataset comprising 850,384 unique, ordered structures recomputed from the MP, Alexandria, and ICSD datasets.").
Unique ("We consider a structure to be unique if it does not match any other structure in a batch of samples generated by a given method, where uniqueness is computed among all samples in a batch with the same reduced chemical formula via our ordered-disordered structure matcher.").
Novel ("We consider a structure to be novel if it does not match any structure in an extended version of our Alex-MP-ICSD reference dataset, containing 117,652 disordered ICSD structures in addition to the 850,384 ordered structures used to compute the reference convex hull")
Unfortunately (but unsurprisingly), the Microsoft fellas clearly didn't build this on Macs, so getting it to run locally on a Mac is pretty much impossible without a decent overhaul of their pyg-lib-based implementation and a flexible extension to CPU or Apple Metal, since everything is CUDA-only.
That being said, the Lightning AI environments came in clutch and it was very easy to spin it up there.
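For context, the device fallback itself is the easy part; it's the dependency stack that assumes CUDA. A minimal sketch of the kind of CPU/Metal fallback the codebase would need (illustrative only, not MatterGen code):

import torch

# Pick the best available backend; note this alone wouldn't fix the pyg-lib install story on macOS.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():  # Apple Metal (MPS) backend
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Running on {device}")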
MatterGen produces .extxyz and .cif files that contain both the generated crystals and the denoising (diffusion inference step) trajectories, which can make for some cool visuals:
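The animations are easy to recreate from the trajectory files. A minimal sketch using ASE, where the trajectory path and output directory are placeholders rather than MatterGen's exact output names:

import os
from ase.io import read, write

# Read every denoising step from the trajectory (one Atoms object per diffusion step).
frames = read("generated_trajectories/gen_0.extxyz", index=":")

os.makedirs("frames", exist_ok=True)
for i, atoms in enumerate(frames):
    # Render each step to a PNG; these can then be stitched into a GIF or video.
    write(f"frames/step_{i:03d}.png", atoms, rotation="10x,20y")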
Using the chemical-system fine-tune (the inference input is a list of elements: Lithium, Cobalt, and Oxygen for the figure above), the generated structures often resolved to unit cell configurations that don't adhere to the common crystalline configurations (FCC, BCC, etc.).
Here is the final crystal structure from the above denoising animation:
While they do offer a symmetry fine-tune, there's plenty of room for improvement here in evaluating whether the generated structures are feasible to manufacture at scale.
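One cheap screen in that direction is simply asking pymatgen what symmetry a generated cell actually has. A minimal sketch, with a placeholder CIF path:

from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

structure = Structure.from_file("generated_crystals_cif/gen_0.cif")  # placeholder path
sga = SpacegroupAnalyzer(structure, symprec=0.1)  # loose tolerance for noisy diffusion outputs

print("Space group:", sga.get_space_group_symbol(), f"(#{sga.get_space_group_number()})")
print("Crystal system:", sga.get_crystal_system())
# A P1 / triclinic result hints the sample is far from a familiar high-symmetry lattice like FCC or BCC.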
On the evaluation front, MatterGen leans on an earlier Microsoft work, MatterSim, for high level validation of structure outputs:
{"avg_energy_above_hull_per_atom": {
"value": 0.10954873834960921,
"description": "Average energy above hull per atom (eV/atom) of structures in sampled data."
},
"avg_rmsd_from_relaxation": {
"value": 0.04274301398109797,
"description": "root mean square displacements of atoms (Angstrom) from initial to final DFT relaxation steps in sampled data."
},
"frac_novel_unique_stable_structures": {
"value": 0.0,
"description": "Fraction of novel unique stable structures in sampled data within 0.1 (eV/atom) above convex hull of MP2020correction."
},
"frac_stable_structures": {
"value": 0.0,
"description": "Fraction of stable structures in sampled data within 0.1 (eV/atom) above convex hull of MP2020correction."
},
"frac_successful_jobs": {
"value": 1.0,
"description": "Fraction of structures whose jobs ran successfully."
},
"avg_comp_validity": {
"value": 1.0,
"description": "Average composition validity (according to smact) of structures in sampled data."
},
"avg_structure_comp_validity": {
"value": 1.0,
"description": "Average number of structures in sampled data that are both valid structures and have a valid smact compositions."
},
"avg_structure_validity": {
"value": 1.0,
"description": "Average structural validity of structures in sampled data. Any atom-atom distances less than 0.5 Angstroms or a volume less than 0.1 Angstrom**3 are considered invalid ."
},
"frac_novel_structures": {
"value": 1.0,
"description": "Fraction of novel structures in sampled data."
},
"frac_novel_systems": {
"value": 0.0,
"description": "Fraction of distinct chemical systems in sampled data and not in MP2020correction."
},
"frac_novel_unique_structures": {
"value": 1.0,
"description": "Fraction of novel unique structures in sampled data."
},
"frac_unique_structures": {
"value": 1.0,
"description": "Fraction of unique structures in sampled data."
},
"frac_unique_systems": {
"value": 1.0,
"description": "Fraction of structures in sampled data that have a unique chemical system within this set."
},
"precision": {
"value": 0.0,
"description": "Precision of structures in sampled data compared with MP2020correction. This is the fraction of structures in sampled data that have a matching structure in MP2020correction."
},
"recall": {
"value": 0.0,
"description": "Recall of structures in sampled data compared with structures in MP2020correction. This is the fraction of structures in sampled data that have a matching structure in MP2020correction."
}
}
But clearly this evaluation harness falls short of our evaluation aspirations. I think this could be a good place for some development, as there is overlap between this potential evaluation pipeline and the commercialized version.
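As a starting point, the report is plain JSON, so layering extra checks on top is straightforward. A minimal sketch that reads the headline metrics back out, assuming the report above was saved as metrics.json (a placeholder filename):

import json

with open("metrics.json") as f:
    metrics = json.load(f)

# Print the few numbers that actually gate stability/novelty decisions.
for key in ("frac_stable_structures",
            "frac_novel_unique_stable_structures",
            "avg_energy_above_hull_per_atom"):
    entry = metrics[key]
    print(f"{key}: {entry['value']:.4f}  # {entry['description']}")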
For broader improvements, the authors themselves discuss limitations that they believe can be overcome by expanding the training dataset and even including non-scalar properties such as band structures or XRD spectra. A full-scale retraining is also not out of the realm of possibility, as "only" 8 A100 GPUs were used to train MatterGen over 1.74 million steps.
Let's see if scaling can hold true here too.
Oh and here are some completely new, never before seen Li-Co-O systems: