Open research towards the discovery of room-temperature superconductors.
In the MatterSim paper, the authors propose using the MLFF's latent space as a direct property-prediction feature set. Before reading it, I had been thinking about using a VAE (or some graph variant) to embed a material into a continuous high-dimensional space, which we could use as a feature vector to predict properties directly.
While on the surface these sound like very similar things, the VAE's latent space has the property of being able to reconstruct the original material. That may be useful for comparing material to material, but for prediction tasks this approach is likely to fall short.
Using an MLFF's latent space is different because that latent space was produced to predict downstream properties already: energies, forces, and stresses (the usual outputs of DFT).
There are infinitely many possible latent representations a material could have, and some of them are going to be better than others for a given prediction task.
So let's get into it.
The approach to this experiment is simple:
Use the latent features generated by the Orb model as a feature vector to train another model on a Tc prediction task (regression).
Using crystal structure and Tc from the 3DSC dataset, we have all the input and target data we need. There are caveats to the dataset as explained in the 3DSC paper, but it's the best open dataset I've come across so far:
Data-driven methods, in particular machine learning, can help to speed up the discovery of new materials by finding hidden patterns in existing data and using them to identify promising candidate materials. In the case of superconductors, the use of data science tools is to date slowed down by a lack of accessible data. In this work, we present a new and publicly available superconductivity dataset (‘3DSC’), featuring the critical temperature Tc of superconducting materials additionally to tested non-superconductors.
The Orb model takes in a crystal structure and outputs energies, forces, and stresses at the atom and cell level, matching comparable DFT outputs. The way the authors designed the model, each of these prediction heads is independent from the base model, so we have a representation in the base model that can be used for other tasks. See the paper for more details on the model architecture:
Authors introduce Orb, a family of universal interatomic potentials for atomistic modeling of materials. Orb models are 3-6 times faster than existing universal potentials, stable under simulation for a range of out of distribution materials and, upon release, represented a 31% reduction in error over other methods on the Matbench Discovery benchmark. https://arxiv.org/abs/2410.22570
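To make the pipeline concrete, here's a minimal sketch of the feature-extraction step. The mean-pooling of per-atom embeddings into one 256-dim material vector is my assumption about how to collapse the representation, and the node_embeddings hook is a hypothetical stand-in for however orb-models actually exposes the base model's activations; check the repo for the real API.

import numpy as np

def material_feature(atoms, orb_base):
    """Mean-pool Orb's per-atom latent embeddings into one material vector."""
    # Hypothetical call: stands in for running the Orb base model (without
    # the prediction heads) and grabbing its per-node activations.
    per_atom = orb_base.node_embeddings(atoms)  # shape (n_atoms, 256), assumed
    return np.asarray(per_atom).mean(axis=0)    # shape (256,)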
Before we get into the modeling, I want to explore the latent space we're working with.
Plot produced by taking the features generated by Orb (256-dim output), visualizing different dimensionality reduction methods on them, and coloring the points by Tc from the 3DSC database.
In the UMAP model, I've been testing out its supervised mode, where you also pass in the target variable and it attempts to make closeness in the projected space reflect closeness in the target variable. As you can see in the plot on the left, this was fairly successful, though it could also be some form of overfitting.
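For reference, supervised UMAP is just a matter of passing the target into fit. A minimal sketch, assuming the Orb features are stacked in a matrix X with matching Tc values y:

import umap

# Supervised UMAP: passing y encourages closeness in the 2D projection
# to track closeness in Tc.
reducer = umap.UMAP(n_components=2, random_state=42)
embedding_2d = reducer.fit_transform(X, y)  # X: (n, 256) Orb features; y: Tc in K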
The next step to explore would be inverse transformations. We know where the high-Tc superconducting region sits in the 2D plot, so we can sample points there and inverse-transform them back into the original 256-dimension vectors, which we can then use to predict Tc in our downstream model, or somehow reverse through the Orb model to get a crystal structure (is this possible, or do we need to train a decoder off of Orb to map back to crystal structure space?).
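The 2D-to-256-dim leg is already supported: umap-learn exposes an inverse_transform on a fitted reducer. A sketch (the sampled coordinates are placeholders, and tc_model is the downstream Tc regressor trained later in this post):

import numpy as np

# Sample points from the region of the 2D plot where high-Tc materials
# cluster (coordinates here are made up for illustration).
candidate_points_2d = np.array([[5.0, -2.0], [5.2, -1.8]])

# Map back into the original 256-dim Orb feature space...
candidate_features = reducer.inverse_transform(candidate_points_2d)

# ...and score the reconstructed feature vectors with the downstream model.
predicted_tc = tc_model.predict(candidate_features)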
I started this experiment by working with the MLP setup built into the Orb model, essentially the same approach as the existing prediction heads, but I found that these did not train well for this task. With such a simple input feature set (256 continuous variables), we can use just about any model. So why not throw XGBoost at it!
XGBoost was actually giving me some issues (I suspect an Apple silicon problem), so I switched to CatBoost instead.
It trains in seconds now instead of hours. And the performance is awesome. I'll have a more robust evaluation coming soon.
The basic CatBoost config is as follows:
{
"iterations": 1000,
"learning_rate": 0.1,
"eval_metric": "RMSE",
"early_stopping_rounds": 50,
}
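Wiring that config into a training run looks roughly like this; a sketch assuming the Orb features and Tc targets are already split into train/validation arrays:

from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    eval_metric="RMSE",
    early_stopping_rounds=50,
)

# X_*: (n, 256) Orb latent features; y_*: Tc in K.
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=100)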
Train and validation sizes are as follows:
Train size: 4618 samples
Val size: 1155 samples
It's a small dataset. SuperCon is already limited, and the 3DSC methodology cuts it down even further in the attempt to match each chemical formula to a crystal structure.
Training is straightforward, and so is prediction: pass the crystal you want to predict Tc for through the first parts of Orb (to get the latent vector), then pass that vector into the CatBoost model to get your final prediction.
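End to end, that's a two-step pipeline. A sketch, reusing the hypothetical material_feature helper from earlier:

def predict_tc(atoms, orb_base, tc_model):
    """Crystal structure -> Orb latent vector -> CatBoost Tc prediction."""
    features = material_feature(atoms, orb_base)        # (256,) latent vector
    return tc_model.predict(features.reshape(1, -1))[0]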
There's some predictive power, no doubt about it! On the eval set, the model is definitely picking up signal in the feature set that leads to Tc prediction.
Evaluating the trained CatBoost model on the validation set (1155 samples) to predict Tc.
A few things to note:
There are a number of materials with a true Tc of 0 K that have higher predicted Tc. This could be bad modeling, but it could also be a dataset problem. The 3DSC authors set "non-superconducting" materials to 0 K arbitrarily, so lab errors or missing reported measurements could mean some of these materials actually do have the non-zero Tc our model predicts.
Obviously this is a very imbalanced and non-normal target distribution. There are only a few materials with Tc over 80 K, but it's good to see the model still does well on these samples. Remember that while this validation set was used for early stopping, the model did not explicitly train on any of these materials.
One of the most important features of a successful Tc prediction model for our use case is going to be out-of-distribution target prediction. There are no room-temp, ambient pressure superconductors that we know of, so there are no samples we can use to train our model on in that area of the target distribution.
In this experiment, I artificially held out the highest-Tc samples in the dataset, trained the model, then evaluated on those samples to see if we could get a prediction higher than anything seen in the training data.
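The holdout itself is just a threshold split on the target. A sketch (3DSC loading and feature extraction omitted):

import numpy as np

def split_by_tc(X, y, threshold_k):
    """Train on everything below the Tc threshold; hold out the rest."""
    in_dist = y < threshold_k
    return X[in_dist], y[in_dist], X[~in_dist], y[~in_dist]

# The 80 K experiment below:
X_train, y_train, X_holdout, y_holdout = split_by_tc(X, y, threshold_k=80.0)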
Split dataset at 80 K (the max temperature seen by the model is 80 K)
Train size: 5619 samples
Val size: 154 samples
Cutting the dataset to samples with Tc below 80 K, we find that the model is unable to make any predictions greater than 80 K. Not very surprising, but a critical failure for what we need this model to do.
Split dataset at 100 K (the max temperature seen by the model is 100 K)
Train size: 5750 samples
Val size: 23 samples
Cutting the dataset to train the model only on samples below 100 K, we test its ability to predict on materials with true Tc greater than 100 K, and the results are not good.
As we've seen in these two experiments, the model struggles to predict a Tc greater than anything it has seen in the training data. This is problematic! But it's potentially still something we can address through how we train the model, the parameters we set, and the choice of model.
Getting tree-based models to extrapolate beyond their training range is notoriously difficult since they essentially work by partitioning the input space and making local predictions.
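A toy example makes the failure mode obvious: a gradient-boosted tree trained on y = x over [0, 10] can never predict much above 10, because every leaf outputs (roughly) an average of the training targets that fell into it.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = X_train.ravel()            # y = x, so the max target is ~10

tree_model = GradientBoostingRegressor().fit(X_train, y_train)
print(tree_model.predict([[20.0]]))  # prints ~10, not 20: capped at the training max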
After more experimentation, the best I've been able to get so far is 5-10 K outside of the training distribution with an MLP. Even then, it was mostly an anomaly; the rest of the tested OOD materials did not predict well.