Open research towards the discovery of room-temperature superconductors.
While reading the MatterSim paper, I came across the authors' idea of using the MLFF's latent space as a direct property-prediction feature set. Earlier, I had been thinking about using a VAE (or some graph variant) to embed a material into a continuous high-dimensional space which we could use as a feature vector to predict properties directly.
While on the surface these sound like very similar things, the VAE's latent space is built around the ability to reconstruct the original material. That may be useful for comparing one material to another, but when it comes to prediction tasks this approach is likely to fall short.
Using an MLFF's latent space is different because that latent space has already been shaped to predict downstream quantities: energies, forces, and stresses (the usual outputs of DFT).
There are infinite possible latent representations a material could have; and some of these latent representations are going to be better than others.
So let's get into it.
The approach to this experiment is simple:
Use the latent features generated by the Orb model as a feature vector to train another model on a Tc prediction task (regression).
Using crystal structures and Tc values from the 3DSC dataset, we have all the input and target data we need. There are caveats to the dataset, as explained in the 3DSC paper, but it's the best open dataset I've come across so far:
The Orb model takes in a crystal structure and outputs energies, forces, and stresses at the atom and cell level, matching comparable DFT outputs. The way the authors designed the model, each of these prediction heads is independent of the base model, so the base model gives us a representation that can be used for other tasks. See the paper for more details on model architecture:
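As a rough sketch of what the feature-extraction step looks like (the exact calls depend on the orb-models version you have installed; `extract_base_embedding` below is a hypothetical helper standing in for running a structure through Orb's base GNN and pooling the per-atom embeddings):

```python
import numpy as np
from pymatgen.core import Structure
from pymatgen.io.ase import AseAtomsAdaptor


def extract_base_embedding(atoms, orb_base_model) -> np.ndarray:
    """Hypothetical helper: run an ASE Atoms object through Orb's base GNN
    and mean-pool the per-atom embeddings into one fixed-length vector.
    The real call depends on the orb-models API."""
    node_features = orb_base_model(atoms)   # assumed shape: (n_atoms, 256)
    return node_features.mean(axis=0)       # crystal-level 256-dim feature vector


# Load a crystal structure (e.g. a CIF from 3DSC) and embed it.
structure = Structure.from_file("some_material.cif")  # placeholder path
atoms = AseAtomsAdaptor.get_atoms(structure)
# latent = extract_base_embedding(atoms, orb_base_model)  # -> shape (256,)
```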
Before we get into the modeling, I want to explore the latent space we're working with.
With UMAP, I've been testing out its supervised mode, where you also pass in the target variable and it attempts to make closeness in the projected dimensions track closeness in the target variable. As you can see in the plot on the left, this was fairly successful, but it could also be some form of overfitting.
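For reference, supervised UMAP just means passing the target into `fit_transform`; something like this (assuming `X` is the (n_samples, 256) latent matrix and `tc` is the array of critical temperatures):

```python
import umap

# Supervised UMAP: passing y nudges the projection so that points with
# similar Tc end up close together in the 2D embedding.
# target_metric="l2" treats Tc as a continuous target rather than a class label.
reducer = umap.UMAP(n_components=2, target_metric="l2", random_state=42)
embedding = reducer.fit_transform(X, y=tc)   # embedding: (n_samples, 2)
```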
The next step to explore would be inverse transformations. We know where the high-Tc superconducting region sits in the 2D plot, so we can sample points there and inverse transform them back into the original 256-dimensional vectors, which we could then use to predict Tc with our downstream model, or somehow reverse through the Orb model to get a crystal structure (is this possible, or do we need to train a decoder off of Orb to map back into crystal structure space?).
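A sketch of what that sampling step could look like, reusing the `reducer` from the snippet above and assuming `tc_model` is the downstream Tc regressor (the coordinates are placeholders you'd read off your own plot):

```python
import numpy as np

# Sample a few 2D points from the region where high-Tc materials cluster.
high_tc_region = np.array([[8.0, 3.5], [8.2, 3.7], [7.9, 3.3]])

# Map the 2D points back into the 256-dim Orb latent space...
latent_candidates = reducer.inverse_transform(high_tc_region)

# ...and score them with the downstream Tc model.
predicted_tc = tc_model.predict(latent_candidates)
```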
I started this experiment by trying the MLP setup built into the Orb model, essentially the same approach as the existing prediction heads, but I found that it did not train well for this task. With such a simple input feature set (256 continuous variables), we can really use any model. So why not throw XGBoost at it!
XGBoost was actually giving me some issues (I suspect an Apple silicon problem), so I switched to CatBoost instead.
It trains in seconds now instead of hours. And the performance is awesome. I'll have a more robust evaluation coming soon.
The basic CatBoost config is as follows:
{
"iterations": 1000,
"learning_rate": 0.1,
"eval_metric": "RMSE",
"early_stopping_rounds": 50,
}
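Wiring that up is only a few lines (a sketch, assuming `X_train`/`y_train` and `X_val`/`y_val` hold the latent vectors and Tc targets):

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    eval_metric="RMSE",
    early_stopping_rounds=50,
)

# The eval set drives early stopping; there are no categorical features,
# just the 256 continuous latent dimensions.
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=100)
```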
Train and validation sizes are as follows:
Train size: 4618 samples
Val size: 1155 samples
It's a pretty small dataset. SuperCon is already pretty small, and the 3DSC methodology cuts this down even further in an attempt to match chemical formula to crystal structure.
Training is straightforward, and so is prediction. You pass the crystal you want to predict Tc for through the first part of Orb (to get the latent vector), then pass that vector into the CatBoost model to get your final prediction.
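That two-step inference path looks something like this (a sketch reusing the hypothetical `extract_base_embedding` helper from earlier, with `tc_model` being the trained CatBoost regressor):

```python
from pymatgen.core import Structure
from pymatgen.io.ase import AseAtomsAdaptor


def predict_tc(cif_path, orb_base_model, tc_model) -> float:
    """Crystal structure -> Orb latent vector -> CatBoost Tc prediction."""
    structure = Structure.from_file(cif_path)
    atoms = AseAtomsAdaptor.get_atoms(structure)
    latent = extract_base_embedding(atoms, orb_base_model)   # hypothetical helper from earlier
    return float(tc_model.predict(latent.reshape(1, -1))[0])
```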
There's some predictive power, no doubt about it! Judging by the metrics on the eval set, the model is definitely picking up signal in the feature set that leads to Tc prediction.
A few things to note:
There are a number of materials with a true Tc of 0 K that receive a higher predicted Tc. This could be bad modeling, but it could also be a dataset issue. The authors of 3DSC set "non-superconducting" materials to 0 K arbitrarily, so lab errors or missing reported measurements could mean some of these materials actually do have a non-zero Tc, as our model predicts.
Obviously this is a very imbalanced and non-normal target distribution. There are only a few materials with Tc over 80 K, but it's good to see the model still does well on these samples. Remember that while this validation set was used for early stopping, the model did not explicitly train on any of these materials.
One of the most important features of a successful Tc prediction model for our use case is going to be out-of-distribution target prediction. There are no room-temp, ambient pressure superconductors that we know of, so there are no samples we can use to train our model on in that area of the target distribution.
In this experiment, I artificially held out the highest-Tc samples in the dataset, trained the model, then evaluated on those samples to see if we could get a prediction higher than anything seen in the training data (the split is sketched after the numbers below).
Split dataset at 80 K (the max temperature seen by the model is 80 K)
Train size: 5619 samples
Val size: 154 samples
Split dataset at 100 K (the max temperature seen by the model is 100 K)
Train size: 5750 samples
Val size: 23 samples
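The split itself is just a threshold on the target (a sketch, assuming `X` is the latent feature matrix and `tc` the array of critical temperatures):

```python
import numpy as np


def ood_split(X, y, threshold_k):
    """Train on everything at or below the threshold; hold out the rest
    as an out-of-distribution evaluation set."""
    train_mask = y <= threshold_k
    return X[train_mask], y[train_mask], X[~train_mask], y[~train_mask]


# e.g. the 80 K experiment: the model never sees a Tc above 80 K in training.
X_train, y_train, X_ood, y_ood = ood_split(X, tc, threshold_k=80.0)
```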
As we've seen with these two examples, the model struggles to predict a Tc greater than anything it's seen in the training data. This is problematic! But it's potentially something we can still address through how we train the model, the parameters we set, and the choice of model.
Getting tree-based models to extrapolate beyond their training range is notoriously difficult since they essentially work by partitioning the input space and making local predictions.
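You can see this with a toy example: fit a tree ensemble on y = x over a limited range and the predictions flatten out at the edge of the training data (a quick illustration, using CatBoost for consistency):

```python
import numpy as np
from catboost import CatBoostRegressor

# Train on a simple linear relationship, but only over x in [0, 80].
x_train = np.linspace(0, 80, 500).reshape(-1, 1)
y_train = x_train.ravel()

model = CatBoostRegressor(iterations=200, learning_rate=0.1, verbose=False)
model.fit(x_train, y_train)

# Ask for predictions well outside the training range: a tree-based model
# can only return values stored in its leaves, so the predictions plateau
# around 80 instead of continuing upward.
print(model.predict(np.array([[90.0], [120.0], [150.0]])))
```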
After more experimentation, the best I've been able to get so far is 5-10 K outside of the training distribution with an MLP. Even then, it was mostly an anomaly, and the rest of the tested OOD materials did not predict well.