Open research towards the discovery of room-temperature superconductors.
So far a really interesting paper. Published in 2018. Adding some informal notes and interesting findings here. Finding out how much literature is based on this study.
This post will focus on the methods available to predict or derive the Tc of a material. We want to build a pipeline where we can go beyond the available (and experimental) Tc data and train a model that generalizes past it.
Some notes as I read:
After reading the MatterSim paper, in which the authors proposed using the MLFF's latent space as a direct property-prediction feature set, I was reminded that earlier I had been thinking about using a VAE (or similar) for the same purpose.
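To make that concrete, here's a minimal sketch of the pooling step, assuming we've already extracted per-atom latent vectors from the MLFF (the extraction call itself is model-specific and omitted; the toy array below just stands in for real embeddings):

```python
import numpy as np

def pool_latent_features(node_embeddings):
    """Collapse per-atom latent vectors of shape (n_atoms, d) into one
    fixed-length material descriptor by mean pooling, for use as the
    input to a downstream Tc classifier."""
    return node_embeddings.mean(axis=0)

# Toy stand-in for MLFF per-atom embeddings: 4 atoms, 8-dim latent.
emb = np.arange(32, dtype=float).reshape(4, 8)
feat = pool_latent_features(emb)
```

The pooled vector is what the Tc classifier would actually consume; mean pooling keeps the descriptor length independent of the number of atoms.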
Careful evaluation of the classifier model is important so that we can truly understand the capabilities and performance of a Tc predicting model.
Particularly important to us is the ability of the model to actually grok the underlying causes of superconductivity. Not an easy task: our best science doesn't have a fully formed formula for how superconductivity arises, let alone a good Tc-estimation method, so we are using machine learning to attempt to approximate whatever that function is.
In earlier attempts, it became clear that the model was doing some kind of material class recognition. While it's a useful shortcut, it is completely naive to why a material may or may not be superconducting.
Unfortunately, this is going to be a common theme, but this time our model has a better chance of recognizing the causes of superconductivity. Learn more about the model here:
Using what we learned when trying to use the MLFF's latent space for Tc prediction, there's a way we can simplify things for the prediction model and give it a better chance of picking up on the signal.
It should be noted that many of these evaluations are partial (preferring to test on higher-Tc materials). Predictions take around a minute each on current hardware (an NVIDIA T4), which adds up when you have thousands of samples to evaluate. For all of these evaluations we are looking at predicted Tc, not just an instantaneous superconducting-or-not classification at a single temperature. That requires MD temperature-ramping simulations, which are the costly part.
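A rough sketch of why the ramp is the expensive part: the Tc estimate comes from classifying at every temperature step of an MD ramp and finding where the prediction flips. The helper below is a simplified stand-in (the real pipeline runs an MD simulation at each step):

```python
import numpy as np

def estimate_tc(temps, sc_probs, threshold=0.5):
    """Walk the ramp from cold to hot and return the first temperature
    where the superconducting probability drops below the threshold
    (our Tc estimate), or None if it never does."""
    for t, p in zip(temps, sc_probs):
        if p < threshold:
            return float(t)
    return None

temps = np.linspace(0, 120, 13)             # 0-120 K in 10 K steps
probs = [0.95] * 9 + [0.3, 0.2, 0.1, 0.05]  # toy classifier output, flips near 90 K
```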
Starting off with a basic evaluation, we look at a holdout set. Prior to train-test split, the materials were shuffled, ideally leading to an even split of material families in the training and test sets. We'll look at that more closely later.
As expected, the model has not trained on any of the materials in the test set.
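The shuffled split can be sketched as follows (a seeded shuffle before cutting, so material families scatter across both sets):

```python
import random

def shuffled_split(materials, test_frac=0.2, seed=42):
    """Shuffle with a fixed seed before cutting, so material families
    are (ideally) spread evenly across train and test."""
    items = list(materials)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_frac))
    return items[:cut], items[cut:]

train_set, test_set = shuffled_split(range(10))
```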
Interactive plot of predicted vs. true Tc on the evaluation set.
At a high level, there are some really exciting results and some troublesome ones. Sometimes we hit big and predict close on a material with Tc > 100 K. Other times we predict non-superconducting on a material that is actually high-Tc and take a large error. We'll explore potential causes for these big misses, but it is good to see that even high-Tc materials can be predicted pretty well. The reason this is significant, beyond the usefulness for our use case, is that the model is actually predicting at every temperature step (from 0 K to 100+) and correctly predicts superconductivity at all of those points up to the transition temperature, at which it switches to non-SC.
The point is that the feature vectors are very similar from temperature to temperature. But somewhere in there we hope there is a "phase change" in the vector that signals the transition point, be it in the atomic structure, electronic potential, or elsewhere. That would be potential evidence that we're learning the drivers of superconductivity from the feature-vector encoding. Also note that the simulation temperature is not explicitly included as a feature.
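One way to look for that "phase change" is to track how much the feature vector drifts from step to step; a minimal sketch using cosine similarity between consecutive vectors (the toy vectors below are illustrative, not real features):

```python
import numpy as np

def step_similarities(features):
    """Cosine similarity between consecutive per-temperature feature
    vectors; a dip would flag a candidate 'phase change' in the encoding."""
    sims = []
    for a, b in zip(features[:-1], features[1:]):
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sims

# Toy vectors: identical, then an abrupt change, then identical again.
v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
sims = step_similarities([v1, v1, v2, v2])
```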
From the parity plot, we see one range that really suffers: consistent under-prediction of Tc there by about 20 K. Conversely, in another range we're consistently over-predicting, though not as badly.
Interactive plot below. Enter fullscreen for a better experience. The legend has the evaluation set you can activate to compare train and eval together.
Visualizing the counts of materials in the training and evaluation dataset by their Tc. First bin is non-superconductors, the rest are ranges of 20 K increments.
Unsurprisingly, we find a pretty imbalanced dataset. Most superconductors found have very low Tc, close to 0 K. With the 0.01-20 K range dominating so significantly, it's possible that the type of superconductivity we're learning in this range is different from that at other temperatures. This seems to be how it's understood in the science, with BCS theory describing low-temperature superconductivity and still-unsettled theories describing higher temperatures. Ideally, our model will find a unifying approximate function.
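The binning scheme in the plot can be sketched like so (the 20 K increments and the non-superconductor bin match the plot; the 140 K upper edge is an assumption for illustration):

```python
def tc_bins(tcs, width=20.0, max_tc=140.0):
    """Histogram counts in the plot's style: the first bin holds the
    non-superconductors (Tc == 0), the rest are `width`-K ranges."""
    n = int(max_tc // width)
    counts = [0] * (n + 1)
    for tc in tcs:
        if tc <= 0:
            counts[0] += 1
        else:
            counts[min(int(tc // width) + 1, n)] += 1
    return counts
```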
Zooming in on the under-predictions in this temperature region, we notice something about the nature of the materials here. Of the 20 samples, all but one are synth_doped materials (19:1). Compared to the full sample's ratio of 3:1 (synth_doped:normal), this may be showing us a flaw in the way these materials were created by the authors of the 3DSC dataset.
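A quick way to check that ratio, assuming 3DSC-style IDs where synth_doped materials carry a trailing `-synth_doped` tag (as in the material IDs shown elsewhere in this post):

```python
from collections import Counter

def tag_ratio(sample_ids):
    """Count synth_doped vs. normal entries, assuming IDs end with a
    '-synth_doped' tag (3DSC-style naming)."""
    c = Counter("synth_doped" if s.endswith("synth_doped") else "normal"
                for s in sample_ids)
    return c["synth_doped"], c["normal"]

# Toy IDs for illustration.
ids = ["A-synth_doped", "B-synth_doped", "C"]
```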
Paper here on the methodology for how that was done:
Data-driven methods, in particular machine learning, can help to speed up the discovery of new materials by finding hidden patterns in existing data and using them to identify promising candidate materials. In the case of superconductors, the use of data science tools is to date slowed down by a lack of accessible data. In this work, we present a new and publicly available superconductivity dataset (‘3DSC’), featuring the critical temperature Tc of superconducting materials additionally to tested non-superconductors.
The idea is that these materials are estimations from a chemical formula, and as such may not be correct.
What if we did some kind of distillation of materials, where we train a few models and each time remove the worst performers from the prior model, so that we arrive at a model that near-perfectly predicts this subset of materials? Obviously we'd need to watch out for overfitting, and the size of our dataset is an issue, but this could potentially lead to a model more true to nature, which would ultimately aid our use case. Anyways, it's pretty problematic when you can't even trust your dataset as the source of truth, making this project all the more challenging.
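A sketch of that distillation loop, with a trivial mean-predicting stand-in where the real pipeline would retrain the Tc model each round:

```python
import numpy as np

def distill(y, rounds=3, drop_frac=0.1):
    """Sketch of the distillation idea: fit a stand-in model (here just
    the mean of the targets), drop the worst-fit fraction of samples,
    refit. A real run would retrain the Tc classifier each round."""
    idx = np.arange(len(y))
    for _ in range(rounds):
        pred = y[idx].mean()                       # stand-in for training
        err = np.abs(y[idx] - pred)
        n_keep = max(1, int(len(idx) * (1 - drop_frac)))
        keep = np.argsort(err)[:n_keep]            # best-fit samples survive
        idx = idx[np.sort(keep)]
    return idx

y = np.array([1.0, 1.0, 1.0, 1.0, 100.0])  # one badly-fit outlier
kept = distill(y, rounds=1, drop_frac=0.2)
```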
Already, the training data for superconductors with Tc greater than 80 K is extremely limited: out of ~5700 materials in our dataset, just 152 qualify. We'll cut our training data at 90 K so we at least get a few of these high-Tc materials to train on, then evaluate the model on how well it can predict the materials we know of with Tc greater than 90 K.
In short, yes we can!
Seeing how the Orb latent space classifier model predicts materials with Tc greater than 90 K, when the model was only trained on materials with Tc less than 90 K.
The model has no problem with it. Although the predictions need some tuning, temperature upper-bounding does not seem to be an issue.
Because there is nothing directly encoding temperature in the feature set, our model is less likely to predict temperature-correlated effects. If it did, we'd be (roughly) bounded by the max Tc seen in the dataset, as the model would lean heavily on the temperature feature to make predictions. I actually saw this behavior when I tested the model with an added temperature feature, and removing it did exactly what we hoped: it removed the temperature bounding.
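The cutoff experiment amounts to a split like this (the `tc` field name is illustrative, not the dataset's actual schema):

```python
def cutoff_split(materials, cutoff=90.0):
    """Train on everything below the Tc cutoff; hold out everything at or
    above it to test extrapolation past the training range."""
    train = [m for m in materials if m["tc"] < cutoff]
    held_out = [m for m in materials if m["tc"] >= cutoff]
    return train, held_out

mats = [{"tc": 0.0}, {"tc": 35.0}, {"tc": 92.0}, {"tc": 110.0}]
train_mats, held_out = cutoff_split(mats)
```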
These are the predictions made by a model that was trained on materials with max Tc of 90 K. Here we show how well it could predict a material (Ba2Ca1Cu2Hg1O6.24-MP-mp-6879-synth_doped) with a Tc greater than this range.
We find that again we generally overestimate Tc slightly, but this is far better than being unable to predict out-of-distribution materials. Another way to interpret what the model is doing is that it is looking for the point at which superconductivity breaks: ramping up from 0 K, predict 1 until it breaks, then predict 0. While this is temperature-driven, more fundamentally it's caused by changes in the lattice and other quantum properties.
Predictions are non-deterministic because of the MD simulations involved. You get different predictions each time you run! More testing needs to be done on the consistency of predictions.
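A simple consistency check would be to repeat the prediction and look at the spread; here `predict` is a placeholder for the full MD-plus-classifier pipeline (faked below with canned values):

```python
import statistics

def prediction_spread(predict, n_runs=5):
    """Run the (stochastic) Tc predictor several times and report the
    mean and standard deviation of the estimates."""
    tcs = [predict() for _ in range(n_runs)]
    return statistics.mean(tcs), statistics.stdev(tcs)

# Fake predictor with canned values standing in for real MD runs.
vals = iter([88.0, 92.0, 90.0, 91.0, 89.0])
mu, sd = prediction_spread(lambda: next(vals), n_runs=5)
```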
Temperature instability is likely an issue for high-Tc materials. It seems that at low temps and/or smaller/simpler materials, the temp ramp is stable. At higher temps, I think there are greater fluctuations and it can cause trouble when trying to figure out at what temperature a material is no longer superconducting.
Still haven't arrived at a good criterion for when a material should be considered no longer superconducting. We have some cases of reentrant superconductivity (unconfirmed), and I've seen cases where, within the range of MD temperatures, we never find a point at which the material stops superconducting. This relates to the point above, plus the sparsity of predictions and the non-linear temperature continuity as we're heating.
We notice that near the true Tc we start to predict non-superconductivity, but it's only two points below 50%, and they aren't necessarily consecutive because of how the measured temperature fluctuates.
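One candidate criterion, sketched below: require several consecutive below-threshold points before declaring the transition, so a single fluctuation-driven dip doesn't end the run:

```python
def robust_tc(temps, sc_probs, threshold=0.5, k=3):
    """Declare the material non-superconducting only after k consecutive
    below-threshold classifier outputs, to ride out MD temperature noise."""
    run = 0
    for i, p in enumerate(sc_probs):
        run = run + 1 if p < threshold else 0
        if run == k:
            return temps[i - k + 1]  # first temperature of the run
    return None

temps = [0, 10, 20, 30, 40, 50, 60]
probs = [0.9, 0.9, 0.4, 0.9, 0.3, 0.2, 0.1]  # lone dip at 20 K is ignored
```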
I didn't do any relaxations of materials or supercells. Could be a problem.