In this post I'll share some of the work I've been doing on a Curie temperature prediction model. I finally found a decent dataset to work with. More on that here:
Sharing some notes as I read this paper. I uploaded it here for reference. I came across it while looking for a Curie temperature dataset, and it's the best I've found so far.
The authors graciously made a repo available with their compiled dataset. Like many datasets compiled from literature, it contains only chemical formula and Curie temperature, so the first part of this project was matching chemical formulas to crystal structures.
So far, all I've done is standardize the formulas (reduce them to integer stoichiometry, as Materials Project expects), then search the database for matches. A more robust approach might borrow from what the authors of 3DSC did to match formulas from the SuperCon dataset. Additionally, I've only searched Materials Project, so leveraging ICSD is an easy next step.
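pymatgen's `Composition` handles this properly; just to make the step concrete, here's a stdlib-only sketch of the reduction (a naive parser that ignores parentheses, hydrates, and repeated elements):

```python
import re
from fractions import Fraction
from math import gcd, lcm

def standardize(formula: str) -> str:
    """Convert a possibly fractional formula (e.g. Fe0.5Co0.5) to the
    smallest integer stoichiometry (FeCo). Naive: no parentheses support."""
    tokens = re.findall(r"([A-Z][a-z]?)([0-9.]*)", formula)
    amounts = {el: Fraction(amt) if amt else Fraction(1) for el, amt in tokens if el}
    # Scale by the LCM of denominators so every subscript becomes an integer,
    # then divide by the GCD to reach the smallest integer formula.
    scale = lcm(*(a.denominator for a in amounts.values()))
    ints = {el: int(a * scale) for el, a in amounts.items()}
    g = gcd(*ints.values())
    return "".join(f"{el}{n // g if n // g > 1 else ''}" for el, n in ints.items())
```

So `standardize("Fe0.5Co0.5")` gives `"FeCo"`, the integer form the database search expects.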
From the original ~35,000 rows, there were ~13,000 unique chemical formulas; many rows were duplicate formulas with differing Curie temperatures. There was no further information on how to properly deduplicate or which Curie temperature to go with, so while grouping by chemical formula I kept summary statistics: count, mean, min, max, and standard deviation.
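In pandas this is a single `groupby().agg()`; the same idea in plain Python (the stat names here are my own):

```python
from collections import defaultdict
from statistics import mean, pstdev

def dedupe(rows):
    """rows: (formula, curie_temp) pairs -> per-formula summary statistics."""
    groups = defaultdict(list)
    for formula, tc in rows:
        groups[formula].append(tc)
    return {
        f: {
            "count": len(tcs),
            "mean": mean(tcs),
            "min": min(tcs),
            "max": max(tcs),
            # population std dev; 0 for singletons rather than an error
            "std": pstdev(tcs) if len(tcs) > 1 else 0.0,
        }
        for f, tcs in groups.items()
    }
```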
In general, the standard deviations are low, within ~10 degrees. The exceptions are worth exploring further to understand why. It's very likely those temperatures represent different crystal structures with the same stoichiometry, and capturing that nuance would be very useful for this use case. Perhaps we could go back to the literature where they were first published and check for any information on crystal structure.
Looking at the 13,000 unique chemical families, we take the average Curie temperature for each and plot a histogram of those temperatures.
```
Number of materials: 12977
Mean Tc: 315.4 K
Median Tc: 256.0 K
Min Tc: 0.0 K
Max Tc: 1434.1 K
Materials with Tc > 298 K: 5682 (43.8%)
```
This distribution matches what I've seen for the expected share of Curie temperatures below room temperature. I'd seen figures that 50-60% of materials have Tc below room temperature, therefore disqualifying them as candidates for useful permanent magnets. Good to see we have a similar distribution in our dataset.
You can find the compiled dataset here:
This is a first draft of a compiled Curie temperature dataset mapping crystal structure (from Materials Project) to Curie temperature. Builds on the work of https://github.com/Songyosk/CurieML. Dataset includes ~6,800 unique materials representing 3,284 unique chemical families.
Searching the 13,000 chemical formulas in the Materials Project database found ~6,800 different crystal structures representing 3,284 unique chemical families. This is because a single stoichiometry can have multiple matches in the database; for example, ZrVFe matches mp-1215241 and mp-1215261. Looking at this example more closely, there does not appear to be a significant difference between the two structures. In cases like this, we can use minimum energy above hull to choose the more likely structure.
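The tie-break is just a min over energy above hull (the entry closest to the convex hull is the most likely observed structure). With placeholder hull energies, illustrative only and not the real values for these two entries:

```python
# Illustrative only: placeholder energy_above_hull values (eV/atom),
# not the real Materials Project data for these entries.
candidates = [
    {"material_id": "mp-1215241", "energy_above_hull": 0.02},
    {"material_id": "mp-1215261", "energy_above_hull": 0.05},
]

# Pick the most thermodynamically stable match for the duplicate formula.
best = min(candidates, key=lambda c: c["energy_above_hull"])
```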
Even though we were able to find 6,800 matches in MP, only ~3,000 unique stoichiometries are represented, with no clear answer as to which structure is correct. It's also very possible the actual structure is not among the matches. Some of this we will never know, so we need to be okay with "good enough".
Following some of the techniques from 3DSC, we could:

- Instead of only looking for exact matches, allow matching chemical formulas that differ by a constant factor. For example, CuLa2O4 could be matched with Cu2La4O8 (with a relative factor of 1/2).
- Rank matches by: a) energy above hull (Ehull), lower values preferred; b) total weighted relative difference (Δtotrel), lower values preferred.
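The constant-factor check is cheap to sketch: parse both formulas and see whether every element's amount differs by the same ratio (naive parser, no parentheses):

```python
import re
from fractions import Fraction

def parse(formula):
    """Naive formula parser: 'Cu2La4O8' -> {'Cu': 2, 'La': 4, 'O': 8}."""
    return {el: Fraction(n) if n else Fraction(1)
            for el, n in re.findall(r"([A-Z][a-z]?)([0-9.]*)", formula)}

def relative_factor(f1, f2):
    """Return the constant factor relating two formulas, or None if they
    are not the same stoichiometry (e.g. CuLa2O4 vs Cu2La4O8 -> 1/2)."""
    a, b = parse(f1), parse(f2)
    if set(a) != set(b):
        return None
    ratios = {a[el] / b[el] for el in a}
    return ratios.pop() if len(ratios) == 1 else None
```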
As any data scientist knows, your time is better spent on the dataset; nothing more is needed beyond XGBoost/CatBoost and some cross-validation. Funny enough, the more papers I read on property prediction, the more I find people 'independently' discovering that random forests/decision trees do the best. Same findings in the DS world.
I'll be going back to continue to improve the dataset but I wanted to have something end-to-end before refining.
This experiment takes the same approach as my earlier work on critical temperature prediction for superconductors: we take the latent vector of a GNN (graph neural network) MLIP (machine learning interatomic potential) model and use it as our feature vector.
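Concretely, the fixed-length feature comes from pooling the model's per-atom embeddings into one crystal-level descriptor; assuming a simple mean-pool readout (the actual readout in CHGNet/Orb may differ), the idea is:

```python
def crystal_feature(atom_embeddings):
    """Mean-pool per-atom GNN embeddings (n_atoms x d) into a single
    length-d crystal-level feature vector, independent of cell size."""
    n = len(atom_embeddings)
    d = len(atom_embeddings[0])
    return [sum(atom[j] for atom in atom_embeddings) / n for j in range(d)]
```

Because the pool averages over atoms, the same stoichiometry in a 1x and 2x supercell yields the same feature vector.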
I've been finding that an MLIP matters here, compared to a plain MLFF, because of its magnetic understanding of the material. We see those results replicated here, where CHGNet outperforms Orb despite Orb generally performing better on MD and on benchmarks.
Interestingly, the CHGNet feature vector is only 128-dimensional where Orb's is 256. Additionally, CHGNet is ~500K parameters where Orb is ~25M...
I'll be evaluating some other models with magnetic understanding soon too, likely pushing performance even further. For example, SevenNet:
New MLIP model on the leaderboards! Currently #2 with an F1 score of 0.884. Congrats to the team. They provide a few pre-trained models as well as an ASE calculator for MD. Great stuff.
Decent results so far. I know there is still a lot we can do to improve the dataset, and training on multiple structures with the same Tc doesn't feel right, so I'm not putting much weight on these initial results. There is also plenty of feature engineering to do: beyond the latent vector, we could include the mean and sum magnetic moment. This was found to be the single most predictive feature by Jung et al., the original authors of this dataset.
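Tacking those on is trivial once we have per-site magnetic moments; a hypothetical helper (the point is just concatenating scalars onto the latent vector):

```python
def augment_features(latent, site_magmoms):
    """Append mean and sum magnetic moment to the GNN latent vector.
    (Mean magnetic moment was the top feature in Jung et al.)"""
    mean_m = sum(site_magmoms) / len(site_magmoms)
    return list(latent) + [mean_m, sum(site_magmoms)]
```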
With an R-squared value of 0.89, we can look at expected vs. predicted temperatures for a test set of ~1,200 materials.
Not bad. There's a lot of variance below 400 K, but if we're primarily using this model as a material screening filter, that range matters less. Despite a relatively high R², I think there is a lot of room to improve.
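For reference, R² here is just one minus the residual sum of squares over the total sum of squares:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect model scores 1.0; always predicting the mean Tc scores 0.0, so 0.89 on the held-out set is well above the trivial baseline.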
Cross-validation also told a somewhat different story. In 5-fold cross-validation, the metrics were as follows:

```
Best R² score: 0.713
Mean MAE: 111.523
Mean RMSE: 165.627
```
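scikit-learn's `KFold` does the splitting; mechanically (without the shuffling you'd normally add), it's just:

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists; each sample lands in exactly one
    test fold, so every material is scored out-of-sample once."""
    # Spread the remainder across the first n % k folds
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx = list(range(n))
    start = 0
    for size in sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size
```

One caveat for this dataset: with multiple structures sharing a formula, a grouped split (all structures of a formula in the same fold) would give a more honest estimate than plain K-fold.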