Victor from Lila Sciences sent me this paper he co-authored after he saw some of the work we were doing on AI agents for materials discovery. Check out the paper here:
Discovering new materials can have significant scientific and technological implications but remains a challenging problem today due to the enormity of the chemical space. Recent advances in machine learning have enabled data-driven methods to rapidly screen or generate promising materials, but these methods still depend heavily on very large quantities of training data and often lack the flexibility and chemical understanding often desired in materials discovery. This paper introduces LLMatDesign, a novel language-based framework for interpretable materials design powered by large language models (LLMs).
Right away from the abstract I see we share the same reasoning for why one might want to use LLMs to do discovery:
Recent advances in machine learning have enabled data-driven methods to rapidly screen or generate promising materials, but these methods still depend heavily on very large quantities of training data and often lack the flexibility and chemical understanding often desired in materials discovery.
Lack of data means good machine learning models can't be trained, so instead we try to use intelligence to go directly to a solution.
For rare-earth-free permanent magnet discovery, this was a motivating factor, both because of the difficulty of predicting MAE (magnetocrystalline anisotropy energy) and because of the need to take the rare-earth-free constraint into account.
... These methods are less useful in most instances where such data is unavailable, or when only a limited budget exists to perform experiments or high fidelity simulations.
And budget is a factor too, which they call out.
Let's get into the design of LLMatDesign.
Overview of LLMatDesign. The discovery process with LLMatDesign begins with user-provided inputs of chemical composition and target property. It recommends modifications (addition, removal, substitution, or exchange), and uses machine learning tools for structure relaxation and property prediction. Driven by an LLM, this iterative process continues until the target property is achieved, with self-reflection on past modifications fed back into the decision-making process at each step.
It's a little shocking how similar our workflows are. Honestly, close enough to say they're identical, except for the DFT step, which I don't have access to.
It took me a few iterations of design to converge on the idea of letting the LLM make modifications to the crystal (I called them mutations, à la genetic algorithms). And here it is, the same approach.
We had trouble using generative crystal structure models because, from iteration to iteration, the crystals would be too different to learn a "gradient" and continue to improve.
We had trouble getting the LLM to write the structure directly because of the complexity involved with CIF generation and Wyckoff rules. We are still working on this, though.
So, finding a happy medium between maximum and minimum control over the structure, we landed on the mutation approach.
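For concreteness, here's a minimal sketch of how these mutations can be expressed on a pymatgen `Structure`. The four operations mirror the paper's modification types (addition, removal, substitution, exchange), but the function names and shapes here are mine, not theirs:

```python
from pymatgen.core import Structure

def add_atom(s: Structure, species: str, frac_coords) -> None:
    """Addition: append a new site at the given fractional coordinates."""
    s.append(species, frac_coords)

def remove_atom(s: Structure, index: int) -> None:
    """Removal: delete the site at `index`."""
    s.remove_sites([index])

def substitute_atom(s: Structure, index: int, species: str) -> None:
    """Substitution: change the species occupying a single site."""
    s.replace(index, species)

def exchange_atoms(s: Structure, i: int, j: int) -> None:
    """Exchange: swap the species of two sites, keeping their positions."""
    si, sj = s[i].species_string, s[j].species_string
    s.replace(i, sj)
    s.replace(j, si)
```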
LLMatDesign modifies the material based on the given suggestion, relaxes the structure using a machine learning force field (MLFF), and predicts its properties using a machine learning property predictor (MLPP). If the predicted property of the new material does not match the target value within a defined threshold, LLMatDesign then evaluates the effectiveness of the modification through a process called self-reflection, where commentary is provided on the success or failure of the chosen modification.
Very cool. I wonder if they had the same issue we did of most structures ending up as P1. I love the mutation approach, but applying mutations directly in crystal space is not going to work, I think. Adding a new site or swapping atoms works, but after relaxation you're likely to end up with a very different structure. Symmetry that held before the operation is unlikely to hold after it, and the usual result is falling all the way down to P1.
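Stepping back to the loop itself, here's a minimal sketch of the propose → mutate → relax → predict → reflect cycle as I read it from the paper, for the band gap objective. `llm_suggest`, `apply_modification`, `mlff_relax`, `mlpp_predict`, and `llm_reflect` are hypothetical stand-ins for the LLM call, the mutation step, the MLFF, the property predictor, and the self-reflection call; the 10% threshold is my assumption based on the error buffer mentioned later:

```python
def design_loop(structure, target, threshold=0.10, budget=50):
    """Iterate LLM-proposed modifications until the predicted property
    lands within `threshold` (relative) of `target`, or the budget runs out."""
    history = []  # (modification, predicted value, reflection) tuples
    for _ in range(budget):
        mod = llm_suggest(structure, target, history)   # add / remove / substitute / exchange
        candidate = apply_modification(structure, mod)  # mutate the crystal
        candidate = mlff_relax(candidate)               # relax with the MLFF
        value = mlpp_predict(candidate)                 # predict the target property
        if abs(value - target) / abs(target) <= threshold:
            return candidate, history                   # converged within the buffer
        reflection = llm_reflect(mod, value, target)    # commentary on success/failure
        history.append((mod, value, reflection))        # fed back into the next prompt
        structure = candidate
    return structure, history
```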
In their experiments, they focus on designing materials targeting two properties, each with a corresponding objective:
Band gap (eV): design a new material with a band gap of 1.4 eV.
Formation energy per atom (eV/atom): design a new material with the most negative formation energy possible.
The authors record the average number of modifications taken by LLMatDesign, with a budget of up to 50 modifications. For the formation energy experiments, a fixed budget of 50 modifications is used, and both the average and minimum formation energies are recorded. Each experiment is repeated 30 times per starting material.
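As a rough sketch of that protocol, here's a hypothetical harness; `run_once` stands in for a single LLMatDesign run that returns the number of modifications used:

```python
import statistics

def evaluate(starting_materials, run_once, n_trials=30, budget=50):
    """Repeat the experiment 30 times per starting material and record
    the average number of modifications needed to reach the target."""
    results = {}
    for material in starting_materials:
        counts = [run_once(material, budget) for _ in range(n_trials)]
        results[material] = statistics.mean(counts)
    return results
```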
Average band gaps and formation energies over 50 modifications. The grey horizontal line indicates the target band gap of 1.4 eV. The colored dots on the x-axis indicate the average number of modifications taken for each method to reach the target. For formation energy, the goal is to achieve the lowest possible value.
We observe that GPT-4o with past modification history performs the best in achieving the target band gap value of 1.4 eV, requiring an average of 10.8 modifications (Table 1). In comparison, Gemini-1.0-pro with history takes an average of 13.7 modifications. Both methods significantly outperform the baseline, which requires 27.4 modifications. Adding modification history to subsequent prompts allows the LLMs to converge to the target more quickly, as both Gemini-1.0-pro and GPT-4o with modification history outperform their historyless counterparts.
The results are clearly better than random, but I'm surprised by the lack of differentiation between the versions with and without mutation history. There is some difference, but it's not as much as we'd hope.
Both history and historyless variants of Gemini-1.0-pro and GPT-4o demonstrate quick convergence to the target band gap. However, the GPT-4o historyless variant exhibits zig-zag oscillations in band gap values as modifications increase. This occurs because, without historical information, GPT-4o tends to oscillate between a few of the same moves, causing the band gap to fluctuate without improving.
Makes sense. It seems to imply intelligence, in that the model knows the right move, but without history it can't fine-tune that move to reach the target.
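To make that concrete, here's a hypothetical version of how the modification history might be folded back into the prompt; the wording is my guess, not the paper's template:

```python
PROMPT = """You are a materials design assistant.
Current material: {composition}
Goal: a band gap of {target} eV (current prediction: {current} eV).

Past modifications and outcomes:
{history}

Propose ONE modification (add, remove, substitute, or exchange atoms)
that moves the band gap toward the target, and explain why."""

def render_history(history):
    """Format (modification, predicted value, reflection) tuples for the prompt."""
    if not history:
        return "None yet."
    return "\n".join(
        f"{i + 1}. {mod} -> predicted {value:.2f} eV. Reflection: {reflection}"
        for i, (mod, value, reflection) in enumerate(history)
    )
```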
To quantify the effect of self-reflection on the performance of LLMatDesign, we conduct band gap experiments using GPT-4o and the same set of 10 starting materials, where we aim to find a new material with a target band gap of 1.4 eV. ... As previously discussed, GPT-4o with history achieves an average of 10.8 modifications, while GPT-4o without history requires 26.6 modifications. In comparison, GPT-4o with history but without self-reflection now needs an average of 23.4 modifications, which is over twice as many compared to including self-reflection.
Great finds. Though I wonder if this is cherry-picked a bit, because the historyless version also appears to get near the target quite quickly, but it oscillates and stays outside the 10% error buffer for a while.
It is clear the history version is better, and clear that adding self-reflection does indeed help. I'm just curious why history without self-reflection is almost as bad as historyless (an average of 26.6 mutations for historyless vs. 23.4 for history without self-reflection).
Well-crafted prompts are essential for eliciting accurate and useful responses from LLMs. While the base prompt template, shown in Fig. 2, works as intended, we subsequently show that optimizing this prompt can improve the performance of LLMatDesign even further.
Hey! This is where using DSPy seems to have been a good choice in my implementation. Prompts are clearly important and their effectiveness changes across models, so why not let the model write them itself? It also appears the authors didn't use chain-of-thought, which in 2025 is a well-known way to squeeze out more performance.
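For reference, here's roughly what that looks like in DSPy, using a typed signature and `dspy.ChainOfThought` so the model reasons before committing to a move. The signature and field names are mine, and `openai/gpt-4o` is just a placeholder model string:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o"))  # placeholder model choice

class SuggestModification(dspy.Signature):
    """Propose one modification (add, remove, substitute, or exchange)
    to move the material toward the target property."""
    composition: str = dspy.InputField(desc="current chemical formula")
    target: str = dspy.InputField(desc="target property and value")
    history: str = dspy.InputField(desc="past modifications, outcomes, reflections")
    modification: str = dspy.OutputField(desc="the single proposed modification")

suggest = dspy.ChainOfThought(SuggestModification)
pred = suggest(composition="SrTiO3", target="band gap of 1.4 eV", history="None yet.")
print(pred.reasoning, pred.modification)  # ChainOfThought adds a reasoning field
```

The nice part is that DSPy's prompt optimizers can then rewrite the instructions for you, which is exactly the "let the model write the prompt" idea above.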
Materials discovery with constraints ensures scientific, economic, and political viability. For instance, avoiding the use of rare earth metals can reduce dependency on limited and expensive resources, mitigate supply chain risks, and align with environmental and ethical standards.
Cool! Exact reason we're going with this approach too. Rare-earth-free permanent magnet design has these exact requirements.
Though this paper is just over a year old, I feel like it was the start of something a lot of researchers and orgs have been working on for the past year: automated materials discovery, encompassing the computational, exploratory, and experimental parts of the process.
I know of a handful of companies that are working on it, and I even made a post trying to learn more about some of them, Lila Sciences being one of these. https://x.com/mmoderwell/status/1969055212745736452
I've made my AI scientist open source after reading this too. https://github.com/ourofoundation/scientist
Both LLMatDesign and my own implementation found good results. The approach does almost exactly what you want it to do, but there are still some shortcomings.
Mutating a crystal works, but it ignores symmetry preservation and the idea of making the smallest possible change so that you can accurately see how your hypothesized change affects the properties. This is because of the MLIP relaxation step after mutation, and it's not like we can skip that step.
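One cheap guardrail would be to check the space group before and after the mutation-plus-relaxation, so you at least know when a move has collapsed the structure to P1. A minimal sketch with pymatgen's `SpacegroupAnalyzer` (the `symprec` tolerance is a judgment call):

```python
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def symmetry_report(before, after, symprec=0.1):
    """Return the space groups before/after a mutation and whether
    the structure has fallen all the way down to P1."""
    sg_before = SpacegroupAnalyzer(before, symprec=symprec).get_space_group_symbol()
    sg_after = SpacegroupAnalyzer(after, symprec=symprec).get_space_group_symbol()
    return sg_before, sg_after, sg_after == "P1"
```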
More work is needed on MLPPs. If this is to be a cost-effective alternative to building huge generative models, you need to be able to predict the properties you're after efficiently. That assumes data availability, an assumption we noted earlier often doesn't hold. For REFPM discovery, this is still out of reach for MAE, but not for long.