In an effort to move away from .cif representations and abandon conventional wisdom about encoded symmetry and physically grounded rules, we've been experimenting with novel feature representations and, ultimately, new methods for generating inorganic crystal structures.
GPSK-01, GPSK-05, and GPSK-300 are heavily inspired by frontier diffusion transformer techniques, namely TRELLIS.2 and the original TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation. Each model operates on a voxel grid, the contents of which vary significantly from one model to the next.
The goal of this post is to introduce the concepts, present research results, and discuss our research direction beyond GPSK-300. This release will be closely followed by a full paper that goes deeper into the technical weeds.
A large chunk of our research effort was spent on experimenting with different ways to represent these crystal systems. There is no wrong answer here, and much of the decision making around feature representation depends on the model architecture you end up working with. When working with diffusion transformer architectures, the following general guidelines worked best for us:
features should be dense, because very sparse 3D tensors tend to behave badly under MSE-style training
features should be fixed-size, so the model always sees the same tensor shape
features should be permutation-invariant, because atoms do not come with a canonical ordering
features should be lattice-aware, so the unit cell is recoverable from the representation itself. This is more a limitation of evaluation tooling than anything else. We need a lattice to get to a pymatgen Structure object, and that is a prerequisite for any computational work downstream.
and ideally, features should be invertible, so decoding back to a crystal is a direct mathematical operation, not some hacked learned guess that gets saved by a relaxation.
The three GPSK models reflect the iterative research path taken to arrive at the above set.
GPSK-01 was our first attempt. The goal was to follow the TRELLIS playbook and model these systems as 3D objects in real space. The intuition was brutally simple: atoms and larger crystals exist in real space as 3D things, so let's just render them in a voxel grid with fractional coordinates and see what happens. Within this voxel grid, we blurred atoms into Gaussian peaks on a single-channel density field, and a transformer-based generator learned to produce new density fields that looked like the training distribution.
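As a rough sketch of that rendering (grid size and peak width here are illustrative, not the training values):

```python
import numpy as np

def render_density(frac_coords, grid=32, sigma=0.05):
    """Render atoms at fractional coordinates as Gaussian peaks on a
    single-channel voxel grid, with minimum-image wrapping so peaks
    respect the periodic boundary. A sketch, not the production code."""
    axes = np.arange(grid) / grid
    X, Y, Z = np.meshgrid(axes, axes, axes, indexing="ij")
    field = np.zeros((grid, grid, grid))
    for fx, fy, fz in frac_coords:
        # minimum-image fractional displacement in [-0.5, 0.5)
        dx = (X - fx + 0.5) % 1.0 - 0.5
        dy = (Y - fy + 0.5) % 1.0 - 0.5
        dz = (Z - fz + 0.5) % 1.0 - 0.5
        field += np.exp(-(dx**2 + dy**2 + dz**2) / (2 * sigma**2))
    return field
```

At inference time, decoding is the reverse: find local maxima in a generated field and read their grid positions back as fractional coordinates.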
This representation had a lot going for it. It was dense, it was fixed-size, it was permutation-invariant, and at inference time a structure could be recovered by finding peaks in the generated field. Conditioning on composition was relatively straightforward: pool the composition into a vector, project it, and mix it into the generator through attention or cross-conditioning.
What it was missing was any concept of the lattice itself. The density lived in fractional coordinates, which means a small unit cell and a large one looked identical to the model: the density field only saw normalized positions in $[0, 1)$. To recover a real crystal you had to either carry the lattice as side-channel metadata or make a separate guess after the fact.
This was a fundamental miss. Training was progressing well, and we were so excited to see the system do well so quickly that we overlooked some not-so-minor details. Generation worked, peak extraction worked, but the output was always a partial crystal. Something downstream had to supply the lattice, and the model had no way to learn whether its output was compatible with any particular lattice choice.
GPSK-05 was the response to GPSK-01's missing pieces (and then some). Rather than throw out the real-space representation, we built out the rest of the pipeline around it. The generator became a much larger diffusion transformer with AdaLN conditioning (around 1B parameters, trained at 128³ resolution with periodic boundary conditions via circular padding so that the network respected lattice symmetry), and every part of the real-space → crystal decode that GPSK-01 had handled heuristically became its own learned model.
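The effect of circular padding can be illustrated with a plain wrap-mode convolution: under periodic boundaries, translating the input cell just translates the output, which is exactly the lattice symmetry the network should respect. This uses `scipy.ndimage.convolve` as a stand-in for a circularly padded Conv3d layer:

```python
import numpy as np
from scipy.ndimage import convolve

def periodic_conv3d(field, kernel):
    """Convolve a 3D field with periodic ("wrap") boundaries -- the same
    boundary handling that circular padding gives a Conv3d layer."""
    return convolve(field, kernel, mode="wrap")
```

With wrap-mode boundaries, rolling the input by any number of voxels rolls the output by the same amount, so features never see a spurious cell edge.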
The full pipeline looked like this:
A periodic VAE compressed the 128³ density down to a 16³×64 latent. Circular-padded 3D convolutions preserved periodic boundary behavior, and the latent space made the main diffusion model tractable.
A DiT with AdaLN modulation generated latents conditioned on composition. Composition was pooled, added to the timestep embedding, and used to modulate every LayerNorm in the network. This worked well because the composition signal was global and could not be ignored by the model — it reshaped every activation.
A density element classifier with FiLM composition conditioning assigned elements to peaks in the decoded density. Rather than "heaviest element gets the brightest peak," the classifier learned to map local density patterns to element identities, with stoichiometry enforced by Hungarian assignment against the target composition.
A lattice predictor CNN took the generated density and regressed the six lattice parameters directly (bit of a hack, but it got us results quickly).
A CHGNet energy scan fit the absolute cell volume by holding the predicted axis ratios fixed and scanning volume per atom from 8 to 40 ų, picking the minimum. This acted like a computational equation of state at inference time (this was an Opus suggestion, and also a hack).
Best-of-N candidate selection: generate 20 candidates per request, score each by CHGNet energy and connectivity, return the best.
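The energy-scan step can be sketched as follows; `energy_per_atom` is a hypothetical stand-in for the CHGNet evaluation, and the lattice is a plain 3×3 row-vector matrix rather than a pymatgen object:

```python
import numpy as np

def volume_scan(lattice, n_atoms, energy_per_atom, vmin=8.0, vmax=40.0, steps=33):
    """Hold the predicted axis ratios and angles fixed (the lattice shape)
    and scan absolute volume per atom, keeping the minimum-energy cell.
    `energy_per_atom` stands in for the MLIP call; a sketch only."""
    # normalize to a unit-volume cell so only the shape is retained
    shape = lattice / np.cbrt(np.linalg.det(lattice))
    best_e, best_lat = np.inf, None
    for vpa in np.linspace(vmin, vmax, steps):
        cand = shape * np.cbrt(vpa * n_atoms)  # rescale to target total volume
        e = energy_per_atom(cand)
        if e < best_e:
            best_e, best_lat = e, cand
    return best_lat
```

Best-of-N selection is the same idea one level up: score each full candidate structure and return the argmin.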
Believe it or not, doing sweeps for favorable configurations after generation enables some impressive performance. GPSK-05 hit 93.5% validity, 25.27% MSUN, and 0.95% strict SUN on a 2,476-structure submission, with 98.7% novelty and 100% uniqueness. The current published state of the art on the same benchmark is Crystalite at 22.6% MSUN and MatterGen at 15.0% MSUN. GPSK-05 is SOTA for generating MSUN materials.
Crystalite generates roughly twice as many metastable structures as we do (51.6% vs 26.0%), but over half of those match existing LeMat-Bulk entries, which caps their MSUN. GPSK-05 generates fewer metastable candidates but with 98.7% novelty, so almost every metastable structure is also novel. Crystalite is very good at recovering known stable chemistry; GPSK-05 wins out by finding new stable chemistry.
These results come with massive asterisks: we had to do a great deal of modeling work post hoc. The full pipeline includes a VAE, a DiT, an element classifier, a lattice predictor, and an MLIP scan. Comparing Crystalite and other generative models for inorganic crystals against GPSK-05 isn't an apples-to-apples comparison. But it is worth noting that more exploratory, multi-step pipelines can perform incredibly well on these benchmarks.
GPSK-300 is our attempt at a cleaner solution.
GPSK-300 is a 304M-parameter multimodal diffusion transformer that generates crystals in reciprocal space rather than real space. More importantly, the representation is fully invertible: every 32³×3 grid the model generates can be decoded directly into a pymatgen Structure object, including lattice parameters, atomic positions, and composition. No auxiliary regressors. No learned decoder for the lattice. No metadata passed in on the side.
In reciprocal space, a crystal can be represented through its structure factor, which is defined on a grid of Miller indices. That gives us something extremely attractive for generative modeling:
it is naturally grid-shaped
it is much denser than atom positions in Cartesian space
and through the inverse Fourier transform, it gives us back the underlying spatial information
The structure factor on a fixed integer $(h, k, l)$ grid does not fully encode the lattice by itself. It captures fractional positions and scattering behavior well, but lattice recovery from that signal alone is too weak and too ambiguous for a generative model to reliably learn.
So GPSK-300 uses a three-channel reciprocal-space representation:
the real part of the structure factor
the imaginary part of the structure factor
a reciprocal-metric channel based on $|\mathbf{G}_{hkl}|^2$, the squared magnitude of the reciprocal-lattice vector
We represent each crystal on a 32³ grid of Miller indices $(h, k, l)$. At every grid point, the model predicts three values:
The first two channels are the familiar complex structure factor split into real and imaginary parts. We compute $F_{hkl} = \sum_j f_j\, e^{2\pi i (h x_j + k y_j + l z_j)}$ with proper angle-dependent Cromer-Mann scattering factors $f_j$ rather than atomic-number constants, so that the structure factor carries element-specific signal at every Miller index.
The third channel is the important one. It stores the reciprocal-space metric field

$$M(h, k, l) = \mathbf{m}^\top G^{*} \mathbf{m} = h^2 g^*_{11} + k^2 g^*_{22} + l^2 g^*_{33} + 2hk\, g^*_{12} + 2hl\, g^*_{13} + 2kl\, g^*_{23}, \qquad \mathbf{m} = (h, k, l)^\top,$$

where $G^*$ is the reciprocal metric tensor.
This is just a quadratic form over Miller indices. And that matters because a lattice has six degrees of freedom: $(a, b, c, \alpha, \beta, \gamma)$. The reciprocal metric tensor also has six independent coefficients. So when we evaluate that quadratic form over the entire 32³ grid, we are massively overdetermining those six lattice parameters.
In plain English: the model is not being asked to "guess the lattice." It is being asked to generate a smooth quadratic field, and from that field we can solve for the lattice directly with least squares.
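Assuming the signed Miller indices are laid out in FFT order (an implementation detail we are guessing at for illustration), both the forward construction and the least-squares recovery fit in a few lines:

```python
import numpy as np

def metric_field(gstar, grid=32):
    """Evaluate the quadratic form M(h,k,l) = m^T G* m over a Miller-index
    grid, where gstar is the 3x3 reciprocal metric tensor."""
    idx = np.fft.fftfreq(grid, d=1 / grid).astype(int)  # signed indices, FFT order
    h, k, l = np.meshgrid(idx, idx, idx, indexing="ij")
    return (gstar[0, 0] * h * h + gstar[1, 1] * k * k + gstar[2, 2] * l * l
            + 2 * gstar[0, 1] * h * k + 2 * gstar[0, 2] * h * l
            + 2 * gstar[1, 2] * k * l)

def fit_gstar(field, grid=32):
    """Recover the six G* coefficients by least squares: every grid point
    is one equation, so the system is massively overdetermined
    (32^3 equations, 6 unknowns)."""
    idx = np.fft.fftfreq(grid, d=1 / grid).astype(int)
    h, k, l = [a.ravel() for a in np.meshgrid(idx, idx, idx, indexing="ij")]
    A = np.stack([h * h, k * k, l * l, 2 * h * k, 2 * h * l, 2 * k * l], axis=1)
    c, *_ = np.linalg.lstsq(A, field.ravel(), rcond=None)
    return np.array([[c[0], c[3], c[4]],
                     [c[3], c[1], c[5]],
                     [c[4], c[5], c[2]]])
```

In the noise-free case the fit is exact; under generation noise, the 32³-to-6 redundancy is what keeps the solve stable.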
That is the key design choice in GPSK-300, and it is the piece that GPSK-01 and GPSK-05 never had.
The full pipeline has two stages.
First, a 3D VAE compresses the 32³×3 reciprocal-space grid down to an 8³×64 latent. This gives us a compact, tractable space for the diffusion process. The VAE is much smaller than GPSK-05's (about 2.4M parameters vs ~80M) because the reciprocal representation is easier to compress cleanly: the metric channel in particular is a smooth quadratic, which convolutional networks encode almost losslessly.
A multimodal diffusion transformer operates in this latent space. Unlike GPSK-05's pure AdaLN DiT, the MMDiT uses double-stream blocks — joint attention over image tokens and conditioning tokens — before falling through to single-stream blocks on the merged sequence. AdaLN-Zero modulation on every block gives a global conditioning signal that the model cannot ignore, while the double-stream attention lets conditioning tokens attend to specific spatial locations in the latent grid. Conditioning covers composition, crystal system, space group, band gap, formation energy, energy above hull, and magnetic ordering.
Training uses rectified flow matching. Given a data latent $x_1$ and Gaussian noise $x_0$, we interpolate $x_t = (1 - t)\,x_0 + t\,x_1$ and train the model to predict the velocity $v = x_1 - x_0$. At inference we integrate with 50 Euler steps and classifier-free guidance, dropping conditioning tokens 10% of the time during training.
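A minimal numpy sketch of that objective and the Euler sampler, with conditioning and classifier-free guidance omitted for brevity; `model` is a stand-in for the MMDiT:

```python
import numpy as np

def flow_matching_step(model, x1, rng):
    """One rectified-flow training step: interpolate between Gaussian
    noise x0 and a data latent x1 at a random time t, and score the
    prediction against the constant target velocity x1 - x0."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.random()
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t)
    return np.mean((v_pred - (x1 - x0)) ** 2)

def sample(model, shape, rng, steps=50):
    """Integrate the learned velocity field from pure noise at t=0 to a
    latent at t=1 with fixed-step Euler updates."""
    x = rng.standard_normal(shape)
    for i in range(steps):
        x = x + model(x, i / steps) / steps
    return x
```

The real model also receives the conditioning tokens at every step; this sketch only shows the time axis.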
That sounds like a lot, but conceptually it is pretty simple:
the representation makes inversion possible
the VAE makes training practical (free compression)
the flow model learns how to generate new latent crystal representations under conditioning
The novelty here is less about inventing a new diffusion trick and more about choosing a representation that lets the model generate something physically structured and decodable from the start.
Once the model produces a 3-channel grid, reconstruction is closed-form.
For atom positions, we combine the real and imaginary structure-factor channels into a complex field and take an inverse FFT. That gives a fractional electron-density map. Peaks in that map correspond to atomic sites.
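A toy version of that decode, assuming the two channels are stored in standard FFT index order (our assumption here, not a documented layout), and using a crude top-k voxel selection in place of proper local-maximum peak finding:

```python
import numpy as np

def density_from_channels(re_f, im_f):
    """Combine the real/imaginary structure-factor channels into a complex
    field and inverse-FFT to a real-space density on the fractional grid."""
    return np.real(np.fft.ifftn(re_f + 1j * im_f))

def top_peaks(rho, n):
    """Return fractional coordinates of the n largest voxels -- a crude
    stand-in for real peak extraction."""
    grid = rho.shape[0]
    flat = np.argsort(rho.ravel())[::-1][:n]
    return np.stack(np.unravel_index(flat, rho.shape), axis=1) / grid
```

For a single on-grid atom, the inverse FFT of its structure factor collapses to a delta at the atom's fractional position, which is the property the decoder relies on.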
For the lattice, we fit the quadratic form in channel 2 across the Miller-index grid and recover the reciprocal metric tensor. From there, we invert into the real-space metric and read off $(a, b, c, \alpha, \beta, \gamma)$. No CNN regressor, no CHGNet volume scan.
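The inversion step is closed-form. A minimal sketch, assuming the crystallographic (no $2\pi$) convention for the reciprocal metric:

```python
import numpy as np

def lattice_from_gstar(gstar):
    """Invert the fitted reciprocal metric tensor to the real-space metric
    g = inv(G*) and read the six lattice parameters off its entries:
    lengths from the diagonal, angles from the off-diagonal dot products."""
    g = np.linalg.inv(gstar)
    a, b, c = np.sqrt(np.diag(g))
    alpha = np.degrees(np.arccos(g[1, 2] / (b * c)))
    beta = np.degrees(np.arccos(g[0, 2] / (a * c)))
    gamma = np.degrees(np.arccos(g[0, 1] / (a * b)))
    return a, b, c, alpha, beta, gamma
```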
Composition assignment is still the least elegant part of the pipeline. Right now, the decoder assumes the requested formula and assigns heavier elements to brighter peaks. This works surprisingly well in many cases, because Cromer-Mann scattering magnitude scales with atomic number, so heavy atoms do naturally produce the brightest peaks. But it is still heuristic, especially for neighboring elements like Fe/Co/Ni where scattering intensity is too similar to cleanly separate identity. This is high on the list of future improvements; the representation is fully invertible in the sense that geometry and lattice are directly recoverable, but element identity still has room to become more principled. A more careful decode would fit the Cromer-Mann coefficients directly from the pattern around each peak and extract that way — something GPSK-05's learned classifier essentially did, but which the structure factor makes accessible in closed form.
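The heuristic as described can be sketched like this; the formula representation is a simplified stand-in (a list of `(element, Z)` entries, one per site) rather than a pymatgen Composition:

```python
import numpy as np

def assign_elements(peak_intensities, formula):
    """Current heuristic decode: sort peaks by brightness, sort the
    requested formula's sites by atomic number, and pair them off.
    Breaks down when neighboring-Z elements scatter too similarly."""
    order_peaks = np.argsort(peak_intensities)[::-1]      # brightest first
    order_atoms = sorted(formula, key=lambda ez: -ez[1])  # heaviest first
    return {int(p): elem for p, (elem, _) in zip(order_peaks, order_atoms)}
```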
The strongest result here is that the representation actually behaves the way we hoped.
The reciprocal-metric channel makes lattice recovery extremely stable. In the noise-free setting, fitting the metric tensor from channel 2 recovers lattice parameters with essentially exact numerical accuracy. In practice, the VAE also reconstructs that channel extremely well (MSE around $10^{-6}$), which is a good sign that the representation is easy for the network to learn.
On the generation side, the model shows strong extraction success and good lattice accuracy in domains that are well represented in the data. On a targeted evaluation of 21 magnetic compositions with 10 candidates each, we see 95% extraction success and median lattice ratios of 0.98–0.99 against reference values. All seven L1₀ tetragonal magnets in the test set (FePt, CoPt, FeNi, FePd, MnAl, MnGa, MnAlC) come back with near-perfect lattices. All three hexagonal rare-earth transition-metal compounds (SmCo₅, YCo₅, CeCo₅) do as well.
On conditioning, all six modalities eventually emerge as measurable signals in the generated output. Space group and crystal system come in first during training, composition next, and the continuous properties (band gap, formation energy, magnetic ordering) emerge later. With best-of-N sampling, all six show strong correlation response at the final checkpoint. TiO₂'s c-axis cleanly scales with the formation energy condition. Fe₂O₃ under AFM conditioning produces a $c/a$ ratio very close to real hematite's 2.72.
For some classes of materials, especially simple rock-salt ionics like NaCl, MgO, CaO, NiO, and MnO, the model tends to generate structures with lattice constants that are systematically too small — roughly 25 to 30 percent below the reference. The motifs are often right. The symmetry is often right. The composition is right. But the cell is compressed.
This is the class of error GPSK-05 papered over with its CHGNet volume scan. The DiT would produce a reasonable density field, the lattice predictor would give approximate axis ratios and angles, and then the energy scan would do the heavy lifting to find the correct absolute cell size by walking volume per atom. GPSK-300 doesn't have that fallback, because we explicitly chose not to rely on post-hoc energy relaxation.
Bringing in a stability-aware reward signal, likely through reinforcement learning-style fine-tuning or diffusion-time optimization against an ML interatomic potential, is a clear next step.
GPSK-01 showed that voxel density fields in fractional coordinates can be learned and decoded into rough crystals. It also showed that real-space density alone is incomplete — you get atom positions, not the lattice.
GPSK-05 responded by layering learned networks on top of the density representation: a VAE, a DiT, a classifier for elements, a regressor for the lattice, and an MLIP for absolute cell size. The results were strong, but primarily because of time spent post-processing.
GPSK-300 moved the problem down into the representation itself. Once you express a crystal in reciprocal space with the right three channels, geometry and lattice both become directly recoverable, and the decoder collapses into a few lines of FFT and linear algebra. The model only has to generate a coherent grid — everything else is closed-form.
That does not solve every part of crystal generation. The element identity decode is still heuristic. Simple ionic lattices are still systematically compressed. Multi-formula-unit supercells still hit ceilings imposed by the 32³ grid. But it does solve one of the most annoying structural problems in the pipeline: how to make the model output something that already contains the full crystal, instead of something that still needs to be recovered or massaged back into one.
We're going to continue to run with this reciprocal space representation, and we'll report back with findings soon. More to come.