LLaDA challenges the conventional reliance on autoregressive models (ARMs) for large language modeling. Instead of predicting text token by token, LLaDA uses a diffusion framework with a forward “masking” process and a reverse process that simultaneously recovers multiple tokens. This bidirectional, iterative approach emphasizes a more general generative modeling principle—minimizing KL divergence between the data distribution and the model distribution—rather than being inherently tied to autoregressive formulations.
The motivation for digging a bit deeper into this paper was the hope of uncovering meaningful insights that could translate into our follow-on experiments with MatterGen.
Some of the key innovations with respect to the LLM domain include:
Forward Diffusion via Random Masking:
Instead of a fixed masking ratio as in traditional masked language models, LLaDA applies a variable masking probability (sampled uniformly over [0, 1]) across the entire sequence. This flexible scheme allows the model to learn a principled likelihood bound and to potentially capture richer contextual dependencies (a minimal sketch of the scheme follows this list).
Reverse Process with Mask Predictor:
A vanilla Transformer is used to predict all masked tokens simultaneously during the reverse process. This non-sequential approach grants the model bidirectional context and helps alleviate issues such as high inference latency and limitations in reversal reasoning (e.g., tasks where predicting “what came before” is challenging for ARMs). A sketch of the sampling loop also follows the list.
Scalability and Generality:
LLaDA demonstrates competitive performance on tasks including in-context learning and instruction following, showing that diffusion models can scale comparably to autoregressive models even when trained from scratch.
Reversal Reasoning:
One particularly interesting aspect is its ability to “reverse” the generation process—yielding balanced performance when asked to complete text both forward and backward—which may be useful in domains where symmetry or bidirectionality is beneficial.
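To make the forward masking scheme concrete, here is a minimal sketch of the corruption step and the reweighted loss it gives rise to, assuming a PyTorch-style setup. The mask id, function names, and normalization details are placeholders of my own, not the paper's released code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 32000  # hypothetical id for the reserved [MASK] token, not the paper's actual value


def forward_mask(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """LLaDA-style forward masking for a batch of token ids of shape (batch, seq_len).

    For each sequence a masking level t ~ Uniform(0, 1) is drawn, and every token
    is independently replaced by [MASK] with probability t.
    """
    batch, seq_len = tokens.shape
    t = torch.rand(batch, 1).clamp_min(1e-3)       # one masking level per sequence; clamp so 1/t stays finite
    is_masked = torch.rand(batch, seq_len) < t     # Bernoulli(t) per token
    noisy = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
    return noisy, is_masked, t


def training_loss(mask_predictor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked positions, reweighted by 1/t, in the spirit of the likelihood bound."""
    noisy, is_masked, t = forward_mask(tokens)
    logits = mask_predictor(noisy)                              # (batch, seq_len, vocab)
    ce = F.cross_entropy(logits[is_masked], tokens[is_masked], reduction="none")
    weights = (1.0 / t).expand_as(tokens)[is_masked]            # 1/t for the sequence each masked token came from
    return (ce * weights).sum() / tokens.numel()
```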
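And a companion sketch of the reverse process: start from a fully masked response, let the Transformer predict every masked token in one pass, keep the most confident predictions, and re-mask the rest on a shrinking schedule. The `mask_predictor` callable, the linear schedule, and the low-confidence remasking here are simplifications and placeholders; the paper discusses more than one remasking strategy.

```python
import torch


def generate(mask_predictor, prompt: torch.Tensor, gen_len: int, steps: int, mask_id: int) -> torch.Tensor:
    """Sketch of reverse-diffusion sampling. `prompt` is a (1, prompt_len) long tensor of token ids."""
    x = torch.cat([prompt, torch.full((1, gen_len), mask_id, dtype=torch.long)], dim=1)
    for step in range(steps):
        is_masked = x == mask_id
        logits = mask_predictor(x)                        # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # confidence and argmax per position

        # Fill every masked position with its prediction...
        x = torch.where(is_masked, pred, x)

        # ...then re-mask the lowest-confidence predictions so the masked
        # fraction shrinks linearly from 1 to 0 over `steps`.
        n_keep_masked = int(gen_len * (1 - (step + 1) / steps))
        if n_keep_masked > 0:
            conf = conf.masked_fill(~is_masked, float("inf"))  # never re-mask already-known tokens
            remask_idx = conf.topk(n_keep_masked, dim=-1, largest=False).indices
            x.scatter_(1, remask_idx, mask_id)
    return x
```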
On the input representation front, this random masking approach resembles one of the three feature-specific noising strategies MatterGen employs. The idea of symmetry preservation (preservation is probably not the right word; “symmetric understanding,” maybe?) in the denoising process seems like it could be more than a nice gimmick that current LLMs happen to struggle with. That said, it's also easy to see how reversing a list of words is far simpler than maintaining three-dimensional symmetry in a unit cell.
Broadly, this stark difference in input feature complexity illustrates what I think should be a core focus area for us: how best to represent a crystal structure for any given modeling approach. MatterGen effectively decomposes the input structures into atom types, fractional coordinates, and lattice parameters, each of which is noised by a slightly different process depending on that feature's preservation requirements (periodicity for the coordinates, a categorical representation for the atom types, symmetry constraints for the lattice parameters); a toy sketch of this decomposition follows below. Maybe there is room to improve here with extra granularity, a different coordinate system, a separate symmetry feature, and so on.
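To make that decomposition concrete, here is a toy sketch of what per-feature corruption could look like. This is emphatically not MatterGen's implementation, just an illustration of giving each factor a noise process matched to its geometry; the names, noise choices, and the crude lattice symmetrization are all mine.

```python
from dataclasses import dataclass

import torch


@dataclass
class Crystal:
    atom_types: torch.Tensor   # (n_atoms,) integer element ids
    frac_coords: torch.Tensor  # (n_atoms, 3) fractional coordinates in [0, 1)
    lattice: torch.Tensor      # (3, 3) lattice matrix

MASK_TYPE = 0  # hypothetical "unknown element" category


def corrupt(crystal: Crystal, t: float) -> Crystal:
    """Toy noising of each factor with a process suited to its constraints."""
    # Fractional coordinates: Gaussian noise, wrapped mod 1 to respect periodicity.
    coords = (crystal.frac_coords + t * torch.randn_like(crystal.frac_coords)) % 1.0

    # Atom types: categorical corruption, here a LLaDA-like masking to an unknown class.
    masked = torch.rand_like(crystal.atom_types, dtype=torch.float) < t
    types = torch.where(masked, torch.full_like(crystal.atom_types, MASK_TYPE), crystal.atom_types)

    # Lattice: perturb a symmetrized matrix (a stand-in for honoring symmetry constraints).
    sym_lattice = 0.5 * (crystal.lattice + crystal.lattice.T)
    lattice = sym_lattice + t * torch.randn_like(sym_lattice)

    return Crystal(types, coords, lattice)
```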
What I found more exciting than anything is that LLaDA showed its diffusion architecture follows scaling laws similar to those of our widely accepted ARMs. This was an open question I had been trying to answer and an experiment I never quite got working. These scaling laws aren't directly actionable for us given prohibitive compute costs, but the knowledge may well be useful some day.