Thanks for sharing, Will. Your last paragraph hits. It's exciting because it means we can wield powerful generative AI in non-linear output domains. Paper for the curious:

This paper introduces LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data-masking process and a reverse process, parameterized by a vanilla Transformer that predicts masked tokens. https://arxiv.org/abs/2502.09992
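For anyone who wants the gist of the mechanism, here's a minimal PyTorch sketch of masked diffusion under my own assumptions, not the paper's code: `MASK_ID`, `VOCAB`, the greedy fill-in, and the uniform re-masking schedule are all placeholder choices for illustration.

```python
import torch

MASK_ID = 0   # hypothetical [MASK] token id (placeholder, not from the paper)
VOCAB = 1000  # hypothetical vocabulary size

def forward_mask(x0: torch.Tensor, t: float) -> torch.Tensor:
    """Forward process: independently replace each token with [MASK] w.p. t."""
    mask = torch.rand_like(x0, dtype=torch.float) < t
    return torch.where(mask, torch.full_like(x0, MASK_ID), x0)

@torch.no_grad()
def reverse_unmask(model, length: int, steps: int = 8) -> torch.Tensor:
    """Reverse process: start fully masked; at each step, predict every masked
    token, then re-mask a shrinking fraction so text is revealed gradually."""
    x = torch.full((1, length), MASK_ID, dtype=torch.long)
    for i in range(steps, 0, -1):
        logits = model(x)                         # (1, length, VOCAB) token logits
        pred = logits.argmax(dim=-1)              # greedy fill-in, for illustration
        x = torch.where(x == MASK_ID, pred, x)    # fill currently masked positions
        if i > 1:                                 # uniform re-masking here; a real
            x = forward_mask(x, (i - 1) / steps)  # sampler would pick what to re-mask
    return x

# Smoke test with a dummy "model" that returns random logits.
dummy = lambda ids: torch.randn(ids.shape[0], ids.shape[1], VOCAB)
print(reverse_unmask(dummy, length=12))
```

The point of the sketch: unlike left-to-right decoding, every position gets refined in parallel across steps, which is what makes the non-linear output domains above plausible.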