Simple and Effective Masked Diffusion Language Models

11 Jun 2024 | Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov
This paper revisits masked diffusion models for language modeling and shows that simple masked discrete diffusion is more effective than previously thought. The authors propose an effective training recipe and derive a simplified, Rao-Blackwellized objective that improves the performance of masked diffusion models. The objective is a weighted mixture of classical masked language modeling losses, so it can be used to train encoder-only language models that admit efficient samplers, including ones that generate text semi-autoregressively. On language modeling benchmarks, the proposed models achieve state-of-the-art performance among diffusion models and come within 15-25% of the perplexity of autoregressive (AR) models.

The framework also extends to non-language domains such as biological sequence modeling, where pre-trained DNA sequence models match or exceed the downstream performance of classical BERT-style training. The authors additionally provide a well-engineered implementation of masked diffusion models that significantly boosts performance even for methods previously considered to perform poorly.
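Concretely, the simplified objective amounts to a weighted masked-token cross-entropy. The snippet below is a minimal illustrative sketch of such a loss in PyTorch, assuming a linear noise schedule and a denoiser `model` that maps a partially masked sequence to per-token logits; `MASK_ID` and `masked_diffusion_loss` are hypothetical names used for the sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # hypothetical [MASK] token id; depends on the tokenizer


def masked_diffusion_loss(model, x, eps=1e-3):
    """Sketch of a weighted masked cross-entropy loss (linear schedule).

    Each token of x (batch, length) is masked independently with
    probability t; the denoiser predicts the clean tokens at the masked
    positions, and the cross-entropy is reweighted by 1 / t.
    """
    b, l = x.shape
    # Sample one diffusion time per sequence, bounded away from 0.
    t = eps + (1 - eps) * torch.rand(b, 1, device=x.device)

    # Forward (masking) process: mask each token with probability t.
    mask = torch.rand(b, l, device=x.device) < t
    z_t = torch.where(mask, torch.full_like(x, MASK_ID), x)

    # Denoiser predicts the original tokens from the partially masked input.
    logits = model(z_t)  # (b, l, vocab_size)
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (b, l)

    # Keep only masked positions and reweight by 1 / t.
    return ((ce * mask) / t).sum(dim=-1).mean()
```

In this form, each training step reduces to a standard masked-language-modeling pass whose loss is reweighted by the sampled masking rate, which is why standard encoder-only architectures can be reused with minimal changes.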