2025 | Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li
This paper introduces RADD, a reparameterized absorbing discrete diffusion model built on time-independent conditional probabilities. The key insight is that the concrete score in absorbing diffusion factorizes into conditional probabilities of the clean data multiplied by a time-dependent scalar (see the sketch below). Because the network itself no longer depends on time, its output can be cached across sampling steps, reducing the number of function evaluations (NFEs) and accelerating sampling. RADD also unifies absorbing discrete diffusion with any-order autoregressive models (AO-ARMs) by showing that their training objectives are equivalent: the upper bound on the negative log-likelihood of the diffusion model can be interpreted as an expected negative log-likelihood over AO-ARM factorization orders. Empirically, RADD variants trained under these equivalent objectives achieve state-of-the-art performance on five zero-shot language modeling benchmarks at the GPT-2 scale, outperforming existing models. The paper also situates RADD among related continuous-state and discrete-state diffusion models, highlighting its simpler parameterization, more efficient sampling, and stronger zero-shot language modeling performance.
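The factorization can be sketched as follows. The notation here is assumed from the paper's setup (sigma-bar is the cumulative noise, x_t^UM the unmasked tokens of the noisy sequence x_t, i a masked position, x-hat a candidate clean token); the exact statement should be checked against the paper.

```latex
% Sketch of RADD's key factorization (notation assumed from the paper):
% \bar{\sigma}(t) = \int_0^t \sigma(s)\,ds is the cumulative noise and
% x_t^{\mathrm{UM}} denotes the unmasked tokens of the noisy sequence x_t.
% For a masked position i and a candidate clean token \hat{x}:
\[
  \underbrace{\frac{p_t\bigl(x_t^{\,i\to\hat{x}}\bigr)}{p_t(x_t)}}_{\text{concrete score}}
  \;=\;
  \underbrace{\frac{e^{-\bar{\sigma}(t)}}{1-e^{-\bar{\sigma}(t)}}}_{\text{time-dependent scalar}}
  \cdot
  \underbrace{p_0\bigl(x^i=\hat{x}\mid x_t^{\mathrm{UM}}\bigr)}_{\text{time-independent conditional}}
\]
```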
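The caching argument follows directly: because the network sees only the partially masked sequence and never t, consecutive sampling steps that change no token can reuse the previous forward pass. Below is a minimal toy sketch of this idea (not the paper's actual sampler); `model` is a hypothetical time-independent network mapping a token sequence to per-position logits over clean tokens, and the unmasking rule is a stand-in for the true reverse dynamics.

```python
import torch

@torch.no_grad()
def sample_with_cache(model, length, steps, mask_id):
    """Toy absorbing-diffusion sampler illustrating the caching trick.

    `model(x)` is a hypothetical time-independent network returning logits of
    shape (length, vocab_size) for the clean-data conditionals. Since the
    network never takes t as input, its output is unchanged whenever the
    previous step unmasked no token, so the cached logits are reused and the
    number of function evaluations (NFEs) can drop below the step count.
    """
    x = torch.full((length,), mask_id, dtype=torch.long)
    cached_logits, nfes = None, 0
    for step in range(steps):
        if cached_logits is None:  # recompute only after x actually changed
            cached_logits = model(x)
            nfes += 1
        changed = False
        # Stand-in reverse step: each masked position is unmasked
        # independently with probability 1/steps.
        for i in torch.nonzero(x == mask_id).flatten().tolist():
            if torch.rand(()).item() < 1.0 / steps:
                probs = torch.softmax(cached_logits[i], dim=-1)
                x[i] = torch.multinomial(probs, 1).item()
                changed = True
        if changed:
            cached_logits = None  # invalidate: the conditionals changed
    print(f"steps: {steps}, NFEs: {nfes}")
    return x
```

With a dummy model such as `lambda x: torch.zeros(len(x), 50)` the sketch runs end to end; in the paper the time-dependent scalar is applied analytically outside the network, which is what makes the cached output valid across time steps.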