Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data


11 Feb 2025 | Jingyang Ou1,2, Shen Nie1,2, Kaiwen Xue1,2, Fengqi Zhu1,2, Jiacheng Sun3, Zhenguo Li3, Chongxuan Li1,2*
This paper introduces Reparameterized Absorbing Discrete Diffusion (RADD), a novel diffusion model that characterizes time-independent conditional probabilities. The key contribution is the revelation that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data multiplied by a time-dependent scalar. This insight leads to the development of RADD, which simplifies the model architecture and reduces the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged. RADD achieves state-of-the-art performance on five zero-shot language modeling benchmarks at the GPT-2 scale, demonstrating superior efficiency and effectiveness compared to existing models. The paper also unifies absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that their training objectives are equivalent. This unification provides a fresh perspective on the negative log-likelihood of absorbing discrete diffusion and offers alternative objective functions for training and likelihood evaluation.
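
A minimal sketch of the factorization behind the key claim, written in our own notation (the symbols alpha_t, sigma-bar(t), and x_t^UM are assumptions for illustration and may differ from the paper's):

```latex
% Absorbing forward process: each token survives with probability
% \alpha_t = e^{-\bar\sigma(t)}, otherwise it becomes the mask token [M].
% For a position i that is masked in x_t, the concrete score factorizes as
\[
  \frac{q_t(\hat{x}_t)}{q_t(x_t)}
  \;=\;
  \underbrace{\frac{e^{-\bar\sigma(t)}}{1 - e^{-\bar\sigma(t)}}}_{\text{time-dependent scalar}}
  \; p_0\!\left(\hat{x}_t^{\,i} \,\middle|\, x_t^{\mathrm{UM}}\right),
\]
% where \hat{x}_t equals x_t except that position i is unmasked, and
% x_t^{\mathrm{UM}} denotes the unmasked tokens of x_t. Only the
% time-independent conditional distribution p_0(\cdot \mid x_t^{\mathrm{UM}})
% needs to be learned by the network.
```

The following is a hypothetical sampling loop illustrating how a time-independent network enables caching. The model interface, schedule format, and reveal probability are assumptions made for the sketch, not the paper's actual implementation:

```python
import torch

MASK_ID = 0  # hypothetical id of the absorbing [M] token

@torch.no_grad()
def cached_sampler(model, x, schedule, mask_id=MASK_ID):
    """Illustrative reverse sampler for an absorbing discrete diffusion.

    Assumptions: `model(x)` returns logits of shape (batch, length, vocab),
    interpreted as conditional distributions of clean data given the unmasked
    tokens of x; `schedule` is a sequence of (alpha_t, alpha_s) pairs with
    alpha_s > alpha_t, ordered from high to low noise. Because the network does
    not take t as input, its output depends only on x and can be reused across
    steps in which no token was unmasked, reducing the number of function
    evaluations (a time-dependent network would have to be re-evaluated).
    """
    cached_x, cached_probs = None, None
    for alpha_t, alpha_s in schedule:
        still_masked = x == mask_id
        if not still_masked.any():
            break
        # Recompute the network output only when the noisy sample changed.
        if cached_x is None or not torch.equal(cached_x, x):
            cached_probs = torch.softmax(model(x), dim=-1)
            cached_x = x.clone()
        # Probability that a currently masked token is revealed in this step.
        reveal_prob = (alpha_s - alpha_t) / (1.0 - alpha_t)
        reveal = still_masked & (torch.rand_like(x, dtype=torch.float) < reveal_prob)
        if reveal.any():
            sampled = torch.distributions.Categorical(probs=cached_probs).sample()
            x = torch.where(reveal, sampled, x)
    return x
```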