10 Dec 2024 | Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
This paper introduces Diffusion Forcing, a novel training paradigm for sequence generative models. It combines the strengths of next-token prediction models, which are effective for variable-length generation, with those of full-sequence diffusion models, which excel at guiding sampling toward desirable trajectories. The key innovation is to train a model to denoise tokens with independent, per-token noise levels, enabling flexible, causal sequence generation. The authors implement this approach as Causal Diffusion Forcing (CDF), which can generate variable-length sequences and perform long-horizon guidance. CDF is evaluated on video generation, planning, and imitation learning tasks, demonstrating its ability to stabilize long-horizon rollouts, keep the future uncertain, and provide effective guidance. The method is formally shown to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.
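To make the core idea concrete, here is a minimal sketch of a Diffusion Forcing training step: each token in a sequence is noised to its own independently sampled noise level, and a causal model is trained to predict the per-token noise. This is an illustrative reconstruction, not the authors' implementation; the `CausalDenoiser` module, `diffusion_forcing_step` function, and the simple GRU backbone are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class CausalDenoiser(nn.Module):
    """Toy causal denoiser: each position attends only to the past
    (a unidirectional GRU here) plus its own noise-level embedding."""

    def __init__(self, token_dim: int, hidden_dim: int = 128, num_levels: int = 1000):
        super().__init__()
        self.level_emb = nn.Embedding(num_levels, hidden_dim)
        self.in_proj = nn.Linear(token_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # causal by construction
        self.out_proj = nn.Linear(hidden_dim, token_dim)

    def forward(self, noisy_tokens, noise_levels):
        h = self.in_proj(noisy_tokens) + self.level_emb(noise_levels)
        h, _ = self.rnn(h)
        return self.out_proj(h)  # predicted noise for every token


def diffusion_forcing_step(model, x0, alphas_cumprod, optimizer):
    """One training step: sample an independent noise level per token,
    diffuse each token to its own level, and regress the added noise."""
    B, T, D = x0.shape
    K = alphas_cumprod.shape[0]
    k = torch.randint(0, K, (B, T), device=x0.device)       # independent level per token
    a_bar = alphas_cumprod[k].unsqueeze(-1)                  # (B, T, 1)
    eps = torch.randn_like(x0)
    x_k = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # per-token forward diffusion
    pred = model(x_k, k)
    loss = nn.functional.mse_loss(pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    K = 1000
    betas = torch.linspace(1e-4, 0.02, K)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    model = CausalDenoiser(token_dim=16)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x0 = torch.randn(8, 32, 16)  # stand-in batch of continuous token sequences
    print(diffusion_forcing_step(model, x0, alphas_cumprod, opt))
```

The only departure from standard diffusion training is the shape of `k`: instead of one noise level per sequence, each token gets its own, so at sampling time the model can keep near-future tokens nearly clean while the far future stays fully noised, which is what enables variable-length rollouts and long-horizon guidance.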