Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

2024 | Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
Diffusion Forcing is a training paradigm in which a diffusion model is trained to denoise a set of tokens with independent per-token noise levels, combining the strengths of next-token prediction models and full-sequence diffusion models. It enables variable-length generation and the ability to guide sampling toward desirable trajectories. The method can generate sequences of continuous tokens beyond the training horizon, where baselines diverge, and introduces new sampling and guidance schemes that yield performance gains in decision-making and planning tasks. Training is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.

The approach is instantiated as Causal Diffusion Forcing (CDF), which uses a causal architecture to generate variable-length sequences and enables Monte Carlo Guidance (MCG), improving the sampling of high-reward generations relative to non-causal full-sequence diffusion models. The method is evaluated across diverse domains, including video generation, model-based planning, visual imitation learning, and time-series prediction, and the paper provides a detailed comparison with related work. Results show that Diffusion Forcing outperforms baselines in video prediction, planning, and imitation learning, and that it is effective in real-world robotics applications such as long-horizon imitation learning and robust visuomotor control. The paper concludes that Diffusion Forcing is a promising approach for sequence generation and decision-making tasks.
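The core training idea, denoising with an independently sampled noise level per token rather than one level for the whole sequence, can be sketched as a toy training step. The backbone, noise schedule, and shapes below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

T, D, K = 8, 16, 1000          # sequence length, token dim, diffusion steps

# Standard DDPM-style cosine-free linear schedule (an assumption here).
betas = torch.linspace(1e-4, 0.02, K)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# A causal backbone stands in for the paper's architecture; a GRU is
# causal by construction. The extra input channel carries the noise level.
model = nn.GRU(D + 1, 32, batch_first=True)
head = nn.Linear(32, D)        # predicts the noise added to each token

def diffusion_forcing_step(x0):
    """x0: (B, T, D) clean token sequence -> scalar denoising loss."""
    B = x0.shape[0]
    # Key difference from full-sequence diffusion: one noise level k per
    # token, sampled independently, instead of one level per sequence.
    k = torch.randint(0, K, (B, T))
    a = alpha_bars[k].unsqueeze(-1)                  # (B, T, 1)
    eps = torch.randn_like(x0)
    x_noisy = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
    # Condition each position on its own (normalized) noise level.
    inp = torch.cat([x_noisy, k.unsqueeze(-1).float() / K], dim=-1)
    h, _ = model(inp)
    return ((head(h) - eps) ** 2).mean()

loss = diffusion_forcing_step(torch.randn(4, T, D))
loss.backward()
```

At sampling time, the same per-token conditioning is what allows tokens later in the sequence to stay noisier than earlier ones, which underlies the variable-length rollouts and guidance schemes described above.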