This paper introduces VADER, a method for aligning video diffusion models using reward gradients. The goal is to adapt pre-trained video diffusion models to specific tasks without requiring large, manually curated datasets. VADER leverages reward models, which are trained on top of powerful vision discriminative models, to provide dense gradient information that enables efficient learning in complex search spaces like videos. By backpropagating gradients from these reward models to the video diffusion model, VADER achieves more efficient learning in terms of reward queries and computation compared to prior gradient-free approaches.
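To make the core idea concrete, here is a minimal sketch of such a reward-gradient update (our illustration, not the paper's code; `sample_initial_latents`, `denoise_step`, `decode`, and the reward interface are assumed names):

```python
import torch

def align_step(diffusion_model, reward_model, prompts, optimizer,
               num_steps=25, backprop_last_k=2):
    """One illustrative reward-gradient update: sample a video with the
    diffusion model, score it with a differentiable reward model, and
    backpropagate the reward into the model's trainable parameters.
    All model interfaces here are assumed for illustration."""
    # Run the denoising chain, tracking gradients only for the last few
    # steps (truncated backpropagation) to keep the graph small.
    latents = diffusion_model.sample_initial_latents(prompts)
    for t in reversed(range(num_steps)):
        with torch.set_grad_enabled(t < backprop_last_k):
            latents = diffusion_model.denoise_step(latents, t, prompts)

    video = diffusion_model.decode(latents)       # (B, T, C, H, W) pixels
    reward = reward_model(video, prompts).mean()  # scalar, differentiable

    loss = -reward            # gradient ascent on the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```

Because the reward is differentiable with respect to the generated pixels, every update receives dense feedback rather than a single scalar score per sample.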
The paper demonstrates that as the complexity of generation increases from images to videos, the gap between reward-gradient and policy-gradient approaches widens: reward gradients backpropagate dense, per-pixel and per-frame feedback into the model, whereas policy gradients provide only a scalar reward per generated sample. VADER is shown to significantly improve upon base model generations across various tasks and to outperform alignment methods that do not utilize reward gradients, such as DPO and DDPO. It also generalizes well to prompts not seen during training.
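Schematically (our notation, not the paper's exact equations), the two estimators can be contrasted as follows, where \(x_0\) is the generated video, \(c\) the conditioning prompt, and \(R\) the reward model:

```latex
% Policy gradient (e.g. DDPO): the reward enters only as a scalar weight
% on the log-likelihood gradient of the sampled denoising trajectory.
\nabla_\theta J_{\mathrm{PG}}
  = \mathbb{E}_{x_{0:T} \sim p_\theta(\cdot \mid c)}
    \big[ R(x_0, c)\, \nabla_\theta \log p_\theta(x_{0:T} \mid c) \big]

% Reward gradient (VADER-style): the reward itself is differentiated and
% backpropagated through the denoising chain into the parameters.
\nabla_\theta J_{\mathrm{RG}}
  = \mathbb{E}_{\epsilon,\, c}
    \Big[ \nabla_{x_0} R(x_0, c) \cdot
          \tfrac{\partial x_0(\theta, c, \epsilon)}{\partial \theta} \Big]
```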
VADER is flexible and can be applied to both text-to-video and image-to-video diffusion models. It supports a variety of reward functions, including image-text similarity, aesthetic scores, object removal, video action classification, and temporal consistency. The method can be trained on a single GPU with 16 GB of VRAM by combining LoRA, mixed precision, gradient checkpointing, and truncated backpropagation through the denoising chain to reduce memory overhead.
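A minimal sketch of how these memory-saving techniques could be wired together, assuming a diffusers-style UNet and the `peft` library for LoRA; module names and hyperparameters are illustrative, not the paper's configuration. Truncated backpropagation itself happens in the sampling loop, as in the earlier sketch.

```python
import torch
from peft import LoraConfig, get_peft_model  # LoRA adapters (assumed dependency)

def build_memory_efficient_trainer(diffusion_model, lr=1e-4):
    """Illustrative setup combining the memory-saving techniques above:
    gradient checkpointing, LoRA, and mixed precision."""
    unet = diffusion_model.unet  # assumed diffusers-style UNet attribute

    # Gradient checkpointing: recompute activations during the backward
    # pass instead of storing them, trading compute for memory.
    unet.enable_gradient_checkpointing()

    # LoRA: train low-rank adapters on the attention projections instead
    # of the full UNet weights (module names are illustrative).
    lora_cfg = LoraConfig(r=16, lora_alpha=16,
                          target_modules=["to_q", "to_k", "to_v", "to_out.0"])
    diffusion_model.unet = get_peft_model(unet, lora_cfg)

    # Mixed precision: run forward/backward in half precision with a loss
    # scaler, keeping optimizer states in full precision.
    scaler = torch.cuda.amp.GradScaler()
    optimizer = torch.optim.AdamW(
        (p for p in diffusion_model.unet.parameters() if p.requires_grad), lr=lr)
    return optimizer, scaler
```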
The results show that VADER is more sample- and compute-efficient than alternative alignment approaches such as DDPO and DPO. It also demonstrates strong generalization and performs well in human evaluations. Qualitative results show that VADER produces high-quality video content that matches the text prompts and improves upon the base model across tasks. The method is also effective at improving the temporal and spatial consistency of generated videos, particularly for longer videos.