24 Jun 2024 | Zhenxiong Tan*, Xingyi Yang*, Songhua Liu, Xinchao Wang†
Video-Infinity is a distributed inference pipeline designed to generate long-form videos using multiple GPUs. The primary obstacles to producing longer videos on a single GPU are excessive memory requirements and long processing times. To address these issues, Video-Infinity introduces two key mechanisms: *Clip parallelism* and *Dual-scope attention*.
1. **Clip Parallelism**: This mechanism splits the video latent into smaller clips and distributes them across multiple devices, while sharing the necessary context information between GPUs. An interleaved communication strategy keeps the communication overhead low, so all devices can collaborate effectively (a minimal sketch of the idea follows this list).
2. **Dual-scope Attention**: This mechanism modulates temporal self-attention to balance local and global contexts efficiently across devices. Each frame attends both to nearby frames (local context) and to frames sampled from across the entire video (global context), which improves the temporal coherence of the generated video (see the second sketch below).
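To make the clip-parallelism idea concrete, here is a minimal sketch using PyTorch's distributed primitives. It is not the paper's actual implementation: the helper names (`split_latent_into_clips`, `exchange_boundary_frames`), the halo size, and the choice of point-to-point communication are all assumptions made for illustration.

```python
# Sketch: split a video latent along time and exchange a few boundary
# frames with neighboring ranks so each clip sees temporal context.
import torch
import torch.distributed as dist

def split_latent_into_clips(latent: torch.Tensor, world_size: int):
    """Split a video latent of shape (frames, C, H, W) along the time axis."""
    return list(torch.chunk(latent, world_size, dim=0))

def exchange_boundary_frames(clip: torch.Tensor, rank: int, world_size: int,
                             halo: int = 2):
    """Hypothetical helper: swap `halo` boundary frames with neighboring ranks."""
    left, right = None, None
    ops = []
    if rank > 0:  # exchange with the previous clip
        left = torch.empty_like(clip[:halo])
        ops.append(dist.P2POp(dist.isend, clip[:halo].contiguous(), rank - 1))
        ops.append(dist.P2POp(dist.irecv, left, rank - 1))
    if rank < world_size - 1:  # exchange with the next clip
        right = torch.empty_like(clip[-halo:])
        ops.append(dist.P2POp(dist.isend, clip[-halo:].contiguous(), rank + 1))
        ops.append(dist.P2POp(dist.irecv, right, rank + 1))
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()
    return left, right  # neighboring context consumed by temporal layers
```

In this sketch each rank only communicates with its immediate neighbors, which is one plausible way to keep the per-step communication volume small as the number of GPUs grows.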
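The second sketch illustrates the dual-scope idea in temporal self-attention: keys and values are gathered from a local window around each frame plus a strided, video-wide subsample. The window and stride values, and the exact way the two scopes are combined, are assumptions rather than the paper's specification.

```python
# Sketch: temporal attention over a local window plus globally strided frames.
import torch
import torch.nn.functional as F

def dual_scope_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         window: int = 8, stride: int = 16):
    """q, k, v: (frames, heads, dim) temporal tokens for one spatial location."""
    frames = q.shape[0]
    global_idx = torch.arange(0, frames, stride)           # global context
    outputs = []
    for t in range(frames):
        lo, hi = max(0, t - window), min(frames, t + window + 1)
        local_idx = torch.arange(lo, hi)                    # local context
        idx = torch.unique(torch.cat([local_idx, global_idx]))
        # frame t attends to both nearby and globally sampled frames
        attn = F.scaled_dot_product_attention(
            q[t:t + 1].transpose(0, 1),                     # (heads, 1, dim)
            k[idx].transpose(0, 1),                         # (heads, n, dim)
            v[idx].transpose(0, 1),
        )
        outputs.append(attn.transpose(0, 1))                # (1, heads, dim)
    return torch.cat(outputs, dim=0)                        # (frames, heads, dim)
```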
By leveraging these mechanisms, Video-Infinity can generate videos up to 2,300 frames in length in approximately 5 minutes, significantly outperforming existing methods in terms of speed and video quality. The method is evaluated using the VBench tool, which measures various video quality metrics, and shows superior performance compared to baselines such as FreeNoise and Streaming T2V.
The paper also includes ablation studies to demonstrate the effectiveness of the proposed mechanisms and discusses the limitations of the approach, particularly the reliance on multiple GPUs and the handling of scene transitions. Overall, Video-Infinity sets a new benchmark for efficient and high-quality long-form video generation.