25 May 2024 | Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang
TALC is a framework for multi-scene text-to-video (T2V) generation that aligns each video scene with its corresponding text description. It augments the text-conditioning mechanism of T2V models to recognize the temporal alignment between scenes and descriptions, enabling the model to generate multi-scene videos that adhere to the text while remaining visually consistent. On multi-scene video-text data, TALC outperforms baseline methods by 15.5 points on the aggregate score, with higher text adherence and visual consistency as judged by human evaluation.

The framework is trained on synthetic multi-scene video-text data constructed from real-world videos, and is applied by fine-tuning pre-trained T2V models such as ModelScope and Lumiere. It is evaluated on multi-scene generation tasks covering a single character across multiple visual contexts, different characters within a single visual context, and multi-scene captions derived from real videos. Across these tasks, TALC achieves higher visual consistency and text adherence than the baselines, demonstrating its effectiveness at generating high-quality multi-scene videos. Because it only modifies the text-conditioning, the framework can be adapted to any diffusion-based T2V model.
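To make the conditioning idea concrete, here is a minimal sketch of scene-wise text conditioning: each temporal chunk of the video latent attends only to the embedding of its own scene caption, rather than every frame attending to one merged caption. This is an illustrative toy, not the paper's implementation; the function name `talc_style_conditioning`, the single-head attention, and the even per-scene frame split are all assumptions for the example. In a real diffusion backbone this would live inside the cross-attention layers of the denoising network.

```python
import torch

def talc_style_conditioning(frame_latents, scene_text_embeds):
    """Condition each scene's frames on that scene's caption embedding.

    frame_latents:     (batch, num_frames, dim) video latents
    scene_text_embeds: list of (batch, seq_len, dim) caption embeddings,
                       one entry per scene (hypothetical shapes)
    """
    num_scenes = len(scene_text_embeds)
    # Assumption: frames are split evenly across scenes.
    chunks = torch.chunk(frame_latents, num_scenes, dim=1)
    conditioned = []
    for chunk, text_embed in zip(chunks, scene_text_embeds):
        # Toy cross-attention: frames (queries) attend to caption tokens
        # (keys/values) of their own scene only.
        attn = torch.softmax(
            chunk @ text_embed.transpose(1, 2) / chunk.shape[-1] ** 0.5,
            dim=-1,
        )
        conditioned.append(chunk + attn @ text_embed)  # residual update
    # Reassemble the full video latent; temporal self-attention elsewhere
    # in the backbone is what keeps the scenes visually consistent.
    return torch.cat(conditioned, dim=1)

# Toy usage: 2 scenes, 16 frames, embedding dim 64.
latents = torch.randn(1, 16, 64)
captions = [torch.randn(1, 8, 64), torch.randn(1, 8, 64)]
out = talc_style_conditioning(latents, captions)
print(out.shape)  # torch.Size([1, 16, 64])
```

The key design point the sketch captures is that conditioning is local in time: swapping per-scene attention for a single concatenated caption would lose the scene-to-description alignment that TALC is built around.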