TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation


25 May 2024 | Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang
The paper introduces TALC (Time-Aligned Captions), a framework designed to enhance the generation of multi-scene videos from text prompts using pre-trained text-to-video (T2V) models. Traditional T2V models typically produce single-scene videos, which fall short of real-world applications that require multi-scene videos. TALC addresses this by modifying the text-conditioning mechanism of T2V models so that it captures the temporal alignment between video scenes and their scene descriptions. This alignment ensures that the generated videos adhere to the multi-scene text descriptions while maintaining visual consistency across scenes. The authors evaluate TALC on two T2V models, ModelScope and Lumiere, and compare it with baseline methods that either merge all scene descriptions into a single prompt or generate a video for each scene independently. TALC outperforms these baselines by 15.5 points on the aggregated score, which averages visual consistency and text adherence as judged by human evaluation. TALC-finetuned models also achieve higher text adherence and visual consistency in multi-scene settings. The paper further discusses the limitations of the current approach, such as the need for more diverse and extensive multi-scene video-text data, and suggests future directions for improving the model's performance and addressing societal biases in generated content.
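The core idea can be illustrated with a minimal sketch (not the authors' implementation): in a TALC-style setup, the text conditioning fed to the video model is built per frame, so frames belonging to a given scene are conditioned only on that scene's description, whereas the merged-captions baseline conditions every frame on one concatenated prompt. The `encode_text` stub, embedding dimension, and frame counts below are illustrative assumptions standing in for the T2V model's actual text encoder and cross-attention conditioning.

```python
import numpy as np

def encode_text(caption: str, dim: int = 8) -> np.ndarray:
    # Placeholder for the T2V model's frozen text encoder (e.g., a T5/CLIP encoder);
    # here a deterministic pseudo-embedding is derived from the caption bytes.
    rng = np.random.default_rng(sum(caption.encode()))
    return rng.standard_normal(dim)

def talc_conditioning(scene_captions, frames_per_scene, dim=8):
    # Time-aligned conditioning: frames belonging to scene k are conditioned
    # only on the embedding of scene description k.
    per_frame = []
    for caption, n_frames in zip(scene_captions, frames_per_scene):
        emb = encode_text(caption, dim)
        per_frame.extend([emb] * n_frames)
    return np.stack(per_frame)  # shape: (total_frames, dim)

def merged_conditioning(scene_captions, total_frames, dim=8):
    # Baseline: all scene descriptions are merged into one prompt and every
    # frame is conditioned on the same embedding.
    emb = encode_text(" ".join(scene_captions), dim)
    return np.tile(emb, (total_frames, 1))

captions = ["A red panda climbs a tree.", "The red panda naps on a branch."]
frames_per_scene = [8, 8]

talc = talc_conditioning(captions, frames_per_scene)
merged = merged_conditioning(captions, sum(frames_per_scene))
print(talc.shape, merged.shape)            # (16, 8) (16, 8)
print(np.allclose(talc[0], talc[-1]))      # False: each scene gets distinct conditioning
print(np.allclose(merged[0], merged[-1]))  # True: every frame sees the same prompt
```

In the actual framework this scene-indexed conditioning would enter the diffusion model's cross-attention rather than a simple per-frame lookup table, but the sketch shows the key difference from the merged-captions baseline: conditioning varies with time across scene boundaries.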