15 Jan 2025 | Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
T2V-CompBench is a comprehensive benchmark for compositional text-to-video generation, designed to evaluate models' ability to compose complex scenes with multiple objects, attributes, actions, and motions. The benchmark covers seven categories: consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. Each category contains 200 text prompts, for 1,400 prompts in total. The evaluation metrics are designed to reflect the quality of compositional text-to-video generation and include MLLM-based, detection-based, and tracking-based metrics. A range of text-to-video models are benchmarked, and the results show that current models still struggle with compositional text-to-video generation. The study analyzes the strengths and weaknesses of the evaluated models and highlights the need for further research in this area.
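To make the detection-based evaluation idea concrete, here is a minimal hypothetical sketch (not the paper's actual metric): given per-frame bounding boxes from an object detector for two entities, score a "left of" spatial relationship as the fraction of frames in which object A's center lies left of object B's. The function names and box format are illustrative assumptions.

```python
# Hypothetical detection-based spatial-relationship score (illustrative only,
# not T2V-CompBench's implementation). Boxes are (x1, y1, x2, y2) tuples.

def center_x(box):
    """Horizontal center of a bounding box."""
    x1, _, x2, _ = box
    return (x1 + x2) / 2

def left_of_score(frames_a, frames_b):
    """Fraction of frames where object A's center is left of object B's.

    frames_a, frames_b: per-frame bounding boxes for objects A and B.
    """
    hits = sum(center_x(a) < center_x(b) for a, b in zip(frames_a, frames_b))
    return hits / len(frames_a)

# A stays left of B in 2 of 3 frames, so the score is 2/3.
a = [(0, 0, 10, 10), (5, 0, 15, 10), (40, 0, 50, 10)]
b = [(20, 0, 30, 10), (20, 0, 30, 10), (20, 0, 30, 10)]
print(left_of_score(a, b))
```

Averaging such per-frame checks over a video is one simple way a detection-based metric can turn detector outputs into a compositional score; the real benchmark's metrics are more involved.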