T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation

15 Jan 2025 | Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
This paper introduces T2V-CompBench, a comprehensive benchmark for compositional text-to-video (T2V) generation. The benchmark covers seven categories: consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We propose three types of evaluation metrics (MLLM-based, detection-based, and tracking-based), designed to reflect the compositional generation quality of the seven categories across 1,400 text prompts. The effectiveness of the proposed metrics is verified by their correlation with human evaluations. We also benchmark various text-to-video generative models and conduct an in-depth analysis across models and compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope this attempt sheds light on future research in this direction.

T2V-CompBench is designed to evaluate the ability of text-to-video generation models to compose multiple objects, attributes, actions, and motions into a video. The benchmark emphasizes compositionality through multiple objects with attributes, quantities, actions, interactions, and spatio-temporal dynamics. The prompt suite comprises seven categories, each consisting of 200 text prompts for video generation. When constructing the prompts, we emphasize temporal dynamics and guarantee that each prompt contains at least one active verb. The seven categories are as follows (examples are illustrated in Figure 2), with a sketch of one possible on-disk organization after the list:

1) Consistent attribute binding. Prompts feature two objects, each with a distinct attribute; the attribute associated with each object remains consistent throughout the video.
2) Dynamic attribute binding. Prompts focus on attributes that change over time.
3) Spatial relationships. Each prompt mentions two objects and specifies the spatial relationship between them.
4) Motion binding. Each prompt includes one or two objects, with a moving direction specified for each object.
5) Action binding. Prompts describe two objects, each performing a distinct action.
6) Object interactions. Prompts test the models' ability to understand and generate dynamic interactions between multiple objects, including physical and social interactions.
7) Generative numeracy. Prompts include one or two object classes with quantities ranging from one to eight.
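The paper does not specify a file format for the prompt suite; the following is a minimal sketch, assuming one plain-text file of 200 prompts per category. The directory layout and file names are assumptions for illustration, not the benchmark's actual release format.

```python
from pathlib import Path

# Hypothetical layout: one prompt file per T2V-CompBench category.
# The real repository may organize the 1,400 prompts differently.
CATEGORIES = [
    "consistent_attribute_binding",
    "dynamic_attribute_binding",
    "spatial_relationships",
    "motion_binding",
    "action_binding",
    "object_interactions",
    "generative_numeracy",
]

def load_prompt_suite(root: str) -> dict[str, list[str]]:
    """Load 200 prompts per category (1,400 total) from text files."""
    suite = {}
    for cat in CATEGORIES:
        lines = Path(root, f"{cat}.txt").read_text(encoding="utf-8").splitlines()
        prompts = [p.strip() for p in lines if p.strip()]
        assert len(prompts) == 200, f"{cat}: expected 200 prompts, got {len(prompts)}"
        suite[cat] = prompts
    return suite
```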
Another challenge lies in the evaluation of compositional T2V generation. Commonly used metrics, such as Inception Score, Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and CLIPScore, cannot fully reflect the compositionality of T2V generation models. Evaluating the compositionality of T2V models requires a fine-grained understanding not only of the objects and attributes in each frame but also of the dynamics and motions across frames. This motivates the category-specific MLLM-based, detection-based, and tracking-based metrics proposed above.
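As one concrete illustration of the detection-based direction, here is a minimal sketch of a spatial-relationship score: detect the two mentioned objects in each sampled frame, compare bounding-box centers, and average over frames. The center-comparison rule and the frame-level averaging are assumptions made for illustration; the paper's exact detection-based protocol may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def cx(self) -> float:
        return (self.x1 + self.x2) / 2

    @property
    def cy(self) -> float:
        return (self.y1 + self.y2) / 2

def spatial_score(boxes_a: list[Optional[Box]],
                  boxes_b: list[Optional[Box]],
                  relation: str) -> float:
    """Fraction of frames where object A is in the stated relation to object B.

    boxes_a / boxes_b: per-frame detections from an off-the-shelf detector
    (None when the object is not detected in that frame).
    relation: one of "left of", "right of", "above", "below".
    """
    checks = {
        "left of":  lambda a, b: a.cx < b.cx,
        "right of": lambda a, b: a.cx > b.cx,
        "above":    lambda a, b: a.cy < b.cy,  # image y grows downward
        "below":    lambda a, b: a.cy > b.cy,
    }
    ok = total = 0
    for a, b in zip(boxes_a, boxes_b):
        total += 1
        if a is not None and b is not None and checks[relation](a, b):
            ok += 1
    return ok / total if total else 0.0
```

A tracking-based metric for motion binding could follow a similar pattern, for instance by comparing an object's box center in early versus late frames against the direction named in the prompt, though that is a guess at the mechanism rather than the paper's stated procedure.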