Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

1 Jul 2024 | Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang
This paper proposes DEVIL, a new evaluation protocol for text-to-video (T2V) generation models that focuses on the dynamics dimension. Existing evaluation protocols primarily measure temporal consistency and content continuity but neglect the dynamics of video content, which are essential for gauging the visual vividness of generated videos and their fidelity to the text prompts.

DEVIL evaluates T2V models with three metrics: dynamics range, dynamics controllability, and dynamics-based quality. The protocol comprises a new benchmark of text prompts reflecting multiple dynamics grades, together with a set of dynamics scores computed at various temporal granularities; the benchmark and scores are combined to assess models on the three metrics. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. The results also reveal that existing models tend to generate low-dynamic videos in order to achieve higher scores on conventional metrics, indicating a need for better dynamics control in T2V generation. The paper further introduces a naturalness metric based on a multimodal large language model, discusses the limitations of existing datasets and training methods, and suggests that more elaborate training data and better methods would improve T2V performance on both quality and dynamics scores. By considering dynamics, an essential dimension of video generation, the protocol provides a more comprehensive evaluation of T2V models.
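The headline validation is a Pearson correlation between the protocol's metric scores and human ratings. As a minimal sketch of how such a correlation is computed, the snippet below implements the Pearson coefficient from its definition; the per-model scores are purely illustrative placeholders, not values from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model scores: an automatic metric vs. the mean human rating.
# These numbers are invented for illustration only.
metric_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
human_ratings = [2.1, 2.9, 3.2, 3.8, 4.4]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A correlation above 0.90, as the paper reports for DEVIL, would indicate that the automatic metric ranks models nearly the same way human annotators do.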