Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

1 Jul 2024 | Mingxiang Liao1*, Hannan Lu2*, Xinyu Zhang3,4*, Fang Wan1, Tianyu Wang1, Yuzhong Zhao1, Wangmeng Zuo2, Qixiang Ye1†, Jingdong Wang4
The paper introduces a novel evaluation protocol for text-to-video (T2V) generation models, named DEVIL, which focuses on the dynamics dimension to comprehensively assess the quality and realism of generated videos. DEVIL establishes a new benchmark with text prompts categorized by multiple dynamics grades and defines a set of dynamics scores at different temporal granularities. The protocol includes three main metrics: *dynamics range*, *dynamics controllability*, and *dynamics-based quality*. The dynamics range measures the extent of variations in video content, dynamics controllability assesses the model's ability to manipulate video dynamics, and dynamics-based quality evaluates the visual quality of videos with varying dynamics. Experiments show that DEVIL achieves a Pearson correlation of over 90% with human ratings, demonstrating its effectiveness in advancing T2V generation models. The paper also highlights the limitations of existing datasets and methods in generating high-dynamic videos, suggesting that more elaborate training data and methods are needed to improve T2V performance.
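The reported agreement with human judgment is a standard Pearson correlation between per-model metric scores and human ratings. As a minimal sketch of how such an agreement number is computed (the scores below are made-up illustrations, not the paper's data):

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym)))

# Hypothetical example: one DEVIL metric score per T2V model,
# paired with a mean human rating for the same models.
devil_scores  = [0.61, 0.72, 0.55, 0.80, 0.68]
human_ratings = [3.1, 3.8, 2.9, 4.2, 3.5]
print(f"Pearson r = {pearson_corr(devil_scores, human_ratings):.3f}")
```

A correlation above 0.9 on such paired scores is what the paper reports as evidence that the DEVIL metrics track human preferences.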