Towards A Better Metric for Text-to-Video Generation

15 Jan 2024 | Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
The paper introduces a novel evaluation metric, Text-to-Video Score (T2VScore), for assessing the quality of text-to-video generation. T2VScore integrates two criteria: Text-Video Alignment and Video Quality. The alignment metric, T2VScore-A, measures how well a video matches its text prompt through visual question answering (VQA) with large language models (LLMs). The quality metric, T2VScore-Q, assesses the overall production quality of the video using a combination of structural and training strategies. To validate both metrics, the authors present the Text-to-Video Generation Evaluation (TVGE) dataset, which collects human judgments on 2,543 videos produced by text-to-video models.
Experiments on the TVGE dataset show that T2VScore correlates with human judgment better than existing metrics do. The paper also discusses the limitations of current metrics and the challenges of evaluating video content, particularly in the temporal domain. Together, the proposed metrics and dataset aim to provide a more comprehensive and reliable means of assessing text-to-video generation.
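To make "correlation with human judgment" concrete, here is a minimal sketch of how an automatic metric might be compared against human ratings using Spearman rank correlation. It is an illustration only and not the authors' evaluation code: the summary does not say which correlation coefficients the paper reports, and the names `human`, `metric_a`, and `metric_b`, along with all score values, are made up for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human rating and two metric scores per video.
human = [4.5, 2.0, 3.5, 1.0, 5.0]
metric_a = [0.81, 0.40, 0.66, 0.35, 0.90]  # orders the videos exactly as humans do
metric_b = [0.50, 0.70, 0.20, 0.60, 0.55]  # largely disagrees with the human ordering

print(spearman(human, metric_a))
print(spearman(human, metric_b))
```

With these made-up numbers, `metric_a` ranks the five videos in the same order as the human raters and `metric_b` does not, so `metric_a` receives the higher correlation. A claim that one metric "outperforms" another is a statement of this kind, made over a full set of human-rated videos such as TVGE.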