15 Jan 2024 | Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
This paper introduces T2VScore, an evaluation metric for text-to-video generation that aims to be more comprehensive than existing metrics. Commonly used metrics such as FVD, IS, and CLIP Score are insufficient for evaluating generated video because they focus on static analysis and lack temporal assessment. T2VScore evaluates two key aspects: text-video alignment and video quality. Text-video alignment measures how well the video matches the text description, while video quality assesses the overall production quality using a combination of experts. The metric is validated on the TVGE dataset, which collects human judgments on 2,543 text-to-video generated videos. Experiments on TVGE indicate that T2VScore evaluates text-to-video generation more accurately and reliably than existing metrics. The code and dataset are open-sourced to facilitate further research and improvements. The paper also reviews related work on text-to-video generation and evaluation metrics, highlighting the challenges and limitations of current approaches.
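To make the idea of validating a metric against human judgments concrete, the following is a minimal sketch of one common way to do it: computing the Spearman rank correlation between an automatic metric's scores and human ratings for the same videos. This summary does not state which agreement measure the paper uses, so the choice of Spearman correlation, the function names `average_ranks` and `spearman`, and the example numbers are all assumptions made for illustration. They are not the authors' implementation and not real TVGE data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four generated videos (illustrative values only).
metric_scores = [0.10, 0.40, 0.35, 0.80]  # automatic metric output
human_scores = [1, 3, 2, 5]               # human ratings of the same videos
agreement = spearman(metric_scores, human_scores)
```

Under this sketch, a metric whose ranking of videos matches the human ranking yields a correlation near 1, while a metric that ranks videos in the opposite order yields a value near -1. Comparing this number across metrics is one way to argue that a metric tracks human judgment more closely than another.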