2024 | Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu
This paper addresses the lack of a quantitative method for evaluating the quality of videos produced by text-to-video (T2V) generation models. To tackle this issue, the authors establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB), containing 10,000 videos generated by 9 different T2V models, along with corresponding Mean Opinion Scores (MOS) collected from 27 subjects. Based on this dataset, they propose a novel transformer-based model, T2VQA, which evaluates video quality from two perspectives: text-video alignment and video fidelity. The model uses a large language model (LLM) to regress the final prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and state-of-the-art video quality assessment models, validating its effectiveness in measuring the perceptual quality of text-generated videos. The dataset and code are available at https://github.com/QMME/T2VQA.
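The summary describes the architecture only at a high level, so the sketch below is a rough, hypothetical illustration (not the authors' implementation) of how a two-branch quality model with an alignment branch, a fidelity branch, and a regression head over the fused features might be wired up in PyTorch. All module names, feature dimensions, and layer choices here are assumptions made for illustration; the actual T2VQA model relies on pretrained vision-language and video backbones and an LLM-based regressor.

```python
import torch
import torch.nn as nn


class TextVideoQualityModel(nn.Module):
    """Hypothetical two-branch quality model: one branch scores text-video
    alignment, the other scores video fidelity, and a small transformer
    head (standing in for an LLM) regresses the final quality score."""

    def __init__(self, feat_dim=512, hidden_dim=768):
        super().__init__()
        # Placeholder encoders; the real model uses pretrained backbones.
        self.text_encoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.video_encoder = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.alignment_head = nn.Linear(2 * feat_dim, hidden_dim)
        self.fidelity_head = nn.Linear(feat_dim, hidden_dim)
        # Stand-in for the LLM-based regression: a tiny transformer encoder
        # followed by a linear projection to a scalar score.
        self.regressor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, text_feats, video_feats):
        # text_feats: (B, T_text, feat_dim); video_feats: (B, T_frames, feat_dim)
        text_emb, _ = self.text_encoder(text_feats)
        text_emb = text_emb.mean(dim=1)                       # pooled prompt feature
        vid_emb = self.video_encoder(video_feats).mean(dim=1)  # pooled video feature
        align = self.alignment_head(torch.cat([text_emb, vid_emb], dim=-1))
        fidelity = self.fidelity_head(vid_emb)
        tokens = torch.stack([align, fidelity], dim=1)        # (B, 2, hidden_dim)
        fused = self.regressor(tokens).mean(dim=1)
        return self.score(fused).squeeze(-1)                  # predicted quality score


if __name__ == "__main__":
    model = TextVideoQualityModel()
    text = torch.randn(2, 16, 512)    # dummy prompt features
    video = torch.randn(2, 8, 512)    # dummy per-frame features
    print(model(text, video).shape)   # torch.Size([2])
```

Under this sketch, the model would be trained to regress the MOS labels in T2VQA-DB, with the two branches encouraging sensitivity to both prompt-video consistency and visual fidelity.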