Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

October 28–November 1, 2024 | Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu
This paper introduces T2VQA-DB, the largest-scale text-to-video quality assessment database to date, containing 10,000 videos generated by 9 different text-to-video (T2V) models, each annotated with a Mean Opinion Score (MOS) collected from 27 subjects. The dataset addresses the lack of quantitative evaluation methods for text-generated videos.

Building on T2VQA-DB, the authors propose T2VQA, a transformer-based model for subjective-aligned text-to-video quality assessment. T2VQA extracts features from two perspectives, text-video alignment and video fidelity, and then leverages a large language model to predict video quality. Experiments show that T2VQA outperforms existing T2V metrics and state-of-the-art video quality assessment models, achieving the best correlation with subjective scores as well as the best generalization across datasets.

The paper also reviews related work, including existing T2V datasets, metrics for T2V generation, and text-to-video generation methods. Its contributions are threefold: establishing the largest-scale T2V quality assessment dataset, proposing a novel model for text-to-video quality assessment, and demonstrating that T2VQA effectively measures the perceptual quality of text-generated videos. The dataset and code are available at https://github.com/QMME/T2VQA.
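To make the described two-branch design more concrete, below is a minimal, hypothetical sketch, not the authors' implementation: it fuses features from a text-video alignment branch and a video fidelity branch and regresses a single quality score. The feature dimensions, the transformer-based fusion, and the small regression head (standing in for the paper's LLM-based predictor) are all illustrative assumptions.

```python
# Hypothetical sketch of a two-branch text-to-video quality model.
# It is NOT the authors' code; backbones are replaced by precomputed features,
# and the LLM-based quality predictor is replaced by a simple regression head.
import torch
import torch.nn as nn


class TwoBranchT2VQualityModel(nn.Module):
    def __init__(self, align_dim=512, fidelity_dim=768, hidden_dim=1024):
        super().__init__()
        # Project alignment and fidelity tokens into a shared embedding space.
        self.align_proj = nn.Linear(align_dim, hidden_dim)
        self.fidelity_proj = nn.Linear(fidelity_dim, hidden_dim)
        # Fuse the two token streams with a small transformer encoder.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Regression head mapping pooled features to one quality score per video.
        self.head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, align_feat, fidelity_feat):
        # align_feat:    (B, N, align_dim) tokens from a text-video alignment encoder
        # fidelity_feat: (B, M, fidelity_dim) tokens from a video fidelity encoder
        tokens = torch.cat(
            [self.align_proj(align_feat), self.fidelity_proj(fidelity_feat)], dim=1
        )
        fused = self.fusion(tokens)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # predicted quality scores (B,)


if __name__ == "__main__":
    model = TwoBranchT2VQualityModel()
    scores = model(torch.randn(2, 32, 512), torch.randn(2, 16, 768))
    print(scores.shape)  # torch.Size([2])
```

Predictions from such a model would typically be compared against the MOS labels using Spearman and Pearson correlation (e.g., scipy.stats.spearmanr and pearsonr), which is the standard way "correlation with subjective scores" is quantified in video quality assessment benchmarks.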