2 Apr 2024 | Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang
This paper addresses the challenge of aligning large multimodal models (LMMs) for video instruction-following tasks by introducing a novel framework that uses detailed video captions as a proxy for video content. The approach prompts a language model to use this caption information as supporting evidence when scoring video question-answering (QA) predictions, and the resulting reward aligns well with OpenAI GPT-4V's reward mechanism, which takes video frames directly as input. Applying this tailored reward through direct preference optimization (DPO) significantly improves the performance of video LMMs on video QA tasks, yielding an 8.1% accuracy gain over the supervised fine-tuning (SFT) counterpart. The paper also introduces SHAREGPTVIDEO, a comprehensive video caption dataset, to address the scarcity of high-quality video captions. The proposed method is cost-effective, requiring less than $20 for data collection, compared to the $3,000 cost of human-evaluated data. The evaluation shows that the proposed reward is well aligned with GPT-4V's judgments, and the DPO-trained model outperforms existing baselines on video QA tasks. The paper concludes with a discussion of the model's generalization potential across datasets and suggests future improvements, such as finding better hyperparameters and refining the benchmark evaluation.
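The preference-optimization step itself follows the standard DPO objective; the paper's contribution lies in how the preference pairs are obtained (a language model scores candidate answers against the detailed video captions, and higher- versus lower-scored answers become the chosen/rejected pairs). Below is a minimal sketch of that objective, assuming summed per-response log-probabilities from the policy and a frozen SFT reference model; the function name, example values, and beta setting are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the response scored
    higher by the caption-based language-model reward (chosen) over the
    lower-scored one (rejected), relative to a frozen SFT reference model.

    Inputs are summed log-probabilities of each response under the
    respective model; beta controls deviation from the reference policy.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin) encourages chosen > rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.3, -8.7])
policy_rejected = torch.tensor([-14.1, -9.5])
ref_chosen = torch.tensor([-12.8, -9.0])
ref_rejected = torch.tensor([-13.9, -9.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In this formulation the language-model reward never has to be differentiable: it only ranks candidate answers offline to build the chosen/rejected pairs, and DPO then trains the video LMM directly on those pairs without fitting a separate reward model.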