2 Apr 2024 | Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang
This paper introduces a novel framework for direct preference optimization (DPO) in video large multimodal models (LMMs), using detailed video captions as a proxy for video content. The framework enables language models to incorporate this information as supporting evidence for scoring video question answering (QA) predictions. The approach demonstrates robust alignment with the reward mechanism of the OpenAI GPT-4V model, which directly takes video frames as input. The study shows that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.
The paper addresses the challenge of aligning LMMs, particularly in tasks involving video instruction following. Despite recent advancements in reinforcement learning (RL) and DPO, their effectiveness in multimodal contexts remains limited. The critical obstacle lies in developing a robust reward system capable of distinguishing preferred responses from less preferred ones, especially when such responses are generated based on video inputs. The challenge is further complicated by the presence of hallucinations in generated content, stemming from the scarcity of alignment data across different modalities.
To tackle these challenges, the paper introduces a cost-effective reward mechanism aimed at reliably evaluating the quality of responses generated by video LMMs. The method leverages detailed video captions as a proxy for video content, enabling a language model to assess the accuracy of an LMM's response to a related question and to detect hallucinations. The language model first provides natural language feedback as a chain-of-thought step and then generates a numerical score used as the reward, yielding a cost-effective feedback system.
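A minimal sketch of how such a caption-as-proxy reward could be computed is shown below. The judge prompt, model name, and score-parsing helper are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative caption-as-proxy reward scoring (assumed prompt template,
# judge model, and parsing; the paper's exact setup may differ).
import re
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a video question-answering response.
Detailed video caption (proxy for the video content):
{caption}

Question: {question}
Candidate answer: {answer}

First explain step by step whether the answer is supported by the caption
and whether it hallucinates content, then output a line 'Score: X' with an
integer from 1 (poor) to 5 (fully supported)."""

def caption_proxy_reward(caption: str, question: str, answer: str) -> float:
    """Ask a judge language model to score an answer against the caption proxy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge LLM; an assumption, not the paper's choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            caption=caption, question=question, answer=answer)}],
        temperature=0.0,
    )
    feedback = response.choices[0].message.content
    match = re.search(r"Score:\s*([0-9.]+)", feedback)
    return float(match.group(1)) if match else 0.0
```

The chain-of-thought explanation requested before the score is what makes the feedback auditable; only the final numeric score is used as the reward signal.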
The paper develops a large-scale, detailed video caption dataset, SHAREGPTVIDEO, using a novel prompting technique with the GPT-4V model, comprising 900k captions that encompass a wide range of video content. With this caption dataset available, the paper verifies that the reward mechanism, which uses video captions as a proxy, is well aligned with evaluations derived from the more powerful, albeit costlier, GPT-4V model-generated rewards. Employing this reward mechanism as the basis for the DPO algorithm, the paper trains LLAVA-HOUND-DPO, which achieves an 8.1% accuracy improvement over its SFT counterpart.
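Given judge scores for sampled answers, preference pairs (a higher-scored "chosen" answer versus a lower-scored "rejected" one for the same prompt) can be optimized with the standard DPO objective. The sketch below shows the usual DPO loss computed from sequence log-probabilities; the beta value and the pairing heuristic in the comment are common-practice assumptions, not values taken from the paper.

```python
# Standard DPO loss over preference pairs, computed from per-sequence
# log-probabilities under the policy and the frozen reference (SFT) model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (log-ratio(chosen) - log-ratio(rejected))), averaged over the batch."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Pairs could be built by taking, for each (caption, question), the sampled
# answer with the highest judge score as "chosen" and the lowest as "rejected"
# (a common heuristic; the paper's exact pairing rule may differ).
```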
The paper's contributions include: (1) developing a large-scale, detailed video caption dataset, (2) introducing a cost-effective method for evaluating video instruction-following tasks, and (3) demonstrating the effective application of DPO to improve model performance by leveraging language model feedback as reward. The paper also presents experimental results showing that the proposed method outperforms existing models on established video QA benchmarks, and concludes that it sets a new SOTA benchmark for video QA tasks.