CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

7 Mar 2024 | Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao
This paper introduces CAT, a novel multimodal large language model (MLLM) designed to enhance the ability of LLMs to answer questions in dynamic audio-visual scenarios. The main challenges addressed are ambiguity in responses and the difficulty of aligning LLMs with cross-domain data when training on large-scale multimodal corpora. To overcome these challenges, CAT is enhanced in three ways: 1) a clue aggregator is designed to aggregate question-related clues in dynamic audio-visual scenarios, enriching the detailed knowledge required by LLMs; 2) CAT is trained on a mixed multimodal dataset, including an audio-visual joint instruction dataset named AVinstruct, to further strengthen its ability to model cross-semantic correlations; and 3) an AI-assisted ambiguity-aware direct preference optimization (ADPO) strategy is proposed, which retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects.

Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA). The paper also discusses related work, including AVQA, MLLMs for AVQA, and human-preference learning. The approach comprises multimodal inputs, the clue aggregator, and the ADPO strategy. Experiments show that CAT achieves state-of-the-art results on various tasks, including video-based text generation, zero-shot video question answering, and open-ended AVQA. The paper concludes that CAT enhances LLMs' multimodal understanding in dynamic audio-visual scenarios and outlines future work toward more comprehensive applications. The code and dataset are available at https://github.com/rikeilong/Bay-CAT.
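The ADPO strategy builds on direct preference optimization (DPO), retraining the model to prefer a non-ambiguous answer over an ambiguous one for the same audio-visual question. The paper's exact ADPO objective is not reproduced in this summary; the snippet below is only a minimal sketch of a standard DPO-style preference loss in PyTorch, where the function name, the chosen/rejected log-probabilities, the frozen reference model, and the `beta` temperature are assumptions drawn from the general DPO literature rather than from CAT itself.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_chosen | x) from a frozen reference model
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_rejected | x)
    beta: float = 0.1,                    # temperature controlling deviation from the reference
) -> torch.Tensor:
    """Generic DPO loss sketch: push the policy to prefer the non-ambiguous
    (chosen) answer over the ambiguous (rejected) one, measured relative to
    the reference model. Not the paper's exact ADPO formulation."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): minimized when the chosen reward exceeds the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In an ADPO-style setup, the chosen/rejected pairs would presumably correspond to AI-labeled non-ambiguous versus ambiguous responses to the same audio-visual query, so minimizing this loss biases the model toward answers that commit to specific audio-visual objects.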