7 Mar 2024 | Qilang Ye, Zitong Yu*, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao
This paper addresses the challenge of answering questions in dynamic audio-visual scenarios using Multimodal Large Language Models (MLLMs). Existing MLLMs can respond to audio-visual content but often provide ambiguous or incomplete answers. To overcome this, the authors introduce CAT, a novel model that enhances MLLMs in three key ways:
1. **Clue Aggregator**: CAT introduces a clue aggregator that gathers question-related clues from dynamic audio-visual scenarios, enriching the detailed knowledge the large language model needs to answer (see the sketch after this list).
2. **Mixed Multimodal Training**: CAT is trained on a mixed multimodal dataset, allowing it to be applied directly to audio-visual scenarios. The authors also collect an audio-visual joint instruction dataset, AVInstruct, to further strengthen CAT's ability to model cross-modal semantic correlations.
3. **AI-assisted Ambiguity-aware Direct Preference Optimization (ADPO)**: CAT proposes a retraining strategy that steers the model toward non-ambiguous responses and improves its ability to localize specific audio-visual objects (a loss sketch also follows the list).
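To make the first component concrete, below is a minimal sketch of a question-conditioned clue aggregator: learnable queries cross-attend to the joint audio-visual token sequence so that only question-relevant "clues" are passed to the LLM. The layer sizes, the use of `nn.MultiheadAttention`, and the concatenation scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ClueAggregator(nn.Module):
    """Hypothetical clue aggregator: question-conditioned cross-attention
    over audio and visual tokens (an assumed design, not CAT's exact one)."""

    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, question_emb, audio_tokens, visual_tokens):
        # Condition the learnable queries on a pooled question embedding.
        q = self.queries.unsqueeze(0) + question_emb.unsqueeze(1)        # (B, Q, D)
        # Cross-attend over the concatenated audio-visual token sequence.
        av_tokens = torch.cat([audio_tokens, visual_tokens], dim=1)      # (B, Na+Nv, D)
        clues, _ = self.attn(q, av_tokens, av_tokens)
        return clues  # question-related clue tokens forwarded to the LLM
```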
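For the third component, the following is a minimal sketch of an ambiguity-aware preference loss built on the standard DPO objective (Rafailov et al., 2023), where the "chosen" response is a specific, non-ambiguous answer and the "rejected" response is an ambiguous one. The pairing strategy, `beta` value, and function signature are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def adpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Hypothetical ambiguity-aware DPO loss; inputs are per-example sums of
    token log-probabilities under the policy and a frozen reference model."""
    # Implicit reward: log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the non-ambiguous (chosen) answer.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```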
Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, particularly Audio-Visual Question Answering (AVQA). The paper also provides a detailed introduction to the problem, a review of related work, and a comprehensive evaluation of CAT's performance across various datasets and tasks.