25 Apr 2024 | Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski
This paper investigates the effectiveness and limitations of Multimodal In-Context Learning (M-ICL) in large multimodal models (LMMs). The study evaluates leading open-source multimodal models, such as IDEFICS and OpenFlamingo, on a variety of multimodal tasks, including Visual Question Answering (VQA), captioning, and classification. Key findings include:
1. **Text-Driven Mechanisms**: M-ICL primarily relies on text-driven mechanisms, with images having little to no influence.
2. **Performance with Advanced Strategies**: Advanced M-ICL strategies, such as RICES (Retrieval-based In-Context Example Selection), perform no better than a simple majority vote over the context examples.
3. **Biases and Limitations**: M-ICL suffers from biases, such as recency bias, where the model tends to copy the output of the last demonstration pair.
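
To make the second and third findings concrete, here is a minimal sketch of a similarity-based retrieval step, the majority-vote baseline it is compared against, and a simple way to quantify recency bias. It is not the paper's implementation: the function names (`retrieve_top_k`, `majority_vote`, `copy_last_rate`), the toy two-dimensional embeddings, and the answer labels are all assumptions made for illustration. In practice the embeddings would come from an image or text encoder.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, pool, k):
    """Return the k pool items most similar to the query.

    `pool` is a list of (embedding, answer) pairs. Items are returned in
    ascending similarity, so the closest demonstration comes last in the
    context (a hypothetical ordering choice, not taken from the paper).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]))
    return ranked[-k:]


def majority_vote(demos):
    """Baseline: predict the most frequent answer among the demonstrations."""
    answers = [answer for _, answer in demos]
    return Counter(answers).most_common(1)[0][0]


def copy_last_rate(predictions, contexts):
    """Fraction of predictions that equal the last demonstration's answer."""
    hits = sum(pred == demos[-1][1] for pred, demos in zip(predictions, contexts))
    return hits / len(predictions)


# Toy data: hypothetical embeddings paired with hypothetical answers.
pool = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.8, 0.2], "dog"),
]
query = [1.0, 0.05]

demos = retrieve_top_k(query, pool, k=3)
baseline_prediction = majority_vote(demos)
```

Comparing a model's accuracy against `majority_vote` on the same retrieved context shows whether the model adds anything beyond counting labels, and a high `copy_last_rate` on model outputs would be consistent with the recency bias described above.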
The study also explores how different modalities affect M-ICL performance and identifies the importance of text in driving the model's decision-making process. Additionally, it examines the impact of similarity-based context selection methods and the role of recency bias. The findings highlight the need for better retrieval methods and the reduction of biases to improve the effectiveness of M-ICL.