What Makes Multimodal In-Context Learning Work?

This paper investigates the effectiveness and limitations of Multimodal In-Context Learning (M-ICL) in large multimodal models (LMMs). The study evaluates leading open-source multimodal models, such as IDEFICS and OpenFlamingo, on a variety of multimodal tasks, including Visual Question Answering (VQA), captioning, and classification. Key findings include:

1. **Text-Driven Mechanisms**: M-ICL relies primarily on text-driven mechanisms; the images in the context have little to no influence.
2. **Performance with Advanced Strategies**: With advanced strategies such as RICES, which selects context examples by their similarity to the query, M-ICL performs no better than simple majority voting over the context examples.
3. **Biases and Limitations**: M-ICL suffers from biases such as recency bias, where the model tends to copy the output of the last demonstration pair.

The study also explores how each modality affects M-ICL performance and examines the impact of similarity-based context selection. The findings highlight the need for better retrieval methods and for reducing these biases to improve the effectiveness of M-ICL.
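
To make the comparison between similarity-based context selection and majority voting concrete, here is a minimal Python sketch. It is not the paper's implementation: the toy embeddings, the function names (`retrieve_top_k`, `majority_vote`, `copy_last`), and the choice to place the most similar example last in the context are all assumptions made for illustration. It retrieves the k demonstrations closest to a query, then contrasts two model-free predictors: a majority vote over the retrieved labels, and a "copy the last demonstration" rule that mimics recency bias.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_emb, pool, k):
    """Pick the k pool items most similar to the query.

    `pool` is a list of (embedding, label) pairs (hypothetical toy data).
    The result is ordered least-to-most similar, so the closest example
    sits last in the context (an assumed ordering, not taken from the paper).
    """
    ranked = sorted(pool, key=lambda ex: cosine(query_emb, ex[0]), reverse=True)
    return ranked[:k][::-1]


def majority_vote(demos):
    """Predict the most frequent label among the context examples."""
    return Counter(label for _, label in demos).most_common(1)[0][0]


def copy_last(demos):
    """Predict the label of the last demonstration (a pure recency-bias rule)."""
    return demos[-1][1]


if __name__ == "__main__":
    pool = [
        ([1.0, 0.0], "dog"),
        ([0.9, 0.1], "cat"),
        ([0.8, 0.2], "cat"),
        ([0.0, 1.0], "bird"),
    ]
    demos = retrieve_top_k([1.0, 0.0], pool, k=3)
    print(majority_vote(demos))  # cat
    print(copy_last(demos))      # dog
```

In this toy case the two rules disagree: the single closest example is labeled "dog", while most of the retrieved examples are labeled "cat". Comparing a model's output against both rules is one simple way to probe whether it is aggregating the context or copying its last demonstration.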

25 Apr 2024 | Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski