25 Apr 2024 | Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski
This paper investigates the effectiveness of Multimodal In-Context Learning (M-ICL) in Large Multimodal Models (LMMs), using the best open-source models available, such as IDEFICS and OpenFlamingo, across a wide range of multimodal tasks. The study reveals that M-ICL primarily relies on text-driven mechanisms, with limited influence from the image modality: images play a crucial role in image-to-text tasks, while text dominates in tasks involving both image and text.

A central finding concerns advanced, similarity-based context selection. When demonstrations are chosen with strategies like RICES, M-ICL does not outperform a simple majority vote over the answers of the context examples. Retrieval-based context selection does improve raw performance, but the gains stem from shortcuts rather than genuine learning from the demonstrations.

The study also identifies biases and limitations that should be considered before deployment, most notably a recency bias: the model tends to copy the answer of the last example in the context. Performance is further influenced by the number of demonstrations and by the similarity between the query and the context. Overall, the results indicate that while images do have an impact on M-ICL, textual information takes precedence and drives the model's decision-making. The authors highlight the need for better retrieval methods and for reducing biases such as recency bias.
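To make the similarity-based selection concrete, the sketch below implements a RICES-style retrieval step: rank a pool of candidate demonstrations by cosine similarity between precomputed image embeddings (for instance from a frozen vision encoder such as CLIP) and keep the top k. The function name `rices_select` and the raw-NumPy interface are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rices_select(query_emb: np.ndarray, pool_embs: np.ndarray, k: int = 8) -> np.ndarray:
    """RICES-style demonstration selection (illustrative sketch).

    query_emb: (d,) embedding of the query image.
    pool_embs: (n, d) embeddings of the candidate demonstration pool.
    Returns indices of the k most similar examples in ascending
    similarity order (a common convention places the most similar
    example last, i.e., closest to the query in the prompt).
    """
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(sims)[-k:]
```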
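The majority-voting baseline that matches this retrieval strategy is strikingly simple: ignore the query's image entirely and predict the most frequent answer among the retrieved demonstrations. A minimal version, assuming answers are plain strings:

```python
from collections import Counter

def majority_vote(context_answers: list[str]) -> str:
    """Image-blind baseline: return the most common answer among
    the in-context examples (ties broken by first occurrence)."""
    return Counter(context_answers).most_common(1)[0][0]
```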
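Recency bias can likewise be quantified with a simple diagnostic: the fraction of test queries on which the model's prediction equals the answer of the final in-context example. The helper below (`recency_copy_rate`, a hypothetical name) is one way to measure it; a rate well above the plain majority-vote agreement rate suggests the model is copying the last demonstration rather than using the full context.

```python
def recency_copy_rate(predictions: list[str], last_demo_answers: list[str]) -> float:
    """Fraction of predictions identical to the answer of the last
    in-context example; high values indicate recency bias."""
    assert len(predictions) == len(last_demo_answers)
    matches = sum(p == a for p, a in zip(predictions, last_demo_answers))
    return matches / len(predictions)
```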