On the Out-Of-Distribution Generalization of Multimodal Large Language Models

9 Feb 2024 | Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazhen Xu, and Peng Cui
This paper investigates the generalization capabilities of current Multimodal Large Language Models (MLLMs) under out-of-distribution (OOD) scenarios and on domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets such as medical and molecular imagery. Empirical results show that MLLMs struggle to generalize beyond common training domains, limiting their direct application without adaptation. We analyze three hypotheses for this unreliable performance: semantic misinterpretation, insufficient visual feature extraction, and mapping deficiency, and identify mapping deficiency as the primary hurdle. To address it, we show that in-context learning (ICL) can significantly enhance MLLMs' generalization, opening new avenues for overcoming generalization barriers. We further explore the robustness of ICL under distribution shifts and show its vulnerability to domain shifts, label shifts, and spurious-correlation shifts between in-context examples and test data.

We evaluate the zero-shot generalization of 14 MLLMs on 20 datasets under various distributional shifts and demonstrate that the OOD generalization performance of MLLMs can diverge significantly from their performance on current public benchmarks. We investigate the scaling behavior of MLLMs' OOD generalization to probe whether visual feature extraction fails under the high data complexity of the visual input. Our analysis identifies mapping deficiency, rather than semantic misinterpretation of the text input or data complexity of the visual input, as the primary hindrance to model generalization. Finally, we validate the potential of ICL to improve the model's acquisition and use of the critical relationships between semantic descriptions and visual features, using in-context examples (ICE) drawn both from the target distribution and from biased distributions.
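To make the ICL setting concrete, the sketch below builds an interleaved image/text prompt from in-context examples (ICE) and scores a test set under either the zero-shot or the ICL regime. It is a minimal illustration, not the authors' code: `query_mllm`, the message format, and the `Example` fields are hypothetical placeholders to be replaced with the actual API of the MLLM under evaluation.

```python
# Minimal sketch (assumed names, not the paper's implementation): compare
# zero-shot prompting with in-context learning (ICL) for an MLLM classifier.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    image_path: str  # path to the image shown to the model
    label: str       # ground-truth class name


def build_messages(query: Example,
                   ice: Optional[List[Example]] = None,
                   classes: Optional[List[str]] = None) -> list:
    """Assemble an interleaved image/text prompt.

    With ice=None this is the zero-shot setting; otherwise each in-context
    example contributes its image followed by its label, ahead of the query.
    """
    instruction = "Classify the image"
    if classes:
        instruction += " as one of: " + ", ".join(classes)
    messages = [{"type": "text", "content": instruction + "."}]
    for ex in ice or []:
        messages.append({"type": "image", "content": ex.image_path})
        messages.append({"type": "text", "content": f"Answer: {ex.label}"})
    messages.append({"type": "image", "content": query.image_path})
    messages.append({"type": "text", "content": "Answer:"})
    return messages


def query_mllm(messages: list) -> str:
    """Hypothetical model call; swap in the real MLLM chat/completions API."""
    raise NotImplementedError("plug in the model under evaluation")


def accuracy(test_set: List[Example],
             ice: Optional[List[Example]] = None,
             classes: Optional[List[str]] = None) -> float:
    """Top-1 accuracy under a given prompting regime (zero-shot or ICL)."""
    correct = 0
    for sample in test_set:
        prediction = query_mllm(build_messages(sample, ice, classes))
        correct += int(prediction.strip().lower() == sample.label.lower())
    return correct / len(test_set)
```

Comparing `accuracy(test_set)` with `accuracy(test_set, ice=...)` for ICE sampled from the target distribution versus a domain-, label-, or spuriously-correlated biased distribution mirrors the comparisons the paper draws between zero-shot, in-distribution ICE, and biased ICE.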