On the Out-Of-Distribution Generalization of Multimodal Large Language Models

9 Feb 2024 | Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazhen Xu, and Peng Cui
This paper investigates the generalization capabilities of current Multimodal Large Language Models (MLLMs) under out-of-distribution (OOD) scenarios and on domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets such as medical and molecular imagery. Empirical results show that MLLMs struggle to generalize beyond common training domains, limiting their direct application without adaptation. We analyze three hypotheses for this unreliable performance: semantic misinterpretation, insufficient visual feature extraction, and mapping deficiency, and identify mapping deficiency as the primary hurdle. To address it, we show that in-context learning (ICL) can significantly enhance MLLMs' generalization, opening new avenues for overcoming generalization barriers. We further explore the robustness of ICL under distribution shifts and show its vulnerability to domain shifts, label shifts, and spurious-correlation shifts between in-context examples and test data.

We evaluate the zero-shot generalization of 14 MLLMs on 20 datasets under various distributional shifts and demonstrate that the OOD generalization performance of MLLMs can diverge significantly from their performance on current public benchmarks. We investigate the scaling behavior of MLLMs' OOD generalization to probe whether visual feature extraction fails under the high data complexity of the visual input. Our analysis identifies mapping deficiency, rather than semantic misinterpretation of the text input or data complexity of the visual input, as the primary hindrance to model generalization. Finally, we validate the potential of ICL to improve the model's acquisition and use of the critical relationships between semantic descriptions and visual features, using in-context examples (ICE) drawn both from the target distribution and from biased distributions.
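To make the ICL setting concrete, the sketch below builds an interleaved image/text prompt from in-context examples (ICE) and scores a test set under either the zero-shot or the ICL regime. It is a minimal illustration, not the authors' code: `query_mllm`, the message format, and the `Example` fields are hypothetical placeholders to be replaced with the actual API of the MLLM under evaluation.

```python
# Minimal sketch (assumed names, not the paper's implementation): compare
# zero-shot prompting with in-context learning (ICL) for an MLLM classifier.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    image_path: str  # path to the image shown to the model
    label: str       # ground-truth class name


def build_messages(query: Example,
                   ice: Optional[List[Example]] = None,
                   classes: Optional[List[str]] = None) -> list:
    """Assemble an interleaved image/text prompt.

    With ice=None this is the zero-shot setting; otherwise each in-context
    example contributes its image followed by its label, ahead of the query.
    """
    instruction = "Classify the image"
    if classes:
        instruction += " as one of: " + ", ".join(classes)
    messages = [{"type": "text", "content": instruction + "."}]
    for ex in ice or []:
        messages.append({"type": "image", "content": ex.image_path})
        messages.append({"type": "text", "content": f"Answer: {ex.label}"})
    messages.append({"type": "image", "content": query.image_path})
    messages.append({"type": "text", "content": "Answer:"})
    return messages


def query_mllm(messages: list) -> str:
    """Hypothetical model call; swap in the real MLLM chat/completions API."""
    raise NotImplementedError("plug in the model under evaluation")


def accuracy(test_set: List[Example],
             ice: Optional[List[Example]] = None,
             classes: Optional[List[str]] = None) -> float:
    """Top-1 accuracy under a given prompting regime (zero-shot or ICL)."""
    correct = 0
    for sample in test_set:
        prediction = query_mllm(build_messages(sample, ice, classes))
        correct += int(prediction.strip().lower() == sample.label.lower())
    return correct / len(test_set)
```

Comparing `accuracy(test_set)` with `accuracy(test_set, ice=...)` for ICE sampled from the target distribution versus a domain-, label-, or spuriously-correlated biased distribution mirrors the comparisons the paper draws between zero-shot, in-distribution ICE, and biased ICE.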