This paper investigates the generalization capabilities of current Multimodal Large Language Models (MLLMs) under out-of-distribution (OOD) scenarios and domain-specific tasks. The authors evaluate the zero-shot generalization of 14 MLLMs across a range of datasets, including synthetic images, real-world distribution shifts, and specialized domains such as medical and molecular imagery. The results indicate that MLLMs struggle to generalize beyond their training domains, highlighting the need for adaptation. To understand the causes of this weak generalization, the authors examine three hypotheses: semantic misinterpretation, insufficient visual feature extraction, and mapping deficiency. Their analysis identifies mapping deficiency as the primary issue. To address it, the authors explore in-context learning (ICL) as a means of improving MLLMs' generalization. They find that incorporating in-context examples (ICE) from both the target and biased distributions significantly improves generalization, but also show that ICL remains vulnerable to domain shifts, label shifts, and spurious correlation shifts. The paper's contributions are an evaluation of MLLMs' zero-shot generalization, an investigation of scaling laws, and the identification of the primary hindrance to generalization. It also demonstrates the effectiveness of ICL under different distribution shifts while noting its limitations in certain scenarios.
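
To make the ICL setup concrete, the sketch below is a minimal illustration (not the authors' code) of how in-context examples drawn from both the target and biased distributions might be interleaved into a single multimodal prompt ahead of the query image. The `<image:...>` placeholder convention, file names, and class labels are all illustrative assumptions.

```python
# Minimal sketch of assembling an ICL prompt with in-context examples (ICE)
# from two distributions. The placeholder convention and file names are
# hypothetical; a real MLLM API would accept actual image tensors or files.

from dataclasses import dataclass
from typing import List


@dataclass
class ICExample:
    image_path: str  # path to a demonstration image
    label: str       # ground-truth class name for that image


def build_icl_prompt(ice: List[ICExample], query_image: str, classes: List[str]) -> str:
    """Interleave demonstration images with their labels, then append the query."""
    lines = [f"Classify each image as one of: {', '.join(classes)}."]
    for ex in ice:
        lines.append(f"<image:{ex.image_path}> Answer: {ex.label}")
    lines.append(f"<image:{query_image}> Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Mix ICE from the target (shifted) distribution with ICE from the
    # biased/source distribution, mirroring the combined-ICE setting the
    # summary describes.
    target_ice = [ICExample("target_example_01.png", "class_a")]
    biased_ice = [ICExample("source_example_17.png", "class_b")]
    prompt = build_icl_prompt(target_ice + biased_ice,
                              query_image="query_03.png",
                              classes=["class_a", "class_b"])
    print(prompt)
```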