Visual In-Context Learning for Large Vision-Language Models

18 Feb 2024 | Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen
This paper introduces Visual In-Context Learning (VICL) for Large Vision-Language Models (LVLMs) to enhance their in-context learning (ICL) capabilities. VICL addresses challenges in cross-modal interaction and representation disparity through three components: Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. The method retrieves images relevant to the query, summarizes them according to the task intent, and composes language-based demonstrations that reduce token count and improve cross-modal interaction. Experiments on five visual reasoning datasets show that VICL yields significant improvements over baseline methods. The method also supports in-context unlearning, allowing a model to reset specific knowledge without retraining. An information-flow analysis indicates that VICL aligns visual and textual modalities more effectively, and the study further examines how demonstration length, position, and order affect LVLM performance, highlighting the importance of intent-oriented summaries and demonstration ordering. VICL outperforms standard ICL and zero-shot approaches, particularly for higher-capacity models, and shows promise in bridging visual and linguistic modalities for multi-modal tasks. The paper also discusses limitations, including dependence on the underlying LVLM's performance and the need for further research on broader applications. Overall, VICL offers a novel approach to improving LVLMs' in-context learning and unlearning capabilities.
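To make the three-stage pipeline described above concrete, the sketch below shows one way the retrieve-summarize-compose flow could be wired together in Python. It is a minimal illustration, not the authors' implementation: function names such as retrieve_demonstrations, summarize_with_intent, compose_prompt, and vicl_answer are hypothetical, and the code assumes a generic image-embedding function and an LVLM callable are supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Demo:
    image_embedding: np.ndarray  # precomputed visual feature used for retrieval
    summary: str                 # intent-oriented textual summary of the demo image
    label: str                   # ground-truth answer for this demonstration


def retrieve_demonstrations(query_emb: np.ndarray, pool: List[Demo], k: int = 4) -> List[Demo]:
    """Visual Demonstration Retrieval: pick the k pool items most similar to the query image."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sorted(pool, key=lambda d: cosine(query_emb, d.image_embedding), reverse=True)[:k]


def summarize_with_intent(lvlm: Callable[[str, object], str], image, task_intent: str) -> str:
    """Intent-Oriented Image Summarization: describe the image conditioned on the task intent."""
    prompt = f"Summarize this image, focusing on: {task_intent}"
    return lvlm(prompt, image)


def compose_prompt(demos: List[Demo], query_summary: str, question: str) -> str:
    """Intent-Oriented Demonstration Composition: language-only demos keep the prompt compact."""
    lines = [f"Image: {d.summary}\nAnswer: {d.label}" for d in demos]
    lines.append(f"Image: {query_summary}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(lines)


def vicl_answer(lvlm: Callable[[str, object], str],
                embed_image: Callable[[object], np.ndarray],
                demo_pool: List[Demo],
                query_image,
                question: str,
                intent: str) -> str:
    """End-to-end VICL-style inference: retrieve, summarize, compose, then query the LVLM."""
    demos = retrieve_demonstrations(embed_image(query_image), demo_pool)
    query_summary = summarize_with_intent(lvlm, query_image, intent)
    return lvlm(compose_prompt(demos, query_summary, question), query_image)
```

In this sketch the demonstrations enter the prompt as text summaries rather than raw images, which is the token-saving, cross-modal-friendly property the paper attributes to composing language-based demonstrations.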