Visual In-Context Learning for Large Vision-Language Models

18 Feb 2024 | Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen
The paper introduces Visual In-Context Learning (VICL), a method for improving the In-Context Learning (ICL) capabilities of Large Vision-Language Models (LVLMs). VICL targets the challenges of cross-modal interaction and representation disparity between images and text in LVLMs. The method consists of three components: Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Visual Demonstration Retrieval uses a pre-trained image encoder to retrieve images relevant to the query and then reranks them using their textual descriptions. Intent-Oriented Image Summarization generates task-specific summaries of images from image-label pairs. Intent-Oriented Demonstration Composition assembles these summaries into in-context demonstrations, reducing the token count per demonstration and improving ICL performance. Experiments on five visual reasoning datasets show that VICL delivers significant improvements over baseline methods. The paper also analyzes the impact of demonstration length and position, and demonstrates in-context unlearning, which resets specific model knowledge without retraining.
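The retrieve-then-rerank step is the most mechanical part of the pipeline, so a rough sketch may help make it concrete. The code below is a minimal illustration and not the authors' implementation: `encode_image`, `encode_text`, and `caption` are hypothetical stand-ins for a pre-trained image encoder, a text encoder, and an image-captioning model, and the candidate counts `k` and `m` are arbitrary.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_demonstrations(query_image, pool, encode_image, encode_text, caption, k=8, m=4):
    """Retrieve-then-rerank sketch (illustrative only).

    pool         : list of (image, label) pairs available as demonstrations
    encode_image : image -> feature vector (pre-trained image encoder; assumed)
    encode_text  : text  -> feature vector (text encoder; assumed)
    caption      : image -> textual description (captioning model; assumed)
    k            : candidates kept after visual retrieval
    m            : demonstrations kept after textual reranking
    """
    # Stage 1: visual retrieval with the image encoder.
    q_vec = encode_image(query_image)
    scored = [(cosine(q_vec, encode_image(img)), img, label) for img, label in pool]
    candidates = sorted(scored, key=lambda s: s[0], reverse=True)[:k]

    # Stage 2: rerank the visual candidates by how well their textual
    # descriptions match the query image's description.
    q_txt = encode_text(caption(query_image))
    reranked = sorted(
        candidates,
        key=lambda s: cosine(q_txt, encode_text(caption(s[1]))),
        reverse=True,
    )
    return [(img, label) for _, img, label in reranked[:m]]
```

In this sketch, the returned image-label pairs would then feed the summarization and composition stages, where each retrieved image is replaced by a task-oriented summary before being assembled into the final in-context prompt.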