17 Jul 2024 | Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky
This paper presents a method to enhance the in-context learning (ICL) capabilities of Vision and Language Models (VLMs). Current state-of-the-art VLMs primarily rely on projecting vision tokens into language-like tokens to align them with Large Language Models (LLMs), but they lack the ability to follow ICL instructions, i.e., to use a few examples provided in the prompt to reason about a downstream task. The authors propose a simple yet effective multi-turn, curriculum-based learning methodology with effective data mixes, yielding an ICL performance boost of up to 21.03% (11.3% on average) over the strongest VLM baselines across a variety of ICL benchmarks. They also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over prior art.
The authors analyze the effectiveness of different ICL instruction task types and data mixes, showing that semantically coherent ICL instructions significantly improve performance, and demonstrate that their approach preserves the core capabilities of the base model, as measured by the MME benchmark. The method is built around a multi-turn ICL conversation format in which the model is trained for any-shot operation, so it can accept an arbitrary number of in-context demonstration 'shots' at inference time. Training uses a mix of semantically coherent ICL instructions, curated so that all shots in an instruction share a common semantic concept. The proposed method significantly improves ICL performance across a variety of tasks, including fine-grained few-shot visual recognition, multiple-choice QA, and captioning. The authors also highlight the importance of replaying non-ICL instruction data to preserve the base model's capabilities, and show that the method scales, with performance improving as more ICL data is added.
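The multi-turn format suggests a straightforward way to assemble training samples: group examples by semantic concept, render each demonstration as a user/assistant turn pair, and append the query as the final turn, sampling the number of shots per example so the model learns to operate in any-shot settings. Below is a minimal sketch of such a construction, not the paper's actual pipeline; it assumes a generic chat-style message list with an <image> placeholder per turn, and the function and field names (build_icl_conversation, concept, question, answer) are illustrative.

```python
import random
from typing import Dict, List


def build_icl_conversation(
    pool: List[Dict],   # examples of the form {"concept", "image", "question", "answer"}
    query: Dict,        # the example the model must answer
    max_shots: int = 4,
) -> List[Dict[str, str]]:
    """Assemble a multi-turn, any-shot ICL sample (illustrative sketch).

    Demonstrations are drawn only from examples that share the query's
    semantic concept, and the number of shots is sampled per example so
    the model sees anywhere from 0 to max_shots demonstrations.
    """
    # Keep the instruction semantically coherent: all shots share one concept,
    # and the query itself is excluded from the candidate pool.
    candidates = [ex for ex in pool
                  if ex["concept"] == query["concept"] and ex is not query]
    num_shots = random.randint(0, min(max_shots, len(candidates)))
    shots = random.sample(candidates, num_shots)

    messages: List[Dict[str, str]] = []
    # Each demonstration becomes one user/assistant turn pair.
    for ex in shots:
        messages.append({"role": "user", "content": f"<image>\n{ex['question']}"})
        messages.append({"role": "assistant", "content": ex["answer"]})

    # The query forms the final turn pair.
    messages.append({"role": "user", "content": f"<image>\n{query['question']}"})
    messages.append({"role": "assistant", "content": query["answer"]})
    return messages
```

Whether the loss is applied to every assistant turn or only to the last one is a design choice; the any-shot behavior follows from varying num_shots during training, and at inference the final assistant turn is simply left for the model to generate.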