Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning


21 Jun 2024 | Brandon Huang, Chancharik Mitra, Assaf Arbel, Leonid Karlinsky, Trevor Darrell, Roei Herzig
This paper introduces Multimodal Task Vectors (MTV), a method that enables many-shot multimodal in-context learning (ICL) in large multimodal models (LMMs). The key idea is to compress many in-context examples into compact, implicit representations stored in the model's attention-head activations, so that the model can learn from far more examples than its context length would otherwise allow.

The method has three steps: (1) compute the mean activations of the attention heads over a set of many-shot ICL examples; (2) select the attention-head locations that best align with the downstream task; and (3) at inference time, replace the activations at the selected head locations with the corresponding mean activations, requiring no additional context tokens.

MTV scales with the number of compressed shots and generalizes to similar out-of-domain tasks without additional context length. It outperforms zero-shot and few-shot ICL on a range of vision-language benchmarks while requiring less runtime and memory than conventional many-shot ICL, and it can be combined with additional explicit ICL examples. Evaluations on several LMMs, including Qwen-VL, Idefics2-8B, and ViLA-1.5-8B, show improvements on tasks such as visual question answering and object classification, indicating that MTV encodes many multimodal ICL examples more efficiently than few-shot prompting. Compared with alternatives such as LoRA finetuning and prior task-vector methods, MTV performs favorably on several axes. The paper concludes that MTV is a viable way to surpass the context-length limitations of LMMs for multimodal ICL and demonstrates the effectiveness of using many examples for multimodal reasoning.
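To make the three steps concrete, the following is a minimal PyTorch sketch of the MTV procedure: averaging per-head activations over many encoded shots, selecting a set of head locations, and patching those heads during a downstream forward pass. The toy model, tensor shapes, and the norm-based selection heuristic are illustrative assumptions, not the authors' implementation; in the paper, activations come from the LMM's own attention heads and head selection is optimized for alignment with the downstream task.

```python
# Minimal sketch of the three MTV steps on a toy transformer-like stack.
# Shapes, the ToyLayer module, and the selection heuristic are assumptions
# made for illustration only.

import torch
import torch.nn as nn

L, H, D = 4, 8, 64          # layers, heads per layer, head dim (assumed)

class ToyLayer(nn.Module):
    """Stand-in for one transformer layer that exposes per-head outputs."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(H * D, H * D)

    def forward(self, x):                          # x: (seq_len, H*D)
        heads = self.proj(x).view(x.size(0), H, D)
        return heads.reshape(x.size(0), H * D)

model = nn.ModuleList([ToyLayer() for _ in range(L)])

# ---- Step 1: mean per-head activations over many ICL shots ----------------
@torch.no_grad()
def head_activations(x):
    """Run the toy stack and return last-token per-head outputs, shape (L, H, D)."""
    acts = []
    for layer in model:
        x = layer(x)
        acts.append(x[-1].view(H, D))              # keep the last token only
    return torch.stack(acts)

shots = [torch.randn(16, H * D) for _ in range(100)]   # 100 encoded ICL shots
mean_act = torch.stack([head_activations(s) for s in shots]).mean(0)  # (L, H, D)

# ---- Step 2: select head locations that best align with the task ----------
# Assumed heuristic: keep heads with the largest mean-activation norm; the
# paper instead selects heads by how well they align with the downstream task.
k = 16
scores = mean_act.norm(dim=-1).flatten()                # (L*H,)
top = torch.topk(scores, k).indices
selected = [(int(i) // H, int(i) % H) for i in top]     # (layer, head) pairs

# ---- Step 3: patch the selected heads with the means at inference ---------
@torch.no_grad()
def run_with_mtv(x):
    """Zero-shot forward pass with mean activations written into selected heads."""
    for l, layer in enumerate(model):
        x = layer(x)
        heads = x.view(x.size(0), H, D).clone()
        for (pl, ph) in selected:
            if pl == l:
                heads[-1, ph] = mean_act[pl, ph]        # overwrite last token's head
        x = heads.reshape(x.size(0), H * D)
    return x

query = torch.randn(16, H * D)                          # a new downstream query
out = run_with_mtv(query)
print(out.shape)
```

Because the compressed examples live entirely in the patched activations, the downstream prompt stays zero-shot length, which is what lets MTV scale the number of shots without consuming context tokens.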