17 Jul 2024 | Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky
The paper addresses the challenges of multimodal in-context learning (ICL) in vision and language models (VLMs) and proposes solutions. It discusses related work on grounded situation recognition, transferable visual models learned from natural language supervision, robust speech recognition from large-scale weak supervision, and generative multimodal models for ICL. The authors propose a simple yet effective multi-turn, curriculum-based learning methodology with effective data mixes, yielding a significant 21.03% ICL performance boost over state-of-the-art VLM baselines across various ICL benchmarks. They also contribute new benchmarks for ICL evaluation in VLMs, discuss their advantages over prior art, and support the approach with extensive evaluations and ablation studies.
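For readers unfamiliar with the setup, multimodal ICL prompts a VLM with a few interleaved image-text demonstrations followed by a query image, so the model infers the task from context rather than from fine-tuning. The sketch below is a minimal, hypothetical illustration of assembling such a prompt; the `Demo` dataclass, the `<image:...>` placeholder token, and `build_icl_prompt` are assumptions for illustration only, not the paper's actual pipeline or data mix.

```python
# Minimal sketch of multimodal in-context prompt assembly (illustrative only).
# Assumptions: a generic "<image:path>" placeholder stands in for however a
# given VLM actually ingests images; field and function names are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    image_path: str  # path to a demonstration image
    text: str        # expected textual output (e.g., caption or answer) for that image


def build_icl_prompt(demos: List[Demo], query_image_path: str, task_instruction: str) -> str:
    """Interleave k demonstrations before the query so the model can infer the task in context."""
    parts = [task_instruction]
    for d in demos:
        # Each shot pairs an image placeholder with its target text.
        parts.append(f"<image:{d.image_path}> {d.text}")
    # The query image comes last, with no answer, for the model to complete.
    parts.append(f"<image:{query_image_path}>")
    return "\n".join(parts)


if __name__ == "__main__":
    demos = [
        Demo("dog.jpg", "A dog catching a frisbee."),
        Demo("cat.jpg", "A cat sleeping on a sofa."),
    ]
    print(build_icl_prompt(demos, "bird.jpg", "Describe each image in one sentence."))
```

The point of the sketch is simply that ICL quality hinges on how demonstrations are selected and ordered in the prompt, which is the axis the paper's curriculum-based training and data mixes target.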