16 May 2024 | Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng
This paper evaluates the performance of many-shot in-context learning (ICL) for multimodal foundation models, focusing on GPT-4o and Gemini 1.5 Pro. The study benchmarks these models across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). The results show that increasing the number of demonstration examples significantly improves model performance, with Gemini 1.5 Pro exhibiting log-linear improvements across most datasets, while GPT-4o's improvements are less stable. The study also explores the impact of batching multiple queries into a single API call, finding that batching up to 50 queries can improve performance, especially in zero-shot settings, while drastically reducing per-query cost and latency. Additionally, the study measures ICL data efficiency, finding that Gemini 1.5 Pro is more data-efficient than GPT-4o on most datasets. These results suggest that many-shot ICL enables efficient adaptation of multimodal foundation models to new applications and domains. The codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL.
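As a rough illustration of the two ideas above, a many-shot prompt interleaves many demonstration (image, label) pairs before the queries, and query batching appends several unlabeled queries to the same prompt so one API call answers all of them. The sketch below is hypothetical: the function name and the `<image:...>` placeholder format are assumptions, not the authors' implementation (which is in the linked ManyICL repository).

```python
def build_many_shot_prompt(demos, queries, classes):
    """Assemble a many-shot ICL prompt: labeled demonstration images
    followed by a batch of unlabeled query images.

    demos:   list of (image_ref, label) demonstration pairs
    queries: list of image_refs to classify in a single API call
    classes: allowed class names for the task instruction
    """
    lines = [f"Classify each image as one of: {', '.join(classes)}."]
    # Demonstration examples: image placeholder plus its ground-truth label.
    for i, (image_ref, label) in enumerate(demos, 1):
        lines.append(f"Demo {i}: <image:{image_ref}> Answer: {label}")
    # Batched queries: same format, but the answer is left for the model.
    for j, image_ref in enumerate(queries, 1):
        lines.append(f"Query {j}: <image:{image_ref}> Answer:")
    return "\n".join(lines)


prompt = build_many_shot_prompt(
    demos=[("img_001.png", "benign"), ("img_002.png", "malignant")],
    queries=["img_101.png", "img_102.png"],
    classes=["benign", "malignant"],
)
```

In a real pipeline each `<image:...>` placeholder would be replaced by an actual image part in the multimodal API request; the per-query cost saving comes from sharing the (long) demonstration context across all queries in the batch.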