16 May 2024 | Yixing Jiang*, Jeremy Irvin*, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng
This paper evaluates the performance of multimodal foundation models, specifically GPT-4o and Gemini 1.5 Pro, in many-shot in-context learning (ICL). The study spans 10 datasets from various domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). Key findings include:
1. **Performance Improvements**: Many-shot ICL with up to nearly 2,000 demonstrating examples substantially improves model performance over few-shot ICL (<100 examples). Gemini 1.5 Pro improves log-linearly up to the maximum tested number of examples on most datasets (see the fitted form after this list).
2. **Data Efficiency**: Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets, i.e., it gains more accuracy per additional demonstrating example as the demonstration set grows.
3. **Batching Queries**: Batching multiple queries into a single API call reduces per-query cost and latency while maintaining or improving performance (a prompt-construction sketch appears at the end of this summary). Batching is particularly beneficial in the zero-shot setting, where it can yield significant performance gains.
4. **Sensitivity to Prompt Selection**: The study explores the robustness of many-shot ICL to different prompt formulations, finding consistent log-linear improvement trends across various prompts.
5. **Cost and Latency Analysis**: In the many-shot regime, query batching yields substantial reductions in both per-example latency and per-example cost, since the long demonstration prefix is shared across all queries in a single call.
The results suggest that many-shot ICL can enable efficient adaptation of multimodal foundation models to new applications and domains, making these models more adaptable and accessible. The codebase for the experiments is publicly available at <https://github.com/stanfordmlgroup/ManyICL>.