Flamingo: a Visual Language Model for Few-Shot Learning

15 Nov 2022 | Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Flamingo is a family of Visual Language Models (VLMs) designed to adapt rapidly to novel tasks from only a few annotated examples. Its key architectural innovations are bridging powerful pretrained vision-only and language-only models, handling sequences of arbitrarily interleaved visual and textual data, and seamlessly ingesting images or videos as inputs. Flamingo models are trained on large-scale multimodal web corpora, which enables in-context few-shot learning.

The models achieve state-of-the-art few-shot performance on a wide range of open-ended vision and language tasks, outperforming fine-tuned models on 6 of 16 tasks while using far less task-specific training data. Flamingo's ability to handle interleaved text and visual sequences makes it suitable for tasks such as visual question-answering, captioning, and multiple-choice visual question-answering.

Architecturally, a Perceiver Resampler converts visual features into a fixed number of tokens, and gated cross-attention layers condition the frozen language model on these visual inputs. Flamingo's performance improves with model size and the number of shots, and it can be fine-tuned to set new state-of-the-art results on additional challenging benchmarks. The paper also discusses limitations, societal impacts, and directions for future work.
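The gated cross-attention conditioning is the part of the architecture most easily illustrated in code. Below is a minimal PyTorch sketch, in the spirit of Flamingo's gated cross-attention layers: text hidden states from the frozen language model attend to the visual tokens produced by the Perceiver Resampler, and zero-initialised tanh gates make the block an identity function at the start of training so the pretrained LM's behaviour is preserved. The class name, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Sketch of a tanh-gated cross-attention block (assumed structure).

    Text tokens (queries) attend to a fixed set of visual tokens
    (keys/values). Learnable scalar gates, initialised at zero, mean the
    block initially passes text features through unchanged.
    """

    def __init__(self, dim: int, num_heads: int = 8, ff_mult: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init
        self.ff = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, ff_mult * dim),
            nn.GELU(),
            nn.Linear(ff_mult * dim, dim),
        )
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention: text queries attend to visual keys/values.
        q = self.norm(text_tokens)
        attn_out, _ = self.attn(q, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        # Gated feed-forward sub-block, also residual.
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x


# Usage sketch: 64 resampled visual tokens conditioning a text sequence.
visual = torch.randn(2, 64, 512)   # (batch, num_visual_tokens, dim)
text = torch.randn(2, 128, 512)    # (batch, text_seq_len, dim)
block = GatedCrossAttentionBlock(dim=512)
out = block(text, visual)          # same shape as `text`
print(out.shape)                   # torch.Size([2, 128, 512])
```

In the full model, blocks like this would be interleaved between the frozen language-model layers, while the visual tokens come from a Perceiver Resampler that compresses per-image or per-frame features into a fixed-length set; the sketch above only shows the conditioning step.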