Flamingo: a Visual Language Model for Few-Shot Learning

15 Nov 2022 | Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Flamingo is a family of Visual Language Models (VLMs) designed to adapt rapidly to novel tasks from only a few annotated examples. Its key architectural innovations are bridging powerful pretrained vision-only and language-only models, handling sequences of arbitrarily interleaved visual and textual data, and seamlessly ingesting images or videos as inputs. Flamingo models are trained on large-scale multimodal web corpora, which enables in-context few-shot learning.

The models achieve state-of-the-art few-shot performance on a wide range of open-ended vision and language tasks, outperforming fine-tuned models on 6 of 16 tasks while using far less task-specific training data. Flamingo's ability to handle interleaved text and visual sequences makes it suitable for tasks such as visual question-answering, captioning, and multiple-choice visual question-answering.

Architecturally, a Perceiver Resampler converts visual features into a fixed number of tokens, and gated cross-attention layers condition the frozen language model on these visual inputs. Flamingo's performance improves with model size and the number of shots, and it can be fine-tuned to set new state-of-the-art results on additional challenging benchmarks. The paper also discusses limitations, societal impacts, and directions for future work.
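The gated cross-attention conditioning is the part of the architecture most easily illustrated in code. Below is a minimal PyTorch sketch, in the spirit of Flamingo's gated cross-attention layers: text hidden states from the frozen language model attend to the visual tokens produced by the Perceiver Resampler, and zero-initialised tanh gates make the block an identity function at the start of training so the pretrained LM's behaviour is preserved. The class name, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Sketch of a tanh-gated cross-attention block (assumed structure).

    Text tokens (queries) attend to a fixed set of visual tokens
    (keys/values). Learnable scalar gates, initialised at zero, mean the
    block initially passes text features through unchanged.
    """

    def __init__(self, dim: int, num_heads: int = 8, ff_mult: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init
        self.ff = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, ff_mult * dim),
            nn.GELU(),
            nn.Linear(ff_mult * dim, dim),
        )
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention: text queries attend to visual keys/values.
        q = self.norm(text_tokens)
        attn_out, _ = self.attn(q, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        # Gated feed-forward sub-block, also residual.
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x


# Usage sketch: 64 resampled visual tokens conditioning a text sequence.
visual = torch.randn(2, 64, 512)   # (batch, num_visual_tokens, dim)
text = torch.randn(2, 128, 512)    # (batch, text_seq_len, dim)
block = GatedCrossAttentionBlock(dim=512)
out = block(text, visual)          # same shape as `text`
print(out.shape)                   # torch.Size([2, 128, 512])
```

In the full model, blocks like this would be interleaved between the frozen language-model layers, while the visual tokens come from a Perceiver Resampler that compresses per-image or per-frame features into a fixed-length set; the sketch above only shows the conditioning step.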