MDETR - Modulated Detection for End-to-End Multi-Modal Understanding


12 Oct 2021 | Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion
MDETR is an end-to-end modulated detector that performs object detection conditioned on a raw text query, such as a caption or question. It uses a transformer-based architecture to reason jointly over text and image by fusing the two modalities early in the model. The model is pre-trained on 1.3M text-image pairs from existing multi-modal datasets that provide explicit alignment between text phrases and image objects, then fine-tuned on downstream tasks such as phrase grounding, referring expression comprehension, and segmentation, achieving state-of-the-art results on benchmarks like GQA and CLEVR. By leveraging this pre-training, MDETR can also handle long-tailed object categories with few labeled instances. The model extends to visual question answering and referring expression segmentation, and it can be used as an object detector on a given label set when fine-tuned in a few-shot setting.

The architecture pairs a convolutional backbone for visual features (EfficientNet) with a pre-trained RoBERTa-base language model for text features. Both sets of features are projected into a shared embedding space and fed together into a transformer encoder-decoder.
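The following is a minimal PyTorch-style sketch of this modulated-detection setup, not the authors' implementation: a ResNet-50 stands in for the paper's EfficientNet backbone, positional encodings and extra prediction heads are omitted, and all module and variable names (ModulatedDetector, num_queries, bbox_head, ...) are illustrative.

```python
# Minimal sketch of an MDETR-style modulated detector; not the authors' code.
# Assumptions: ResNet-50 stands in for the EfficientNet backbone, positional
# encodings are omitted, and names like ModulatedDetector are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel

class ModulatedDetector(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        # Convolutional backbone producing a spatial feature map.
        cnn = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # (B, 2048, H/32, W/32)
        self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)      # project to shared space
        # Pre-trained language model for text features.
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.txt_proj = nn.Linear(self.text_encoder.config.hidden_size, d_model)
        # Transformer encoder-decoder over the fused image+text sequence.
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)        # learned object queries
        self.bbox_head = nn.Linear(d_model, 4)                        # (cx, cy, w, h) regression

    def forward(self, images, input_ids, attention_mask):
        B = images.size(0)
        # Image tokens: flatten the CNN feature map into a sequence.
        feat = self.img_proj(self.backbone(images))                   # (B, d, h, w)
        img_tokens = feat.flatten(2).transpose(1, 2)                  # (B, h*w, d)
        # Text tokens from RoBERTa, projected into the shared embedding space.
        txt = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        txt_tokens = self.txt_proj(txt.last_hidden_state)             # (B, L, d)
        # Early fusion: concatenate both modalities before the encoder.
        fused = torch.cat([img_tokens, txt_tokens], dim=1)
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.transformer(fused, queries)                         # (B, num_queries, d)
        return self.bbox_head(hs).sigmoid(), hs                       # boxes + query embeddings
```

The point of the early concatenation is that image and text tokens attend to each other throughout the encoder, rather than meeting only at a late fusion layer as in two-stream vision-language models.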
The model uses two additional loss functions to encourage alignment between image and text: a soft token prediction loss and a text-query contrastive alignment loss (the latter is sketched below). These losses help the model disambiguate between multiple occurrences of the same object category and align object queries with the text tokens they ground.

MDETR is evaluated on several downstream tasks, including referring expression comprehension, visual question answering, and phrase grounding, and achieves competitive performance across them; it is also tested in a few-shot transfer setting for long-tailed detection, where it improves performance on rare categories. Pre-training uses a combined dataset of images from Flickr30k, MS COCO, and Visual Genome, with annotations drawn from referring expression datasets, Visual Genome regions, and the GQA balanced train set. By enabling information flow between modalities at an earlier stage of the model, MDETR's approach allows for more tightly integrated multi-modal architectures. It achieves strong performance on datasets including CLEVR and outperforms existing methods on tasks such as visual question answering and referring expression segmentation.
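As a rough illustration of the text-query contrastive alignment term mentioned above, the sketch below computes a symmetric InfoNCE-style loss between L2-normalized object-query embeddings and text-token embeddings. It is a simplified stand-in under stated assumptions (a given positive mask linking queries to tokens, a fixed temperature), not the paper's exact formulation, and the soft token prediction loss is not shown.

```python
# Simplified sketch of a text-query contrastive alignment term; not the paper's
# exact loss. `pos_mask` is assumed to mark which text tokens refer to which
# object query, e.g. as produced by the matching step during training.
import torch
import torch.nn.functional as F

def contrastive_alignment(query_emb, token_emb, pos_mask, temperature=0.07):
    """query_emb: (Q, d) object-query embeddings,
    token_emb: (L, d) text-token embeddings,
    pos_mask:  (Q, L) boolean, True where token l refers to query q."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(token_emb, dim=-1)
    logits = q @ t.T / temperature                                   # (Q, L) similarities
    # Query-to-token direction: each query should score its positive tokens highest.
    log_p_q = logits.log_softmax(dim=-1)
    loss_q2t = -(log_p_q * pos_mask).sum(-1) / pos_mask.sum(-1).clamp(min=1)
    # Token-to-query direction: each token should prefer the queries it grounds.
    log_p_t = logits.log_softmax(dim=0)
    loss_t2q = -(log_p_t * pos_mask).sum(0) / pos_mask.sum(0).clamp(min=1)
    return loss_q2t.mean() + loss_t2q.mean()
```

In the paper, a term of this kind is applied after the queries are matched to ground-truth objects, so the positive mask links each matched query to the span of tokens describing that object; the soft token prediction loss plays the complementary role of having each query predict a distribution over those token positions.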