12 Oct 2021 | Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion
MDETR (Modulated Detection for End-to-End Multi-Modal Understanding) is an end-to-end modulated detector that combines object detection with natural language understanding. It detects objects in an image conditioned on a raw text query, such as a caption or a question, by reasoning jointly over text and image with a transformer-based architecture. The model is pre-trained on 1.3 million text-image pairs drawn from existing multi-modal datasets that provide explicit alignment between phrases in the text and objects in the image. After pre-training, MDETR is fine-tuned on several downstream tasks, including phrase grounding, referring expression comprehension, and segmentation, and achieves state-of-the-art results on popular benchmarks. The paper also evaluates the model as a few-shot object detector and shows that the pre-training helps with the long tail of object categories that have very few labeled instances. The code and models are available at <https://github.com/ashkamath/mdetr>.
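To make "explicit alignment between text phrases and objects" concrete, the following is a minimal sketch of what one aligned text-image pair might look like as a data structure. The class names, field names, box coordinates, and caption are all hypothetical and chosen for illustration; they are not taken from the MDETR code base or its dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedBox:
    """One object box plus the caption spans that refer to it (hypothetical layout)."""
    box_xyxy: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    char_spans: List[Tuple[int, int]]            # half-open [start, end) character offsets


@dataclass
class AlignedPair:
    """A caption together with its phrase-to-box alignment for a single image."""
    caption: str
    boxes: List[GroundedBox]

    def phrases_for(self, index: int) -> List[str]:
        # Recover the text phrases that the box at `index` is aligned with.
        return [self.caption[s:e] for s, e in self.boxes[index].char_spans]


def span_of(caption: str, phrase: str) -> Tuple[int, int]:
    # Locate the first occurrence of `phrase`; raises ValueError if it is absent.
    start = caption.index(phrase)
    return (start, start + len(phrase))


# Illustrative example: two objects, each tied to one phrase of the caption.
caption = "a pink elephant next to a small dog"
pair = AlignedPair(
    caption=caption,
    boxes=[
        GroundedBox((10.0, 20.0, 200.0, 180.0), [span_of(caption, "pink elephant")]),
        GroundedBox((220.0, 120.0, 300.0, 180.0), [span_of(caption, "small dog")]),
    ],
)

for i in range(len(pair.boxes)):
    print(pair.boxes[i].box_xyxy, "->", pair.phrases_for(i))
```

Storing character spans rather than class labels is what lets a free-form query name the object: the supervision ties each box to words in the text instead of to a fixed category vocabulary.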