End-to-End Object Detection with Transformers

28 May 2020 | Nicolas Carion*, Francisco Massa*, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko
This paper introduces DETR (DEtection TRansformer), a new method that casts object detection as a direct set prediction problem. Unlike traditional detectors that rely on hand-crafted components such as anchor generation and non-maximum suppression, DETR combines a transformer encoder-decoder architecture with a set-based global loss that forces unique predictions via bipartite matching. This design streamlines the detection pipeline and eliminates many manually tuned components (minimal sketches of both the matching step and the model appear below).

On the COCO dataset, DETR matches the accuracy of the well-established Faster R-CNN baseline, with notably strong performance on large objects and room for improvement on small ones. Training requires a long schedule and benefits from auxiliary decoding losses. The model is conceptually simple and needs no specialized library: any framework that provides standard CNN and transformer classes is sufficient.

The paper further extends DETR to panoptic segmentation, where a unified prediction covers both "things" and "stuff" classes and yields strong results. The authors conclude that DETR is a competitive, flexible, and extensible approach to detection, with clear avenues for future work on small-object performance and related tasks.
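The set-based loss starts from a one-to-one matching between DETR's fixed set of N predictions and the M ground-truth objects in an image, computed with the Hungarian algorithm. The sketch below shows that matching step for a single image (assuming N ≥ M) using `scipy.optimize.linear_sum_assignment`; the cost here combines class probability and an L1 box distance, while the paper's full cost additionally includes a generalized-IoU term and per-term weights, omitted for brevity.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    """Illustrative one-to-one matching for a single image.

    pred_logits: [N, num_classes + 1]  raw class scores (incl. "no object")
    pred_boxes:  [N, 4]                normalized (cx, cy, w, h)
    tgt_labels:  [M]                   ground-truth class indices
    tgt_boxes:   [M, 4]                normalized (cx, cy, w, h)
    """
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, tgt_labels]                    # [N, M]: -p(class_i)
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # [N, M]: L1 distance
    cost = cost_class + cost_bbox  # the paper adds a GIoU term and weights
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, tgt_idx  # matched (prediction, target) index pairs
```

Predictions left unmatched are trained to output the extra "no object" class; because each ground-truth box is claimed by exactly one query, duplicates are suppressed by the loss itself rather than by non-maximum suppression.

To back the claim that no specialized library is needed, the paper includes a short PyTorch demo of the full model. The sketch below follows the same structure (ResNet-50 backbone, a standard `nn.Transformer`, 100 learned object queries, and learned row/column positional embeddings); the hyperparameter defaults are illustrative, and batch handling is simplified to batch size 1 as in the authors' demo.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class DETR(nn.Module):
    """Simplified DETR: CNN backbone + transformer + prediction heads."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        # ResNet-50 backbone without the average-pool and fc layers
        self.backbone = nn.Sequential(
            *list(resnet50(pretrained=True).children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)  # project to model width
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers)
        # Heads: classes (+1 for "no object") and normalized box coordinates
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # 100 learned object queries; learned 2D positional encodings
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)  # [1, 2048, H, W]
        h = self.conv(x)           # [1, hidden_dim, H, W]
        H, W = h.shape[-2:]
        # Concatenate column/row embeddings into a [H*W, 1, hidden_dim]
        # positional encoding for the flattened feature map
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # Encoder input: image features + positions; decoder input: queries
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1))
        return self.linear_class(h), self.linear_bbox(h).sigmoid()
```

Calling `DETR(num_classes=91)` on a `[1, 3, 800, 1200]` image tensor returns 100 class distributions and 100 normalized boxes; at inference time, queries classified as "no object" are simply discarded, with no further post-processing.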