The paper introduces MS-DETR, an efficient training method for DETR (Detection Transformer), an end-to-end object detection framework. DETR generates multiple object candidates from image features and, through one-to-one supervision, selects one candidate for each ground-truth object. The traditional training procedure lacks direct supervision for these candidates, which leads to suboptimal candidate quality.

To address this, MS-DETR proposes mixed supervision, which combines one-to-one and one-to-many supervision. The one-to-many supervision is applied to the object queries of the primary decoder, so candidate generation improves without additional decoder branches or object queries. Better object queries in turn yield better detection candidates. Experiments show that MS-DETR outperforms existing DETR variants that use one-to-many supervision, such as Group DETR and Hybrid DETR, in both performance and efficiency, and that it requires less computation and memory than those variants. The method also complements other DETR variants, further improving their performance. The paper analyzes the method in detail through hyperparameter studies and ablation experiments that demonstrate the effectiveness of mixed supervision. The results are validated on the COCO dataset and show significant improvements in object detection and instance segmentation.
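The difference between the two kinds of supervision can be illustrated with a toy label-assignment sketch. This is not the paper's implementation: the score matrix, the function names, and the top-k plus threshold rule for one-to-many assignment are all assumptions made for illustration, and the brute-force search stands in for the Hungarian matching that DETR-style detectors typically use.

```python
from itertools import permutations


def one_to_one_assign(scores):
    """Return {gt_index: query_index}, giving each ground truth exactly one query.

    scores[q][g] is a hypothetical matching quality between query q and
    ground-truth object g. Brute force over permutations; toy sizes only.
    """
    num_queries = len(scores)
    num_gts = len(scores[0])
    best_total, best = float("-inf"), None
    for queries in permutations(range(num_queries), num_gts):
        total = sum(scores[q][g] for g, q in enumerate(queries))
        if total > best_total:
            best_total, best = total, queries
    return {g: q for g, q in enumerate(best)}


def one_to_many_assign(scores, k=2, threshold=0.5):
    """Return {gt_index: [query_index, ...]} with up to k queries per ground truth.

    Assumed rule: keep the k best-scoring queries per object, then drop
    those below the threshold. Conflicts (one query chosen for several
    objects) are not resolved in this sketch.
    """
    num_queries = len(scores)
    num_gts = len(scores[0])
    assignment = {}
    for g in range(num_gts):
        ranked = sorted(range(num_queries), key=lambda q: scores[q][g], reverse=True)
        assignment[g] = [q for q in ranked[:k] if scores[q][g] >= threshold]
    return assignment


# Toy scores: 4 object queries (rows) x 2 ground-truth objects (columns).
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.3, 0.4],
]

one_to_one = one_to_one_assign(scores)    # {0: 0, 1: 2}
one_to_many = one_to_many_assign(scores)  # {0: [0, 1], 1: [2]}
```

In this example query 1 receives no positive target under one-to-one supervision even though it matches object 0 well, whereas one-to-many supervision also treats it as a positive. Under mixed supervision as described above, both kinds of target would be applied to the same set of queries in the primary decoder.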

MS-DETR: Efficient DETR Training with Mixed Supervision

8 Jan 2024 | Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang