24 Jul 2024 | Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu
RT-DETRv2 is an improved real-time detection transformer that builds on the previous state-of-the-art real-time detector, RT-DETR. It introduces a set of "bag-of-freebies" to enhance flexibility and practicality, and optimizes the training strategy to achieve better performance.

To improve flexibility, RT-DETRv2 assigns a distinct number of sampling points to features at different scales in the deformable attention module, enabling selective multi-scale feature extraction. To enhance practicality, it proposes an optional discrete sampling operator to replace the grid_sample operator specific to RT-DETR, eliminating the deployment constraints typically associated with detection transformers. Finally, the training strategy is refined with dynamic data augmentation and scale-adaptive hyperparameter customization, improving performance without loss of speed. The sketches below illustrate each of these three changes.
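To make the per-scale sampling idea concrete, here is a minimal sketch of a deformable-attention-style sampler in which each feature level gets its own number of sampling points. The module and parameter names (`PerScaleDeformableSampling`, `num_points_per_level`, the 0.05 offset scale) are illustrative assumptions rather than the paper's implementation, and attention heads, value projections, and per-level offset normalization are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerScaleDeformableSampling(nn.Module):
    """Deformable-style sampling with a distinct point count per feature level."""

    def __init__(self, embed_dim=256, num_levels=3, num_points_per_level=(4, 4, 2)):
        super().__init__()
        assert len(num_points_per_level) == num_levels
        self.num_points_per_level = num_points_per_level
        total = sum(num_points_per_level)
        # One 2-D offset and one scalar weight per sampling point.
        self.offset_head = nn.Linear(embed_dim, total * 2)
        self.weight_head = nn.Linear(embed_dim, total)

    def forward(self, queries, ref_points, feature_maps):
        # queries: (B, Q, C); ref_points: (B, Q, 2) in [0, 1];
        # feature_maps: list of num_levels tensors shaped (B, C, H_l, W_l).
        B, Q, C = queries.shape
        offsets = self.offset_head(queries).view(B, Q, -1, 2)   # (B, Q, P_total, 2)
        weights = self.weight_head(queries).softmax(-1)         # (B, Q, P_total)
        out = queries.new_zeros(B, Q, C)
        start = 0
        for feat, n_pts in zip(feature_maps, self.num_points_per_level):
            # 0.05 is an arbitrary illustrative offset scale; real deformable
            # attention normalizes offsets by each level's spatial size.
            pts = ref_points[:, :, None, :] + 0.05 * offsets[:, :, start:start + n_pts, :]
            grid = pts * 2.0 - 1.0                              # map [0, 1] -> [-1, 1]
            sampled = F.grid_sample(feat, grid, mode="bilinear",
                                    align_corners=False)        # (B, C, Q, n_pts)
            w = weights[:, :, start:start + n_pts]              # (B, Q, n_pts)
            out = out + torch.einsum("bcqp,bqp->bqc", sampled, w)
            start += n_pts
        return out
```

With `num_points_per_level=(4, 4, 2)`, for example, the coarsest level contributes fewer sampled features per query, which is the kind of selective multi-scale extraction the paper describes.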
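The discrete sampling idea can be sketched in the same spirit: rather than bilinearly interpolating with `F.grid_sample`, each sampling location is rounded to its nearest pixel and gathered directly, so the exported graph no longer depends on the grid_sample operator. This is an assumption-laden sketch of the idea, not the paper's exact operator:

```python
import torch

def discrete_sample(feat, grid):
    # feat: (B, C, H, W); grid: (B, Q, P, 2) with x, y in [-1, 1], using the
    # same coordinate convention as F.grid_sample(..., align_corners=False).
    B, C, H, W = feat.shape
    # Map normalized coordinates to pixel indices, round to the nearest pixel.
    x = ((grid[..., 0] + 1.0) * W * 0.5 - 0.5).round().long().clamp_(0, W - 1)  # (B, Q, P)
    y = ((grid[..., 1] + 1.0) * H * 0.5 - 0.5).round().long().clamp_(0, H - 1)  # (B, Q, P)
    flat = feat.flatten(2)                                        # (B, C, H*W)
    idx = (y * W + x).flatten(1)                                  # (B, Q*P)
    gathered = flat.gather(2, idx[:, None, :].expand(-1, C, -1))  # (B, C, Q*P)
    return gathered.view(B, C, grid.shape[1], grid.shape[2])      # (B, C, Q, P)
```

Note that rounding is non-differentiable with respect to the coordinates, which fits the operator being described as optional: one plausible usage is to train with grid_sample and swap in discrete sampling only for deployment.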
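On the training side, dynamic data augmentation can be pictured as an epoch-conditioned transform selector, and scale-adaptive hyperparameter customization as a small per-model-size configuration table. The transforms, thresholds, and values below are illustrative assumptions, not the paper's settings:

```python
import torchvision.transforms as T

def build_transforms(epoch: int, total_epochs: int, strong_until: float = 0.9):
    """Strong photometric augmentation early in training, a weak pipeline near the end."""
    weak = T.Compose([T.ToTensor()])
    strong = T.Compose([
        T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
        T.RandomGrayscale(p=0.05),
        T.ToTensor(),
    ])
    # Drop the strong augmentations for the final stretch of training so the
    # model adapts to the clean distribution it will see at inference time.
    return strong if epoch < int(strong_until * total_epochs) else weak

# Hypothetical per-scale hyperparameters: smaller detectors get a lighter
# augmentation schedule and their own optimizer settings.
SCALE_HPARAMS = {
    "S": {"strong_until": 0.8, "base_lr": 4e-4},
    "M": {"strong_until": 0.9, "base_lr": 2e-4},
    "L": {"strong_until": 0.9, "base_lr": 1e-4},
}
```

Geometric, box-aware transforms are omitted here to keep the sketch self-contained; a real detection pipeline would also update the bounding boxes alongside the images.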
Overall, RT-DETRv2 provides an improved baseline with bag-of-freebies for RT-DETR, increasing its flexibility and practicality, while the proposed training strategies optimize both performance and training cost. The framework of RT-DETRv2 remains the same as that of RT-DETR; only the deformable attention module of the decoder is modified. The method combines distinct per-scale sampling points, optional discrete sampling, and scale-adaptive hyperparameter customization. The model is trained on the COCO train2017 dataset and validated on COCO val2017, where RT-DETRv2 outperforms RT-DETR across detector scales without loss of speed.

The ablation studies show that reducing the number of sampling points does not cause significant performance degradation, and that replacing grid_sample with discrete_sample causes no noticeable reduction in AP50 on val2017 while eliminating the deployment constraints of DETRs. RT-DETRv2 is thus a novel, end-to-end, real-time detector that marks a significant advancement for the DETR family.