23 May 2024 | Ao Wang, Hui Chen*, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding*
YOLOv10 is a real-time end-to-end object detection model that improves both performance and efficiency by optimizing post-processing and model architecture together. It eliminates the need for non-maximum suppression (NMS) during inference through a consistent dual assignment strategy, which provides rich supervision during training and high efficiency at inference. It also introduces a holistic efficiency-accuracy driven model design strategy that optimizes the components of YOLOs from both perspectives. For efficiency, it uses lightweight classification heads, spatial-channel decoupled downsampling, and rank-guided block design to reduce computational overhead. For accuracy, it employs large-kernel convolutions and partial self-attention modules to improve performance at minimal cost. YOLOv10 achieves state-of-the-art performance and efficiency across model scales: YOLOv10-S is 1.8× faster than RT-DETR-R18 at similar AP on COCO, and YOLOv10-B has 46% lower latency than YOLOv9-C at the same performance, making it an effective real-time object detection solution.
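For context, the post-processing step that YOLOv10 removes can be sketched in a few lines. The following is a generic greedy NMS in plain Python, not code from the YOLOv10 authors; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_thresh: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already-kept box
    by more than iou_thresh.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object, plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the lower-scoring duplicate (index 1) is dropped
```

The loop compares each candidate against every box already kept, so its cost grows with the number of overlapping predictions. A detector that emits roughly one prediction per object at inference, as the NMS-free design described above aims to, can skip this step entirely.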