SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

31 May 2024 | Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, Sifa Zheng
**Abstract:** The modular autonomous driving system, which decouples tasks such as perception, prediction, and planning, suffers from information loss and error accumulation. In contrast, end-to-end paradigms unify these tasks into a fully differentiable framework, allowing optimization in a planning-oriented spirit. However, existing methods often rely on computationally expensive bird's-eye-view (BEV) features and straightforward designs for prediction and planning, leading to suboptimal performance and efficiency. To address this, we propose SparseDrive, a new paradigm that leverages sparse scene representation. SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking, and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. The parallel motion planner performs motion prediction and planning simultaneously, incorporating a hierarchical planning selection strategy with a collision-aware rescore module to ensure safe and rational trajectory selection. SparseDrive outperforms previous state-of-the-art methods on all tasks while achieving significantly higher training and inference efficiency.

**Introduction:** Traditional autonomous driving systems are modular, which leads to information loss and cumulative errors. End-to-end paradigms integrate all tasks into a holistic model, but existing methods often rely on costly BEV features and straightforward designs, limiting performance and efficiency. SparseDrive addresses these issues with a sparse-centric paradigm that unifies multiple tasks through sparse instance representation. It features a symmetric sparse perception module and a parallel motion planner, enabling efficient and safe trajectory planning.

**Method:** SparseDrive's architecture comprises an image encoder, a symmetric sparse perception module, and a parallel motion planner. The image encoder processes multi-view images into feature maps, which are then aggregated into instance features for sparse perception. The parallel motion planner predicts multi-modal trajectories for both the surrounding agents and the ego vehicle, then selects a safe trajectory through a hierarchical planning selection strategy with a collision-aware rescore module.

**Experiments:** SparseDrive is evaluated on the nuScenes dataset, achieving superior performance in perception tasks (3D detection, multi-object tracking, online mapping) as well as in motion prediction and planning. It also demonstrates high training and inference efficiency, outperforming previous state-of-the-art methods.

**Conclusion:** SparseDrive achieves both remarkable performance and high efficiency in end-to-end autonomous driving. Future work will focus on improving single-task performance and expanding dataset scale to fully exploit the potential of end-to-end methods.
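To make the collision-aware rescore idea concrete, here is a minimal NumPy sketch of how multi-modal ego trajectory candidates could be demoted when they pass too close to predicted agent trajectories, then selected by score. This is an illustrative simplification, not the paper's actual implementation: the function names, the fixed `collision_radius`, and the additive `penalty` are assumptions for the example; SparseDrive's real module operates on learned scores inside the network.

```python
import numpy as np

def collision_aware_rescore(ego_trajs, ego_scores, agent_trajs,
                            collision_radius=2.0, penalty=-1.0):
    """Penalize ego trajectory modes that come within `collision_radius`
    of any predicted agent position at any timestep.

    ego_trajs:   (M, T, 2) candidate ego trajectories, (x, y) per timestep
    ego_scores:  (M,)      initial confidence scores per mode
    agent_trajs: (N, T, 2) predicted trajectories of surrounding agents
    """
    rescored = ego_scores.copy()
    for m in range(ego_trajs.shape[0]):
        # Distance from ego mode m to every agent at every timestep: (N, T)
        d = np.linalg.norm(ego_trajs[m][None, :, :] - agent_trajs, axis=-1)
        if (d < collision_radius).any():
            rescored[m] += penalty  # demote colliding modes
    return rescored

def select_trajectory(ego_trajs, ego_scores, agent_trajs):
    """Hierarchical selection collapsed to one step: rescore, then argmax."""
    scores = collision_aware_rescore(ego_trajs, ego_scores, agent_trajs)
    return ego_trajs[int(np.argmax(scores))]
```

In this toy form, a high-confidence mode that intersects a predicted agent path loses its advantage to a slightly lower-scored but collision-free alternative, which is the safety behavior the rescore module is designed to enforce.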