BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

2022-07-13 | Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, Jifeng Dai
BEVFormer is a novel framework that learns bird's-eye-view (BEV) representations from multi-camera images using spatiotemporal transformers. The framework exploits both spatial and temporal information through predefined grid-shaped BEV queries: a spatial cross-attention module lets each BEV query extract spatial features from the relevant regions across camera views, while a temporal self-attention module recurrently fuses BEV history from previous frames. The resulting BEV features can simultaneously support multiple 3D perception tasks, including 3D object detection and map segmentation, by attaching different task-specific heads for end-to-end training.

On the nuScenes test set, BEVFormer achieves a state-of-the-art 56.9% NDS, outperforming previous best methods by 9.0 points. It also improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions, and its results show that it can outperform LiDAR-based baselines on certain tasks. The temporal module adds negligible computational overhead. Evaluated on challenging benchmarks including nuScenes and Waymo, BEVFormer shows consistent improvements over prior methods, demonstrating the effectiveness of spatiotemporal information for visual perception models. The code is available at https://github.com/zhiqi-li/BEVFormer.
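To make the query-based design concrete, below is a minimal PyTorch sketch of one BEVFormer-style encoder layer. It uses standard multi-head attention as a stand-in for the deformable attention with 3D-to-2D reference-point projection described in the paper, and all names and shapes (bev_h, bev_w, embed_dim, and so on) are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of a BEVFormer-style encoder layer (illustrative only).
# The real model uses deformable attention with projected reference points;
# here plain multi-head attention stands in for both attention modules.
import torch
import torch.nn as nn


class BEVFormerLayerSketch(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, bev_h=50, bev_w=50):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # Learnable grid-shaped BEV queries, one per BEV grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        # Temporal self-attention: current queries attend to history BEV.
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                   batch_first=True)
        # Spatial cross-attention: queries attend to multi-camera features.
        self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                  batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim))
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, img_feats, prev_bev=None):
        # img_feats: (B, num_cams * H * W, C) flattened camera features.
        # prev_bev:  (B, bev_h * bev_w, C) BEV features from the previous
        #            frame, assumed already aligned to the current ego pose.
        B = img_feats.size(0)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        # Temporal self-attention; degenerates to plain self-attention at
        # the first frame, when no BEV history exists.
        hist = prev_bev if prev_bev is not None else q
        q = self.norm1(q + self.temporal_attn(q, hist, hist)[0])
        # Spatial cross-attention over all camera features.
        q = self.norm2(q + self.spatial_attn(q, img_feats, img_feats)[0])
        # Feed-forward refinement of the BEV features.
        return self.norm3(q + self.ffn(q))


# Usage: two frames of dummy camera features; the second frame reuses the
# first frame's BEV output as temporal history.
layer = BEVFormerLayerSketch()
feats_t0 = torch.randn(2, 6 * 15 * 25, 256)  # 6 cameras, 15x25 feature maps
bev_t0 = layer(feats_t0)                     # -> (2, 2500, 256)
bev_t1 = layer(torch.randn(2, 6 * 15 * 25, 256), prev_bev=bev_t0)
```

The key design choice this sketch preserves is the recurrent use of BEV features: each frame's output doubles as the temporal memory for the next frame, so history accumulates without stacking multiple past frames.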