BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

2022-07-13 | Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, Jifeng Dai
BEVFormer is a framework for autonomous driving that uses spatiotemporal transformers to generate bird's-eye-view (BEV) representations from multi-camera inputs. It relies on predefined grid-shaped BEV queries that look up information in both space and time: spatial cross-attention aggregates spatial features from the different camera views, and temporal self-attention recurrently fuses historical BEV information. This approach significantly improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. BEVFormer achieves state-of-the-art performance on the nuScenes test set with 56.9% NDS (nuScenes Detection Score), outperforming previous methods by 9.0 points. The code for BEVFormer is available at <https://github.com/zhiqi-li/BEVFormer>.
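To make the data flow in the summary above concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. Uniform averaging stands in for the learned attention weights, ego-motion alignment of the previous BEV is skipped, and the `Sampler` camera interface and all function names (`temporal_fuse`, `spatial_aggregate`, `bev_step`) are hypothetical, introduced only for illustration.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical camera interface (an assumption, not BEVFormer's API):
# maps a 3D point to an image feature, or None when the point falls
# outside that camera's field of view.
Sampler = Callable[[float, float, float], Optional[Vector]]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def temporal_fuse(query: Vector, prev: Optional[Vector]) -> Vector:
    """Stand-in for temporal self-attention: blend the query with the
    previous BEV feature of the same cell (no history on the first frame)."""
    return list(query) if prev is None else mean([query, prev])


def spatial_aggregate(x: float, y: float, heights: Sequence[float],
                      cameras: Sequence[Sampler]) -> Optional[Vector]:
    """Stand-in for spatial cross-attention at BEV location (x, y):
    sample a pillar of points at several heights, then average over
    the camera views that actually see the pillar."""
    per_view = []
    for sample in cameras:
        hits = [f for f in (sample(x, y, z) for z in heights) if f is not None]
        if hits:
            per_view.append(mean(hits))
    return mean(per_view) if per_view else None


def bev_step(queries, prev_bev, cameras, heights=(0.0, 1.0, 2.0)):
    """One toy update of an H x W grid of BEV queries."""
    bev = []
    for i, row in enumerate(queries):
        out_row = []
        for j, q in enumerate(row):
            prev = prev_bev[i][j] if prev_bev is not None else None
            fused = temporal_fuse(q, prev)
            spatial = spatial_aggregate(float(j), float(i), heights, cameras)
            if spatial is not None:
                # Residual add of the aggregated image features.
                fused = [a + b for a, b in zip(fused, spatial)]
            out_row.append(fused)
        bev.append(out_row)
    return bev


# Two toy cameras: one sees cells with x < 1, the other sees x >= 1.
left = lambda x, y, z: [1.0, 0.0] if x < 1 else None
right = lambda x, y, z: [0.0, 1.0] if x >= 1 else None

queries = [[[0.0, 0.0], [0.0, 0.0]]]                # 1 x 2 grid, 2 channels
bev_t0 = bev_step(queries, None, [left, right])     # first frame, no history
bev_t1 = bev_step(queries, bev_t0, [left, right])   # reuses bev_t0 as history
```

In this toy setup each BEV cell only receives features from the camera that sees it, and the second frame carries over part of the first frame's BEV. That carry-over is the recurrent fusion the summary describes, in a heavily simplified form.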