BEVFormer is a novel framework for autonomous driving that leverages spatiotemporal transformers to generate bird's-eye-view (BEV) representations from multi-camera inputs. The framework integrates spatial and temporal information through predefined grid-shaped BEV queries: spatial cross-attention aggregates features from the different camera views, while temporal self-attention recurrently fuses historical BEV information. This design significantly improves velocity estimation and object recall under low-visibility conditions. BEVFormer achieves state-of-the-art performance on the nuScenes test set with 56.9% NDS, outperforming previous methods by 9.0 points. The code for BEVFormer is available at <https://github.com/zhiqi-li/BEVFormer>.
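The two attention stages described above can be sketched at a high level. This is a toy NumPy illustration of the data flow only, with shapes and names chosen for clarity; the actual model uses multi-head deformable attention, with BEV reference points projected into each camera view to select sampling locations, rather than the dense dot-product attention shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (stand-in for deformable attention).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Grid-shaped BEV queries: an H x W grid of C-dimensional embeddings,
# flattened to (H*W, C). Sizes here are illustrative, not the paper's.
H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
bev_queries = rng.normal(size=(H * W, C))

# Temporal self-attention: BEV queries attend to the previous
# frame's BEV features, recurrently fusing history.
prev_bev = rng.normal(size=(H * W, C))
bev = attention(bev_queries, prev_bev, prev_bev)

# Spatial cross-attention: the updated queries then aggregate
# image features from multiple camera views.
num_cams, tokens_per_cam = 6, 10
cam_feats = rng.normal(size=(num_cams * tokens_per_cam, C))
bev = attention(bev, cam_feats, cam_feats)

print(bev.shape)  # one fused BEV feature per grid cell: (16, 8)
```

The resulting per-cell BEV features would then feed task heads (e.g. 3D detection or map segmentation), which is where the unified BEV representation pays off.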