InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

29 Apr 2024 | Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall
InverseMatrixVT3D is an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. The paper introduces a novel projection matrix-based approach that stores static mapping relationships in two projection matrices and uses matrix multiplication to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes: the multi-view image feature maps are multiplied by two sparse projection matrices. A sparse matrix handling technique is introduced to optimize GPU memory usage. Additionally, a global-local attention fusion module integrates the global BEV features with the local 3D feature volumes to obtain the final 3D volume, and a multi-scale supervision mechanism further enhances performance. Extensive experiments on the nuScenes and SemanticKITTI datasets show that the approach is simple and effective, achieving top performance in detecting vulnerable road users (VRUs), a capability crucial for autonomous driving and road safety. The code is available at https://github.com/DanielMing123/InverseMatrixVT3D.
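The key idea is that view transformation reduces to two sparse matrix multiplications against precomputed projection matrices. The sketch below (PyTorch) illustrates how that could look; the shapes, function names, and the randomly filled projection matrices are illustrative assumptions standing in for the camera-geometry lookups the paper precomputes, not the authors' implementation.

```python
# Minimal sketch of the projection-matrix view transformation.
# Assumption: the two projection matrices are precomputed offline from
# camera geometry; here they are filled with random indices for illustration.
import torch

def build_volume_and_bev(img_feats, vt_3d, vt_bev, volume_shape, bev_shape):
    """Lift multi-view image features into a 3D volume and a BEV plane.

    img_feats: (N_cam * H * W, C) flattened multi-view feature maps.
    vt_3d:     sparse (X*Y*Z, N_cam*H*W) voxel-to-pixel projection matrix.
    vt_bev:    sparse (X*Y,   N_cam*H*W) BEV-to-pixel projection matrix.
    """
    vol = torch.sparse.mm(vt_3d, img_feats)   # (X*Y*Z, C)
    bev = torch.sparse.mm(vt_bev, img_feats)  # (X*Y,   C)
    X, Y, Z = volume_shape
    return vol.view(X, Y, Z, -1), bev.view(*bev_shape, -1)

# Toy usage: 6 cameras, 16x32 feature maps, 64 channels, 50x50x4 grid.
N, H, W, C = 6, 16, 32, 64
X, Y, Z = 50, 50, 4
feats = torch.randn(N * H * W, C)
# Hypothetical sparse matrices: each row (a voxel or BEV cell) aggregates
# the image pixels (columns) it projects onto.
idx = torch.stack([torch.randint(X * Y * Z, (10_000,)),
                   torch.randint(N * H * W, (10_000,))])
vt_3d = torch.sparse_coo_tensor(
    idx, torch.full((10_000,), 0.1), (X * Y * Z, N * H * W)).coalesce()
idx_b = torch.stack([torch.randint(X * Y, (5_000,)),
                     torch.randint(N * H * W, (5_000,))])
vt_bev = torch.sparse_coo_tensor(
    idx_b, torch.full((5_000,), 0.1), (X * Y, N * H * W)).coalesce()
vol, bev = build_volume_and_bev(feats, vt_3d, vt_bev, (X, Y, Z), (X, Y))
```

Because the mapping between voxels and pixels is static, the sparse matrices are built once and reused at every forward pass, which is what makes the transformation cheap relative to depth estimation or transformer querying.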
The method uses multi-camera images to generate a dense 3D occupancy grid of the surrounding scene. It first extracts multi-scale features with a 2D backbone network, then constructs multi-scale 3D feature volumes and BEV planes via the projection matrices. A global-local attention fusion module merges information from these features to produce the final 3D volume. The 3D volumes are upscaled with 3D deconvolutions and integrated with higher-resolution volumes through skip connections. Because the approach requires neither depth estimation nor transformer-based querying, the 3D volume generation process is simple and efficient. It achieves competitive performance in 3D semantic occupancy prediction and monocular semantic scene completion, evaluated with intersection-over-union (IoU) metrics, and outperforms several state-of-the-art methods in detecting VRUs. Its efficiency is demonstrated through model size and inference time comparisons, ablation studies show that the global-local attention fusion module and the multi-scale mechanism are crucial to performance, and qualitative analysis of challenging scenes confirms its effectiveness.
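The coarse-to-fine decoder described above (3D deconvolution upsampling plus skip connections across scales) can be sketched as follows. Channel sizes, kernel choices, and the `VolumeDecoder` class are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a coarse-to-fine 3D volume decoder: each scale is
# upsampled with a 3D deconvolution and fused with the next-higher-
# resolution volume via a skip connection.
import torch
import torch.nn as nn

class VolumeDecoder(nn.Module):
    def __init__(self, channels=(256, 128, 64)):
        super().__init__()
        self.up = nn.ModuleList(
            nn.ConvTranspose3d(c_in, c_out, kernel_size=2, stride=2)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.fuse = nn.ModuleList(
            nn.Conv3d(2 * c_out, c_out, kernel_size=3, padding=1)
            for c_out in channels[1:]
        )

    def forward(self, volumes):
        """volumes: list of 3D feature volumes, coarsest first,
        each of shape (B, C_i, X_i, Y_i, Z_i) with doubling resolution."""
        x = volumes[0]
        for up, fuse, skip in zip(self.up, self.fuse, volumes[1:]):
            x = up(x)                              # 3D deconvolution: 2x upsample
            x = fuse(torch.cat([x, skip], dim=1))  # skip connection + fusion
        return x

# Toy usage: a three-scale volume pyramid.
vols = [torch.randn(1, 256, 13, 13, 2),
        torch.randn(1, 128, 26, 26, 4),
        torch.randn(1, 64, 52, 52, 8)]
out = VolumeDecoder()(vols)  # (1, 64, 52, 52, 8)
```

Multi-scale supervision would then attach an occupancy prediction head at each intermediate scale, which matches the paper's reported finding that the multi-scale mechanism contributes materially to final accuracy.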