This paper explores the adaptation of Vision Foundation Models (VFMs) to stereo matching, a key technique for 3D environment perception in intelligent vehicles. Traditional approaches typically rely on Convolutional Neural Networks (CNNs), whereas VFMs, particularly those based on Vision Transformers (ViTs) and pre-trained on large unlabeled datasets, offer more general-purpose visual features. However, VFMs often underperform on geometric vision tasks such as stereo matching. The study introduces ViTAS, a ViT adapter designed to enhance stereo matching performance. ViTAS consists of three modules: a Spatial Differentiation Module (SDM), a Patch Attention Fusion Module (PAFM), and a Cross-Attention Module (CAM). Together, these modules improve feature extraction and aggregation, yielding superior performance on the KITTI Stereo 2012 dataset, where ViTAS outperforms the second-best network, StereoBase, by approximately 7.9% in the percentage of error pixels. Extensive experiments across multiple datasets further demonstrate the generalizability and robustness of ViTAS. The paper concludes by highlighting the value of leveraging VFM features for stereo matching and the need for further research to optimize the cross-attention mechanism.
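To make the adapter's structure concrete, the following is a minimal, hypothetical PyTorch sketch of how ViT-adapter modules of this kind can be wired together: a spatial pyramid, a fusion stage that injects frozen VFM patch tokens into spatial features, and a cross-view attention stage. All module internals, dimensions, and tensor shapes here are assumptions for illustration only; the paper's actual SDM, PAFM, and CAM designs are not reproduced.

```python
# Hypothetical sketch of a ViT adapter for stereo matching, loosely
# following the SDM -> PAFM -> CAM pipeline described above.
# Internals and dimensions are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class SpatialDifferentiationModule(nn.Module):
    """Assumed role: extract a multi-scale spatial feature pyramid from the image."""

    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(3 if i == 0 else dim, dim, 3, stride=2, padding=1)
            for i in range(3)
        ])

    def forward(self, img):
        feats, x = [], img
        for conv in self.convs:
            x = torch.relu(conv(x))
            feats.append(x)  # 1/2, 1/4, 1/8 resolution features
        return feats


class PatchAttentionFusionModule(nn.Module):
    """Assumed role: fuse frozen VFM (ViT) patch tokens into a spatial feature map."""

    def __init__(self, dim=256, vit_dim=768):
        super().__init__()
        self.proj = nn.Linear(vit_dim, dim)  # project ViT tokens to adapter width
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, spatial, vit_tokens):
        b, c, h, w = spatial.shape
        q = spatial.flatten(2).transpose(1, 2)   # (B, HW, C): spatial features as queries
        kv = self.proj(vit_tokens)               # (B, N, C): ViT tokens as keys/values
        fused, _ = self.attn(q, kv, kv)
        return (q + fused).transpose(1, 2).reshape(b, c, h, w)


class CrossAttentionModule(nn.Module):
    """Assumed role: exchange information between left- and right-view features."""

    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, left, right):
        b, c, h, w = left.shape
        ql = left.flatten(2).transpose(1, 2)     # left view queries right view
        kr = right.flatten(2).transpose(1, 2)
        out, _ = self.attn(ql, kr, kr)
        return (ql + out).transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    sdm, pafm, cam = SpatialDifferentiationModule(), PatchAttentionFusionModule(), CrossAttentionModule()
    left, right = torch.randn(1, 3, 64, 128), torch.randn(1, 3, 64, 128)
    # Stand-ins for frozen VFM patch tokens (shape is an assumption).
    vit_left, vit_right = torch.randn(1, 196, 768), torch.randn(1, 196, 768)
    fl = pafm(sdm(left)[-1], vit_left)    # fuse ViT tokens at the 1/8 scale
    fr = pafm(sdm(right)[-1], vit_right)
    matched = cam(fl, fr)                 # cross-view features for cost-volume construction
    print(matched.shape)                  # torch.Size([1, 256, 8, 16])
```

In this sketch the ViT backbone is treated as frozen and queried only through attention, which reflects the general adapter idea of reusing pre-trained VFM features rather than fine-tuning the full model; how ViTAS actually combines scales and views is specified in the paper itself.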