This paper introduces ViTAS, a novel approach for adapting vision foundation models (VFMs) to stereo matching. Stereo matching is crucial for 3D environment perception in intelligent vehicles; while traditional CNNs have dominated the field, interest is growing in VFMs, particularly those based on Vision Transformers (ViTs) and pre-trained through self-supervision. However, VFMs often struggle with geometric vision tasks such as stereo matching. This study explores how to adapt VFMs for stereo matching and proposes ViTAS, which combines three modules: spatial differentiation, patch attention fusion, and cross-attention. ViTAStereo, which integrates ViTAS with cost volume-based stereo matching, achieves top performance on the KITTI Stereo 2012 dataset, outperforming the second-best network, StereoBase, by approximately 7.9% in error pixel percentage. Additional experiments across diverse scenarios demonstrate superior generalizability compared to other state-of-the-art approaches. The study argues that stereo matching networks relying solely on cross-attention mechanisms generalize poorly because they lack cost volumes, and it highlights the importance of cost volumes in achieving generalizable stereo matching. ViTAS leverages the strengths of VFMs to enhance feature distinctiveness, reducing matching ambiguities. The results show that ViTAStereo achieves high accuracy and generalizability, and the study concludes that ViTAS represents a significant advancement in stereo matching through its effective use of VFMs.
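For readers unfamiliar with the cost volumes the abstract emphasizes, the following is a minimal illustrative sketch of a generic correlation-based cost volume, built by comparing left-image features against right-image features shifted across candidate disparities. This is a common baseline construction, not the paper's specific ViTAStereo formulation; the function name and the use of NumPy are assumptions for illustration only.

```python
import numpy as np

def build_cost_volume(feat_left: np.ndarray, feat_right: np.ndarray,
                      max_disp: int) -> np.ndarray:
    """Correlation cost volume (illustrative, not the paper's exact design).

    feat_left, feat_right: feature maps of shape (C, H, W).
    Returns a volume of shape (max_disp, H, W) where entry [d, y, x] is the
    mean feature correlation between left pixel (y, x) and right pixel (y, x-d).
    """
    C, H, W = feat_left.shape
    volume = np.zeros((max_disp, H, W), dtype=feat_left.dtype)
    for d in range(max_disp):
        if d == 0:
            # No shift: correlate features at identical positions.
            volume[d] = (feat_left * feat_right).mean(axis=0)
        else:
            # Shift right features by d; left columns [0, d) have no match.
            volume[d, :, d:] = (feat_left[:, :, d:] *
                                feat_right[:, :, :-d]).mean(axis=0)
    return volume
```

A matching network would then regularize this volume and take, for each pixel, the disparity with the highest correlation; the abstract's argument is that carrying this explicit per-disparity matching evidence is what lets cost volume-based networks generalize better than purely cross-attention designs.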