1 Mar 2024 | Xianqi Wang, Gangwei Xu, Hao Jia, Xin Yang
Selective-Stereo is a novel method for stereo matching that addresses the limitations of conventional recurrent update units by introducing a Selective Recurrent Unit (SRU) and a Contextual Spatial Attention (CSA) module. Existing iterative stereo matching methods, such as RAFT-Stereo and IGEV-Stereo, struggle to capture both high-frequency information in edge regions and low-frequency information in smooth regions because their recurrent units have fixed receptive fields, leading to loss of detail at edges and false matches in textureless areas. Selective-Stereo improves on these methods by adaptively fusing hidden disparity information across multiple frequencies, enabling the network to better handle both edge and smooth regions.
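The core fusion idea can be sketched in a few lines: hidden states from a high-frequency branch (small receptive field) and a low-frequency branch (large receptive field) are blended per pixel by an attention map. This is a minimal NumPy sketch of that idea; the function name, array shapes, and two-branch setup are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def selective_fuse(h_high, h_low, attention):
    """Blend two hidden states per pixel.

    h_high    : hidden state from the small-receptive-field (high-frequency) branch
    h_low     : hidden state from the large-receptive-field (low-frequency) branch
    attention : per-pixel weights in [0, 1]; near 1 at edges, near 0 in smooth regions
    """
    # Convex combination: edge pixels keep high-frequency detail,
    # textureless pixels rely on the smoother low-frequency branch.
    return attention * h_high + (1.0 - attention) * h_low

# Toy example: attention = 1 selects the high-frequency branch entirely.
h_edge = selective_fuse(np.ones((2, 2)), np.zeros((2, 2)), np.ones((2, 2)))
```

In the full network this fusion would be applied inside each recurrent update step, with the attention map produced by the CSA module described next.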
The SRU uses attention maps generated by the CSA module to guide the fusion of hidden disparity information from different frequencies. The CSA module extracts these attention maps from the context features, letting the network adaptively select suitable information for different image regions. This lets the network draw on different receptive fields at different frequencies, while also performing a secondary filtering step that suppresses noise from the local cost volume.
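A plausible minimal form of the CSA step is a small convolution over the context features followed by a sigmoid, yielding per-pixel weights in (0, 1). The sketch below assumes a single-channel context map and one 3x3 kernel purely for illustration; the paper's module is a learned multi-channel network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contextual_spatial_attention(context, weights, bias=0.0):
    """Produce a per-pixel attention map from context features.

    context : 2-D array of context features (H x W), single channel for simplicity
    weights : small odd-sized kernel (e.g. 3x3), standing in for learned conv weights
    Returns an H x W map with values in (0, 1).
    """
    H, W = context.shape
    kh, kw = weights.shape
    pad = kh // 2
    padded = np.pad(context, pad)  # zero-pad so output keeps the input size
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Plain cross-correlation of the kernel with the local window.
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * weights) + bias
    return sigmoid(out)  # squash to (0, 1) so it can weight the fusion

att = contextual_spatial_attention(np.random.rand(5, 5), np.random.rand(3, 3))
```

The resulting map would then weight the high- and low-frequency hidden states inside the SRU, so that edge-like context pushes the fusion toward the fine-detail branch.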
Selective-Stereo has been tested on several stereo benchmarks, including Scene Flow, KITTI 2012, KITTI 2015, ETH3D, and Middlebury. On Scene Flow, Selective-RAFT achieves a state-of-the-art EPE of 0.47, while Selective-IGEV improves this further to 0.44. On the KITTI 2012, KITTI 2015, ETH3D, and Middlebury leaderboards, Selective-IGEV ranks first among all published methods.
The method's effectiveness is demonstrated through ablation studies, which show that the proposed modules significantly improve performance. The SRU and CSA modules are shown to be effective in different settings and can be applied to various iterative stereo matching methods. The method's ability to adaptively select information based on different image regions is validated through qualitative results, which show that Selective-Stereo outperforms existing methods in detailed and weak texture regions.
Taken together, the quantitative evaluations across Scene Flow, KITTI 2012, KITTI 2015, ETH3D, and Middlebury confirm that Selective-Stereo achieves state-of-the-art performance, with Selective-IGEV ranking first on the KITTI, ETH3D, and Middlebury leaderboards.