MVSNet: Depth Inference for Unstructured Multi-view Stereo

17 Jul 2018 | Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan
MVSNet is an end-to-end deep learning architecture for depth map inference from unstructured multi-view images. The network first extracts deep visual features from the input images, then builds a 3D cost volume on the reference camera frustum using differentiable homography warping. To adapt to an arbitrary number N of input views, a variance-based cost metric maps the N warped feature volumes into a single cost volume. Multi-scale 3D convolutions regularize this volume, an initial depth map is regressed from it, and the result is refined with the reference image to generate the final output. The key innovation is encoding the camera parameters into the differentiable homography, which builds the cost volume directly on the camera frustum and thereby bridges 2D feature extraction and 3D cost regularization.
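To make the pipeline above concrete, here is a minimal PyTorch sketch of the cost-volume construction and depth regression. It is an illustration under stated assumptions, not the authors' released implementation: the function names and tensor shapes are hypothetical, the per-depth warping is written in the back-project/re-project form that is equivalent to a planar homography for fronto-parallel depth planes, and the 3D CNN that turns the cost volume into a probability volume is omitted.

```python
# Minimal sketch of MVSNet-style cost-volume construction and depth
# regression (PyTorch). Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def homography_warp(src_feat, K_src, K_ref, R, t, depth_values):
    """Warp source-view features into the reference frustum at each depth.

    src_feat:     [B, C, H, W] features of one source image
    K_src, K_ref: [B, 3, 3] camera intrinsics
    R, t:         [B, 3, 3], [B, 3, 1] relative pose (reference -> source)
    depth_values: [B, D] fronto-parallel depth hypotheses
    returns:      [B, C, D, H, W] warped feature volume
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]

    # Pixel grid of the reference view in homogeneous coordinates.
    y, x = torch.meshgrid(
        torch.arange(H, dtype=src_feat.dtype, device=src_feat.device),
        torch.arange(W, dtype=src_feat.dtype, device=src_feat.device),
        indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).reshape(1, 3, -1)

    # Back-project to each depth plane, transform into the source view,
    # and re-project; for fronto-parallel planes this equals applying
    # the per-depth planar homography.
    cam = torch.matmul(torch.inverse(K_ref), pix)            # [B, 3, H*W]
    cam = cam.unsqueeze(1) * depth_values.view(B, D, 1, 1)   # [B, D, 3, H*W]
    src = torch.matmul(R.unsqueeze(1), cam) + t.view(B, 1, 3, 1)
    src = torch.matmul(K_src.unsqueeze(1), src)              # [B, D, 3, H*W]
    xy = src[:, :, :2] / src[:, :, 2:3].clamp(min=1e-6)      # [B, D, 2, H*W]

    # Normalize to [-1, 1] and bilinearly sample (differentiable).
    gx = 2.0 * xy[:, :, 0] / (W - 1) - 1.0
    gy = 2.0 * xy[:, :, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, align_corners=True)
    return warped.view(B, C, D, H, W)

def variance_cost_volume(ref_feat, warped_feats):
    """Variance-based cost metric over all N views (reference + sources)."""
    D = warped_feats[0].shape[2]
    volumes = [ref_feat.unsqueeze(2).expand(-1, -1, D, -1, -1)] + warped_feats
    stacked = torch.stack(volumes, dim=0)       # [N, B, C, D, H, W]
    return stacked.var(dim=0, unbiased=False)   # population variance over views

def soft_argmin_depth(prob_volume, depth_values):
    """Expected depth along the hypothesis axis (soft argmin).

    prob_volume: [B, D, H, W], softmax-normalized over D by the
    (omitted) 3D regularization network.
    """
    B, D = depth_values.shape
    return torch.sum(prob_volume * depth_values.view(B, D, 1, 1), dim=1)
```

Because the variance is symmetric in its inputs, the same network handles any number of views without architectural changes, which is what lets MVSNet accept arbitrary N-view inputs.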
The network is trained on the DTU dataset, which provides ground-truth point clouds with normal information. Evaluated on DTU, MVSNet outperforms previous state-of-the-art methods in completeness and overall quality while being significantly faster, running in about 230 seconds per scan on a single 16 GB Tesla P100 GPU. Without any fine-tuning, it also ranked first on the Tanks and Temples benchmark (before April 18, 2018), demonstrating strong generalization to complex outdoor scenes. The final refinement step uses the reference image to improve the boundary accuracy of the estimated depth maps.
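The refinement step mentioned above can be pictured as a small residual network that concatenates the reference image with the initial depth map and predicts a depth residual. The sketch below is a hypothetical minimal version; the layer count and channel widths are assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DepthRefinement(nn.Module):
    """Residual refinement of the initial depth map, guided by the
    reference image (layer sizes are illustrative assumptions)."""
    def __init__(self):
        super().__init__()
        # 3 RGB channels + 1 depth channel in; predict a 1-channel residual.
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1))  # no activation on the residual

    def forward(self, ref_img, init_depth):
        # ref_img: [B, 3, H, W]; init_depth: [B, 1, H, W]
        x = torch.cat([ref_img, init_depth], dim=1)
        # Adding a learned residual keeps the coarse depth intact while
        # letting image edges sharpen the depth boundaries.
        return init_depth + self.net(x)
```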