2024 | Jason Y. Zhang, Amy Lin*, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani
This paper proposes a novel approach to camera pose estimation based on a distributed ray representation, which treats a camera as a bundle of rays. Unlike traditional methods that predict global camera parameters, this approach models each camera pose as a set of rays passing through image patches. This representation allows tighter coupling with spatial image features, improving pose estimation accuracy. The method starts from a regression model that maps image patches to their corresponding rays, and is extended to a denoising diffusion model to handle the uncertainty inherent in sparse-view pose inference.

The proposed methods achieve state-of-the-art performance on the CO3D dataset and generalize to unseen object categories and in-the-wild captures. The ray-based representation is naturally suited to set-level transformers and allows efficient inference. Evaluated on rotation and camera-center accuracy, the method shows significant improvements over existing approaches, and the ray diffusion model recovers multiple plausible modes under uncertainty, outperforming the regression-based variant. The approach is also demonstrated on self-captured data, showing its effectiveness in real-world scenarios.

The paper discusses related work on structure-from-motion, SLAM, and sparse-view pose estimation, and presents a detailed method section describing the ray-based camera parametrization and the denoising diffusion model. The experiments show that the proposed method outperforms existing approaches in accuracy and generalization, and the paper concludes that the ray-based representation offers a more effective way to recover precise camera poses in sparse-view settings.
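To make the "camera as a bundle of rays" idea concrete, the sketch below converts classical camera parameters into one Plücker ray (direction plus moment) per patch center. This is an illustrative reconstruction, not the paper's code: the function name `patch_rays`, the 4×4 patch grid, and the pinhole conventions (world-to-camera rotation `R`, translation `t`) are my assumptions.

```python
import numpy as np

def patch_rays(K, R, t, H=4, W=4):
    """Represent a camera as a bundle of Plucker rays, one per patch center.

    Illustrative sketch (not the paper's implementation).
    K: 3x3 intrinsics; R: 3x3 world-to-camera rotation; t: translation.
    Returns an (H*W, 6) array of [direction, moment] Plucker coordinates.
    """
    # Camera center in world coordinates: c = -R^T t
    c = -R.T @ t
    # Pixel coordinates of patch centers on an H x W grid
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)  # (3, HW)
    # Unproject pixels to unit ray directions in world coordinates
    d = R.T @ (np.linalg.inv(K) @ pix)          # (3, HW)
    d = d / np.linalg.norm(d, axis=0)           # normalize each direction
    # Plucker moment m = c x d encodes the ray's offset from the origin,
    # so the 6-vector (d, m) identifies the ray independently of any point on it
    m = np.cross(c[:, None], d, axis=0)
    return np.concatenate([d, m], axis=0).T     # (HW, 6)
```

Because each ray is tied to a specific patch, the resulting (HW, 6) array lines up one-to-one with spatial image features, which is what makes the representation a natural fit for set-level transformers.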
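The denoising diffusion model over rays can be sketched as a standard DDPM-style reverse process applied to the per-patch ray parameters. Everything below is a hedged illustration of that generic recipe, not the paper's architecture: `denoise_rays` is a hypothetical stand-in for the learned transformer denoiser, and the noise schedule and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_rays(x_noisy, image_features):
    """Hypothetical stand-in for the learned denoiser (the paper conditions a
    transformer on patch features); here it simply shrinks toward zero."""
    return 0.9 * x_noisy

def sample_rays(image_features, n_rays=16, dim=6, T=50):
    """DDPM-style reverse process over per-patch ray parameters (a sketch).

    Starts from Gaussian noise and iteratively denoises; at each step the
    network predicts the clean rays x0, and we sample from the posterior
    q(x_{s-1} | x_s, x0_hat). Different noise seeds can land in different
    plausible modes, which is how diffusion captures pose ambiguity.
    """
    betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal((n_rays, dim))      # start from pure noise
    for s in reversed(range(T)):
        x0_hat = denoise_rays(x, image_features)
        if s > 0:
            # Posterior mean is a weighted blend of the x0 prediction and x_s
            coef0 = np.sqrt(abar[s - 1]) * betas[s] / (1.0 - abar[s])
            coefs = np.sqrt(alphas[s]) * (1.0 - abar[s - 1]) / (1.0 - abar[s])
            mean = coef0 * x0_hat + coefs * x
            sigma = np.sqrt(betas[s] * (1.0 - abar[s - 1]) / (1.0 - abar[s]))
            x = mean + sigma * rng.standard_normal(x.shape)
        else:
            x = x0_hat                           # final step: no added noise
    return x
```

Under this view, regression corresponds to a single deterministic prediction, while the diffusion sampler can return different ray bundles across runs, matching the paper's observation that it recovers multiple plausible modes under uncertainty.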