This paper introduces ChronoDepth, a video depth estimator that prioritizes temporal consistency by leveraging video generation priors. The main challenge in video depth estimation is achieving both spatial accuracy and temporal consistency across frames. Instead of developing a depth estimator from scratch, the authors reformulate the task as a conditional generation problem, allowing them to utilize the prior knowledge embedded in existing video generation models. This approach reduces learning difficulty and enhances generalizability.
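To make the reformulation concrete, below is a minimal sketch of depth prediction as conditional denoising, assuming an SVD-style latent diffusion backbone. The `denoiser` stand-in, the channel-concatenation conditioning, and the crude Euler-style update are all illustrative assumptions, not ChronoDepth's actual sampler.

```python
import torch

# Minimal sketch: depth is not regressed directly. Instead, a depth latent is
# iteratively denoised while the clean video latent is held fixed as the
# condition. `denoiser` stands in for a fine-tuned SVD-style U-Net.
def sample_depth(denoiser, video_latent, num_steps=10):
    depth_latent = torch.randn_like(video_latent)      # start from pure noise
    for t in reversed(range(num_steps)):
        # Condition on the input video by concatenating along channels.
        x = torch.cat([depth_latent, video_latent], dim=1)
        eps = denoiser(x, t)                           # predicted noise
        depth_latent = depth_latent - eps / num_steps  # crude update, not DDIM
    return depth_latent

# Toy denoiser so the sketch runs end-to-end.
toy = lambda x, t: 0.1 * x[:, : x.shape[1] // 2]
depth = sample_depth(toy, torch.randn(1, 4, 16, 16))   # (batch, ch, h, w)
```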
The authors study how to adapt the Stable Video Diffusion (SVD) model to predict reliable depth from input videos using a mixture of image and video depth datasets. They propose a procedural training strategy, first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen. This strategy yields the best results in terms of both spatial accuracy and temporal consistency.
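A toy sketch of this two-stage schedule follows, assuming a model whose blocks separate spatial (per-frame) from temporal (cross-frame) layers. The `Block` module and the MSE surrogate loss are stand-ins for the real SVD U-Net and its denoising objective; only the freeze-then-train pattern reflects the paper's strategy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an SVD-style block: a spatial layer followed by a
# temporal layer. The real U-Net interleaves many such pairs.
class Block(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)   # acts per-frame
        self.temporal = nn.Linear(dim, dim)  # mixes information across frames

    def forward(self, x):
        return self.temporal(self.spatial(x))

def train_stage(model, trainable, data, steps=100, lr=1e-4):
    # Freeze everything, then unfreeze only the chosen sub-layers.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in trainable:
        p.requires_grad_(True)
    optim = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(steps):
        x, target = data()
        loss = F.mse_loss(model(x), target)  # surrogate for the diffusion loss
        optim.zero_grad()
        loss.backward()
        optim.step()

model = Block()
data = lambda: (torch.randn(4, 8), torch.randn(4, 8))
# Stage 1: optimize the spatial layers only.
train_stage(model, list(model.spatial.parameters()), data)
# Stage 2: optimize the temporal layers with the spatial layers frozen.
train_stage(model, list(model.temporal.parameters()), data)
```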
For inference, the authors adopt a sliding-window strategy in which previously predicted depth frames guide the prediction of subsequent frames, yielding consistent depth estimation over arbitrarily long videos. They observe that a one-frame overlap between consecutive windows already produces favorable results, balancing efficiency and performance.
The authors also compare two inference strategies: separate inference, where the video is divided into non-overlapping clips that are predicted independently, and temporal inpainting inference, where the frames of each new clip are inpainted conditioned on the previous clip's predictions. The latter enhances temporal consistency.
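The sketch below shows how the two modes can differ only in the window overlap: `overlap=0` gives separate inference, while `overlap>=1` passes the previous window's predictions to the sampler as known frames to match, i.e. temporal inpainting. The `sample_clip` signature and the toy sampler are assumptions for illustration; a real sampler would clamp the known frames at every denoising step.

```python
import torch

def sliding_window_depth(sample_clip, video, window=8, overlap=1):
    """Estimate depth for a long video clip by clip.

    overlap=0  -> separate inference (independent, non-overlapping clips).
    overlap>=1 -> temporal inpainting (new frames conditioned on the
                  previous window's predictions for the shared frames).
    """
    T = video.shape[0]
    depth = sample_clip(video[:window], known=None)  # first clip: no guidance
    start = window                                   # frames predicted so far
    while start < T:
        lo = start - overlap
        clip = video[lo:lo + window]
        known = depth[lo:start] if overlap > 0 else None
        pred = sample_clip(clip, known=known)        # inpaint the new frames
        depth = torch.cat([depth, pred[overlap:]], dim=0)
        start += window - overlap
    return depth

# Toy per-clip sampler (ignores `known`) so the sketch runs end-to-end.
video = torch.randn(20, 1, 8, 8)                     # 20 frames
depth = sliding_window_depth(lambda c, known=None: 0.5 * c, video)
assert depth.shape[0] == video.shape[0]
```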
Extensive experimental results demonstrate that ChronoDepth outperforms existing alternatives, particularly in terms of temporal consistency of the estimated depth. The authors also highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis.
The project page is available at https://jhaoshao.github.io/ChronoDepth/.