This paper introduces ChronoDepth, a video depth estimator that prioritizes temporal consistency by leveraging video generation priors. The main challenge in video depth estimation is achieving both spatial accuracy and temporal consistency across frames. Instead of developing a depth estimator from scratch, the authors reformulate the task as a conditional generation problem, allowing them to utilize the prior knowledge embedded in existing video generation models. This approach reduces learning difficulty and enhances generalizability.
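To make the reformulation concrete, below is a minimal sketch of depth prediction as conditional denoising, assuming an SVD-style latent diffusion backbone. The `denoiser` stand-in, the channel-concatenation conditioning, and the crude Euler-style update are all illustrative assumptions, not ChronoDepth's actual sampler.

```python
import torch

# Minimal sketch: depth is not regressed directly. Instead, a depth latent is
# iteratively denoised while the clean video latent is held fixed as the
# condition. `denoiser` stands in for a fine-tuned SVD-style U-Net.
def sample_depth(denoiser, video_latent, num_steps=10):
    depth_latent = torch.randn_like(video_latent)      # start from pure noise
    for t in reversed(range(num_steps)):
        # Condition on the input video by concatenating along channels.
        x = torch.cat([depth_latent, video_latent], dim=1)
        eps = denoiser(x, t)                           # predicted noise
        depth_latent = depth_latent - eps / num_steps  # crude update, not DDIM
    return depth_latent

# Toy denoiser so the sketch runs end-to-end.
toy = lambda x, t: 0.1 * x[:, : x.shape[1] // 2]
depth = sample_depth(toy, torch.randn(1, 4, 16, 16))   # (batch, ch, h, w)
```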
The authors study how to adapt the Stable Video Diffusion (SVD) model to predict reliable depth from input videos using a mixture of image and video depth datasets. They propose a procedural training strategy, first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen. This strategy yields the best results in terms of both spatial accuracy and temporal consistency.
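A toy sketch of this two-stage schedule follows, assuming a model whose blocks separate spatial (per-frame) from temporal (cross-frame) layers. The `Block` module and the MSE surrogate loss are stand-ins for the real SVD U-Net and its denoising objective; only the freeze-then-train pattern reflects the paper's strategy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an SVD-style block: a spatial layer followed by a
# temporal layer. The real U-Net interleaves many such pairs.
class Block(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)   # acts per-frame
        self.temporal = nn.Linear(dim, dim)  # mixes information across frames

    def forward(self, x):
        return self.temporal(self.spatial(x))

def train_stage(model, trainable, data, steps=100, lr=1e-4):
    # Freeze everything, then unfreeze only the chosen sub-layers.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in trainable:
        p.requires_grad_(True)
    optim = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(steps):
        x, target = data()
        loss = F.mse_loss(model(x), target)  # surrogate for the diffusion loss
        optim.zero_grad()
        loss.backward()
        optim.step()

model = Block()
data = lambda: (torch.randn(4, 8), torch.randn(4, 8))
# Stage 1: optimize the spatial layers only.
train_stage(model, list(model.spatial.parameters()), data)
# Stage 2: optimize the temporal layers with the spatial layers frozen.
train_stage(model, list(model.temporal.parameters()), data)
```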
For inference, the authors adopt a sliding-window strategy in which previously predicted depth frames guide the prediction of subsequent frames, yielding consistent depth estimation over arbitrarily long videos. They observe that a one-frame overlap between consecutive windows already produces favorable results, balancing efficiency and performance.
The authors also compare two inference strategies: separate inference, where the video is divided into non-overlapping clips that are predicted independently, and temporal inpainting inference, where the frames of each new clip are inpainted conditioned on the previous clip's predictions. The latter enhances temporal consistency.
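The sketch below shows how the two modes can differ only in the window overlap: `overlap=0` gives separate inference, while `overlap>=1` passes the previous window's predictions to the sampler as known frames to match, i.e. temporal inpainting. The `sample_clip` signature and the toy sampler are assumptions for illustration; a real sampler would clamp the known frames at every denoising step.

```python
import torch

def sliding_window_depth(sample_clip, video, window=8, overlap=1):
    """Estimate depth for a long video clip by clip.

    overlap=0  -> separate inference (independent, non-overlapping clips).
    overlap>=1 -> temporal inpainting (new frames conditioned on the
                  previous window's predictions for the shared frames).
    """
    T = video.shape[0]
    depth = sample_clip(video[:window], known=None)  # first clip: no guidance
    start = window                                   # frames predicted so far
    while start < T:
        lo = start - overlap
        clip = video[lo:lo + window]
        known = depth[lo:start] if overlap > 0 else None
        pred = sample_clip(clip, known=known)        # inpaint the new frames
        depth = torch.cat([depth, pred[overlap:]], dim=0)
        start += window - overlap
    return depth

# Toy per-clip sampler (ignores `known`) so the sketch runs end-to-end.
video = torch.randn(20, 1, 8, 8)                     # 20 frames
depth = sliding_window_depth(lambda c, known=None: 0.5 * c, video)
assert depth.shape[0] == video.shape[0]
```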
Extensive experimental results demonstrate that ChronoDepth outperforms existing alternatives, particularly in terms of temporal consistency of the estimated depth. The authors also highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis.
The project page is available at https://jhaoshao.github.io/ChronoDepth/.