Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion


July 27–August 1, 2024 | Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely*, Gordon Wetzstein*
This paper presents Streetscapes, a method for generating long sequences of views through a synthesized city-scale scene. The generation is conditioned on language input (e.g., city name, weather) and on an underlying map/layout that hosts the desired camera trajectory. Unlike recent video generation and 3D view synthesis models, the method scales to much longer camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. It builds on recent video diffusion models, used within an autoregressive framework that scales easily to long sequences, and it introduces a new temporal imputation method that prevents the autoregressive process from drifting away from the distribution of realistic city imagery.

The system is trained on a compelling and novel data source: posed imagery from Google Street View, together with contextual map data. This allows users to generate city views conditioned on any desired city layout, with controllable camera poses. The generated street views span long camera paths whose layout is controlled by the map, and the system can produce different geographic styles as well as different weather conditions or times of day, all controlled by text prompts.

Concretely, the system first trains a diffusion model that jointly generates two frames by iteratively denoising two random noise images. This model also takes as input conditioning information rendered from the given layout for the two camera views. The goal, however, is to generate many consistent frames, not just two. To that end, the pre-trained two-frame model is run in an autoregressive temporal imputation mode, without any retraining: the two random noise images at the model's input are replaced by noised versions of (1) the frame generated for the current camera view and (2) that same frame warped into the next camera view. Given the generated frames, an optional 3D reconstruction step can recover a 3D scene model.

The system is evaluated on several generative tasks, including long-range consistent street view generation and perpetual view generation, and it produces high-quality, consistent street views that are more realistic than those of existing methods. Its flexible control over scene layout, camera poses, and scene conditions also enables numerous creative scene generation applications. To the authors' knowledge, this is the first work to apply imputation techniques, specifically temporal imputation, to autoregressive video generation with a two-frame diffusion model.
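The summary describes the temporal imputation sampling loop only in prose. The sketch below is an illustrative reconstruction, not the authors' implementation: the denoiser, warp function, noise schedule, and all names here are hypothetical stand-ins, and the real method's schedule and conditioning are more elaborate.

import numpy as np

def denoise_two_frames(x_curr, x_next, layout_cond, t):
    # Stub for the pre-trained two-frame diffusion denoiser: a real model
    # would predict and remove noise from both frames given the layout
    # conditioning at noise level t.
    return x_curr, x_next

def warp_to_next_view(frame, pose_curr, pose_next):
    # Stub: a real warp reprojects pixels from the current camera view
    # into the next one using estimated depth and the two camera poses.
    return frame

def add_noise(frame, t, rng):
    # Forward-diffuse a clean frame to noise level t in [0, 1] using a toy
    # variance-preserving schedule (t = 1 gives pure noise).
    alpha = 1.0 - t
    return np.sqrt(alpha) * frame + np.sqrt(1.0 - alpha) * rng.standard_normal(frame.shape)

def generate_sequence(first_frame, poses, layouts, n_steps=50, seed=0):
    # Autoregressive temporal imputation: instead of denoising two frames
    # from pure noise, the slot for the current view is filled at every
    # step with a re-noised copy of the frame already generated, while the
    # next-view slot starts from a noised warp of that frame and is
    # progressively denoised.
    rng = np.random.default_rng(seed)
    frames = [first_frame]
    for i in range(len(poses) - 1):
        curr = frames[-1]
        warped = warp_to_next_view(curr, poses[i], poses[i + 1])
        x_next = add_noise(warped, 1.0, rng)  # initialize from the noised warp
        for step in range(n_steps, 0, -1):
            t = step / n_steps
            x_curr = add_noise(curr, t, rng)  # impute: re-noise the known frame
            _, x_next = denoise_two_frames(x_curr, x_next, layouts[i + 1], t)
        frames.append(x_next)
    return frames

# Example: 4 views along a path, with dummy 64x64 frames and placeholder
# poses/layouts (ignored by the stubs above).
frames = generate_sequence(np.zeros((64, 64, 3)), poses=[None] * 4, layouts=[None] * 4)

Re-noising the already-generated frame at each denoising step, rather than conditioning on it directly, is what "imputation" refers to here: the known frame keeps the joint two-frame denoising anchored to content the model has already committed to.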