July 27-August 1, 2024, Denver, CO, USA | Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely*, Gordon Wetzstein*
The paper presents a method for generating *Streetscapes*: long sequences of consistent street views through synthesized urban scenes. The system is conditioned on language input (e.g., city name, weather) and on an underlying map/layout that hosts the desired trajectory. Compared to existing video generation models, the method scales to longer-range camera trajectories spanning several city blocks while maintaining visual quality and consistency. The key contributions are a layout-conditioned generation approach, a motion module for consistent two-frame generation, and an autoregressive *temporal imputation* technique that ensures long-range consistency. The system is trained on Google Street View imagery and corresponding map data, leveraging the coarse-grained but globally extensive nature of this data to achieve robust and controllable generation. The results demonstrate high-quality, realistic Streetscapes with flexible control over scene layout, camera poses, and scene conditions, showcasing the system's effectiveness in various applications.
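
To make the autoregressive idea concrete, here is a minimal sketch of how a two-frame generator might be rolled out into a long sequence. It is not the authors' implementation. It assumes that temporal imputation means overwriting the "known frame" slot with a re-noised copy of the previously generated frame at every denoising step, which is one plausible reading of the summary above. All names (`toy_denoise_step`, `noise_to_level`, `generate_next_frame`, `generate_sequence`) are hypothetical, the linear noise schedule is an assumption, and the denoiser is a toy stand-in for a learned, layout- and text-conditioned model.

```python
import numpy as np


def toy_denoise_step(pair, t):
    """Hypothetical stand-in for a learned two-frame denoiser.

    It pulls both frames toward their mean, more strongly as t shrinks,
    which loosely mimics a motion module keeping two frames consistent.
    """
    mean = pair.mean(axis=0, keepdims=True)
    blend = 1.0 / (t + 1)
    return (1.0 - blend) * pair + blend * mean


def noise_to_level(frame, t, num_steps, rng):
    """Add Gaussian noise matching step t (assumed linear schedule)."""
    sigma = t / num_steps
    return frame + sigma * rng.standard_normal(frame.shape)


def generate_next_frame(prev_frame, num_steps, rng):
    """Denoise a (known, new) frame pair while imputing the known frame."""
    pair = rng.standard_normal((2,) + prev_frame.shape)
    for t in range(num_steps, 0, -1):
        # Imputation (assumed form): replace slot 0 with the known frame,
        # noised to the current level, before each denoising step.
        pair[0] = noise_to_level(prev_frame, t, num_steps, rng)
        pair = toy_denoise_step(pair, t)
    return pair[1]


def generate_sequence(first_frame, num_frames, num_steps=20, seed=0):
    """Autoregressive rollout: each new frame is conditioned on the last."""
    rng = np.random.default_rng(seed)
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(generate_next_frame(frames[-1], num_steps, rng))
    return np.stack(frames)


if __name__ == "__main__":
    start = np.zeros((8, 8))  # placeholder for an initial street view
    video = generate_sequence(start, num_frames=6)
    print(video.shape)
```

In a real system the toy denoiser would be replaced by the trained diffusion model, and each step would also receive the map/layout rendering and text prompt for the current camera pose. The point of the sketch is only the control flow: a fixed two-frame model can produce an arbitrarily long trajectory because each step reuses the previous output as its known frame.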