Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

1 Apr 2024 | Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Marc Pollefeys, Martin R. Oswald
Sat2Scene is a method for generating 3D urban scenes directly from satellite images using diffusion models and neural rendering. Generating photorealistic street-view imagery and cross-view urban scenes from satellite input is challenging because of the significant viewpoint difference between the aerial and ground perspectives and the large scale of city-level scenes.

The proposed architecture combines diffusion models with 3D sparse representations and neural rendering: a 3D diffusion model generates texture colors at the point level, and the colored points are then transformed into a scene representation that can be rendered from arbitrary views with high single-frame quality and strong inter-frame consistency.
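The pipeline is only described at a high level above, so the following is a minimal, hypothetical sketch of that flow in PyTorch: per-point colors are sampled with a simplified DDPM-style reverse process, and the colored points are then splatted through a pinhole camera to render a view. All names (PointDenoiser, sample_point_colors, splat_render), shapes, and the noise schedule are assumptions made for illustration; Sat2Scene itself uses a sparse 3D diffusion backbone and neural rendering rather than naive point splatting.

```python
# Hypothetical sketch of a Sat2Scene-style pipeline (names and shapes assumed):
# 1) a 3D diffusion model denoises per-point colors on a point cloud,
# 2) the colored points are projected into a camera to render a view.
import torch
import torch.nn as nn

class PointDenoiser(nn.Module):
    """Toy stand-in for the point-level 3D diffusion backbone."""
    def __init__(self, hidden=128):
        super().__init__()
        # input: xyz (3) + noisy rgb (3) + timestep embedding (1)
        self.net = nn.Sequential(
            nn.Linear(7, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),                # predicted noise on the colors
        )

    def forward(self, xyz, noisy_rgb, t):
        t_emb = t.expand(xyz.shape[0], 1).float()
        return self.net(torch.cat([xyz, noisy_rgb, t_emb], dim=-1))

@torch.no_grad()
def sample_point_colors(model, xyz, steps=50):
    """Simplified DDPM-style reverse process over per-point RGB."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    rgb = torch.randn(xyz.shape[0], 3)           # start from pure noise
    for t in reversed(range(steps)):
        eps = model(xyz, rgb, torch.tensor([t]))
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        rgb = (rgb - coef * eps) / alphas[t].sqrt()
        if t > 0:
            rgb = rgb + betas[t].sqrt() * torch.randn_like(rgb)
    return rgb.clamp(-1, 1) * 0.5 + 0.5          # map colors to [0, 1]

@torch.no_grad()
def splat_render(xyz, rgb, K, cam_pose, hw=(128, 128)):
    """Project colored points through a pinhole camera and splat to pixels."""
    H, W = hw
    pts_h = torch.cat([xyz, torch.ones(xyz.shape[0], 1)], dim=-1)
    cam = (cam_pose @ pts_h.T).T                 # world -> camera coordinates
    in_front = cam[:, 2] > 1e-3
    cam, col = cam[in_front], rgb[in_front]
    uv = (K @ cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().long()
    img = torch.zeros(H, W, 3)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    img[uv[valid, 1], uv[valid, 0]] = col[valid]
    return img

# Usage: stand-in scene geometry, then generate per-point colors and render.
xyz = torch.rand(4096, 3) * 10.0                 # placeholder point cloud
model = PointDenoiser()
rgb = sample_point_colors(model, xyz)
K = torch.tensor([[100., 0., 64.], [0., 100., 64.], [0., 0., 1.]])
cam_pose = torch.eye(4)[:3]                      # camera at origin, looking +z
view = splat_render(xyz, rgb, K, cam_pose)       # (128, 128, 3) image tensor
```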
Experiments on two city-scale datasets, HoliCity and OmniCity, show that the method outperforms existing approaches in overall video quality and temporal consistency. Evaluation uses the quantitative metrics FID, KID, FVD, and KVD alongside qualitative assessments; the generated street-view sequences exhibit robust temporal consistency, and the model generalizes well to new datasets. Training runs on a single NVIDIA Tesla A100 GPU with 40 GB of memory.

Because the scene representation can be rendered from arbitrary viewpoints, the method adapts readily to practical applications. Its ability to produce photorealistic frames that remain consistent across a sequence makes it a promising approach for urban scene generation from satellite imagery.
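As a rough illustration of the image-level metrics mentioned above, the snippet below shows how FID and KID could be computed with the torchmetrics library on batches of real and generated frames. This is not the paper's evaluation code: the feature extractor settings, sample counts, and the video metrics (FVD/KVD, which require a video feature network such as I3D) are not specified in this summary, so the tensors here are placeholders.

```python
# Hypothetical evaluation snippet: image-level FID/KID with torchmetrics
# (requires torchmetrics with the torch-fidelity backend installed).
# Images are assumed to be uint8 tensors of shape (N, 3, H, W).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # placeholder real frames
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # placeholder generated frames

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=32)

fid.update(real, real=True)
fid.update(fake, real=False)
kid.update(real, real=True)
kid.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```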