21 Aug 2024 | Joseph Tung*1, Gene Chou*1, Ruojin Cai1, Guandao Yang2, Kai Zhang3, Gordon Wetzstein2, Bharath Hariharan1, and Noah Snavely1
**MegaScenes: Scene-Level View Synthesis at Scale**
**Authors:** Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely
**Institutions:** Cornell University, Stanford University, Adobe Research
**Abstract:**
Scene-level novel view synthesis (NVS) is crucial for various vision and graphics applications. Pose-conditioned diffusion models have shown significant progress by extracting 3D information from 2D foundation models, but they are limited by the lack of scene-level training data. Common datasets consist either of isolated objects or of object-centric scenes with limited pose distributions. This paper introduces MegaScenes, a large-scale scene-level dataset created from Internet photo collections, containing over 100K structure-from-motion (SfM) reconstructions from around the world. MegaScenes includes a diverse array of scenes, such as minarets, building interiors, statues, bridges, towers, religious buildings, and natural landscapes, captured under varying conditions. The dataset addresses challenges such as varying lighting and transient objects, and is used to improve NVS methods. Extensive experiments validate the effectiveness of both the dataset and the method on generating in-the-wild scenes.
**Keywords:** Novel view synthesis of scenes, Pose-conditioned diffusion models, Dataset of Internet photo collections
**Introduction:**
Scene-level NVS is fundamental for many vision and graphics applications. Current state-of-the-art methods take 2D diffusion models trained on large Internet datasets and fine-tune them on multiview images with camera poses. However, these methods are limited to object-level synthesis and struggle with realistic, in-the-wild scenes because existing scene-level datasets are small and lack diversity. MegaScenes, sourced from Wikimedia Commons, provides a diverse and large-scale 3D dataset, covering a wide range of scenes and categories. The dataset is used to train and evaluate NVS models, demonstrating improved performance on multiple benchmarks.
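As an illustration of what "multiview images with camera poses" can provide to a pose-conditioned model, the sketch below computes the relative pose between a source and a target view. It assumes world-to-camera poses of the form x_cam = R x_world + t. The function names and the choice of relative pose as the conditioning signal are assumptions made for illustration, not a description of the paper's exact formulation.

```python
import math

Matrix = list[list[float]]
Vector = list[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a: Matrix) -> Matrix:
    """Transpose of a 3x3 matrix (the inverse, for a rotation)."""
    return [list(row) for row in zip(*a)]


def matvec(a: Matrix, v: Vector) -> Vector:
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def relative_pose(r_src: Matrix, t_src: Vector, r_tgt: Matrix, t_tgt: Vector):
    """Return (R_rel, t_rel) such that x_tgt = R_rel @ x_src + t_rel.

    Assumes both input poses map world coordinates to camera coordinates.
    """
    r_rel = matmul(r_tgt, transpose(r_src))
    rotated = matvec(r_rel, t_src)
    t_rel = [t_tgt[i] - rotated[i] for i in range(3)]
    return r_rel, t_rel


def rot_y(angle: float) -> Matrix:
    """Rotation about the y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Made-up poses and a made-up world point, used only to check consistency.
r_src, t_src = rot_y(0.3), [0.5, -1.0, 2.0]
r_tgt, t_tgt = rot_y(1.1), [1.0, 0.0, 0.0]
point = [0.2, 0.4, 3.0]

x_src = [a + b for a, b in zip(matvec(r_src, point), t_src)]
x_tgt = [a + b for a, b in zip(matvec(r_tgt, point), t_tgt)]
r_rel, t_rel = relative_pose(r_src, t_src, r_tgt, t_tgt)
x_tgt_from_src = [a + b for a, b in zip(matvec(r_rel, x_src), t_rel)]
```

Mapping the point through the source camera and then through the relative pose gives the same coordinates as mapping it through the target camera directly, which is the property a relative-pose conditioning signal relies on.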
**Dataset Characteristics:**
- **Wikimedia Commons Categories as Scenes:** Each scene in MegaScenes is derived from a single Wikimedia Commons category.
- **Images, Subcategorization, and Licensing:** Images within a scene are classified into subcategories, enabling future applications and data cleaning.
- **3D Data:** SIFT keypoints, descriptors, and two-view geometries are provided for each scene.
- **Class Hierarchy:** A hierarchical class label system is included for each scene, aiding in dataset curation.
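
To make the per-scene contents listed above concrete, here is a minimal sketch of how one scene record might be organized in memory. All names (`SceneRecord`, `ImageEntry`, the field names, and the sample category) are hypothetical and chosen for illustration. They are not the dataset's actual schema or file layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ImageEntry:
    """One image in a scene (hypothetical fields)."""
    filename: str
    subcategory: str  # subcategory the image was found under


@dataclass
class SceneRecord:
    """One scene, tied to a single Wikimedia Commons category (hypothetical schema)."""
    category: str
    class_path: list[str]  # hierarchical class labels, most general first
    images: list[ImageEntry] = field(default_factory=list)

    def images_by_subcategory(self) -> dict[str, list[str]]:
        """Group image filenames by subcategory, e.g. to drop unwanted subsets."""
        groups: dict[str, list[str]] = defaultdict(list)
        for image in self.images:
            groups[image.subcategory].append(image.filename)
        return dict(groups)


# Made-up example data, for illustration only.
scene = SceneRecord(
    category="Example Tower",
    class_path=["structure", "building", "tower"],
    images=[
        ImageEntry("a.jpg", "Exterior"),
        ImageEntry("b.jpg", "Exterior"),
        ImageEntry("c.jpg", "Interior"),
    ],
)
groups = scene.images_by_subcategory()
```

Grouping by subcategory is one way the subcategory labels could support data cleaning, for example by excluding a whole subcategory before reconstruction.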
**Dataset Curation:**
- **Identifying Scenes:** Scenes are identified using Wikidata class hierarchies.
- **Downloading Images:** Images are downloaded from identified scenes, ensuring sufficient visual overlap and similar lighting conditions.
- **Reconstructing Scenes:** SfM is used to reconstruct scenes, and reconstructions are cleaned using the Doppelgangers pipeline.
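
The scene-identification step can be pictured as a reachability check over a class hierarchy: an entity is a candidate scene if its class falls under a chosen root class. The sketch below uses a toy, hand-written "subclass of" table. The class names, entity names, and function names are all made up for illustration, and the sketch does not query Wikidata or reproduce the authors' actual selection rules.

```python
from collections import deque

# Toy "subclass of" edges (child -> parents). Made-up data for illustration.
SUBCLASS_OF = {
    "minaret": ["tower"],
    "tower": ["building"],
    "bridge": ["structure"],
    "building": ["structure"],
    "novel": ["written work"],
}


def is_descendant_of(cls: str, root: str, edges: dict[str, list[str]]) -> bool:
    """Return True if `cls` equals `root` or reaches it via subclass-of edges."""
    seen = {cls}
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        if current == root:
            return True
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


def select_scene_candidates(entities: dict[str, str], root: str) -> list[str]:
    """Keep entities whose class falls under `root` in the toy hierarchy."""
    return sorted(
        name for name, cls in entities.items()
        if is_descendant_of(cls, root, SUBCLASS_OF)
    )


# Hypothetical entity -> class assignments.
entities = {
    "Example Minaret": "minaret",
    "Example Bridge": "bridge",
    "Example Novel": "novel",
}
candidates = select_scene_candidates(entities, "structure")
```

The breadth-first search with a `seen` set keeps the check safe even if the hierarchy contains cycles or multiple inheritance paths.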
**Evaluation:**
- **Data Mining and Evaluation**