21 Aug 2024 | Joseph Tung*¹, Gene Chou*¹, Ruojin Cai¹, Guandao Yang², Kai Zhang³, Gordon Wetzstein², Bharath Hariharan¹, and Noah Snavely¹ (¹Cornell University, ²Stanford University, ³Adobe Research; *equal contribution)
MegaScenes is a large-scale scene-level dataset containing over 430,000 scenes, with more than 100,000 structure-from-motion (SfM) reconstructions and over 2 million registered images. Sourced from Wikimedia Commons, it spans diverse scene categories such as minarets, building interiors, statues, bridges, towers, religious buildings, and natural landscapes. Each reconstruction carries 3D annotations, including SIFT keypoints, two-view geometries, sparse point clouds, and camera poses, and scenes are organized under a hierarchical class-label system derived from Wikidata. The dataset is designed to support scene-level novel view synthesis (NVS).
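Because the reconstructions come from a standard SfM pipeline, the per-scene pose annotations follow COLMAP conventions. As an illustration only (the directory layout and file name in the final comment are assumptions, not the dataset's documented API), a minimal Python sketch for reading camera poses from a COLMAP text-format `images.txt` might look like this:

```python
import numpy as np

def qvec_to_rotmat(qw, qx, qy, qz):
    """Convert a COLMAP quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    return np.array([
        [1 - 2*qy**2 - 2*qz**2, 2*qx*qy - 2*qz*qw,     2*qx*qz + 2*qy*qw],
        [2*qx*qy + 2*qz*qw,     1 - 2*qx**2 - 2*qz**2, 2*qy*qz - 2*qx*qw],
        [2*qx*qz - 2*qy*qw,     2*qy*qz + 2*qx*qw,     1 - 2*qx**2 - 2*qy**2],
    ])

def read_colmap_poses(images_txt_path):
    """Parse COLMAP's images.txt into {image_name: (R, t, camera_center)}.

    Each registered image occupies two lines: the first holds
    IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME, the second its 2D points.
    """
    poses = {}
    with open(images_txt_path) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    for pose_line in lines[::2]:          # skip the per-image 2D-point lines
        elems = pose_line.split()
        qw, qx, qy, qz, tx, ty, tz = map(float, elems[1:8])
        name = elems[9]
        R = qvec_to_rotmat(qw, qx, qy, qz)  # world-to-camera rotation
        t = np.array([tx, ty, tz])          # world-to-camera translation
        center = -R.T @ t                   # camera center in world coordinates
        poses[name] = (R, t, center)
    return poses

# Hypothetical path into one scene's sparse reconstruction:
# poses = read_colmap_poses("megascenes/<scene>/sparse/0/images.txt")
```

Since COLMAP stores world-to-camera poses, the camera center in world coordinates is recovered as -Rᵀt, which is the quantity typically needed when reasoning about viewpoint coverage across a scene.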
The paper introduces MegaScenes to address the lack of diverse, scene-level data for training 3D-aware models, overcoming the limitations of existing datasets by building a large-scale 3D dataset from internet photo collections. The authors use it to train and evaluate state-of-the-art NVS methods such as Zero-1-to-3 and ZeroNVS, both of which perform significantly better across multiple benchmarks after training on MegaScenes. They also propose an improved method that strengthens pose consistency by conditioning on warped images and extrinsic matrices, yielding more consistent and realistic results.
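The warped-image conditioning amounts to reprojecting the source view into the target camera using per-pixel depth. As a rough sketch of that geometry (not the paper's implementation; the depth source, shared intrinsics, and nearest-neighbor splatting here are assumptions), a depth-based forward warp in NumPy could be written as:

```python
import numpy as np

def warp_to_target(src_img, src_depth, K, R, t):
    """Forward-warp src_img (H, W, 3) into the target view by reprojection.

    src_depth: (H, W) depth in the source camera.
    K: (3, 3) pinhole intrinsics assumed shared by both views.
    R, t: relative pose mapping source-camera points to target-camera points.
    Returns the warped image and a validity mask (unfilled pixels stay 0).
    """
    H, W = src_depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, HW)

    # Back-project to 3D in the source camera, then move to the target camera.
    pts_src = (np.linalg.inv(K) @ pix) * src_depth.reshape(1, -1)
    pts_tgt = R @ pts_src + t[:, None]

    # Project into the target image plane; round to nearest target pixel.
    proj = K @ pts_tgt
    z = proj[2]
    valid = z > 1e-6
    ut = np.round(np.divide(proj[0], z, where=valid)).astype(int)
    vt = np.round(np.divide(proj[1], z, where=valid)).astype(int)
    valid &= (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)

    # Nearest-neighbor splat; no z-buffer, so occlusions resolve arbitrarily.
    warped = np.zeros_like(src_img)
    mask = np.zeros((H, W), dtype=bool)
    warped[vt[valid], ut[valid]] = src_img.reshape(-1, src_img.shape[-1])[valid]
    mask[vt[valid], ut[valid]] = True
    return warped, mask
```

Real conditioning pipelines would additionally resolve occlusion with a z-buffer and leave holes for the generator to fill; the validity mask here simply marks which target pixels received a source pixel.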
The paper evaluates MegaScenes on scene-level NVS and demonstrates that models trained on it generalize better to in-the-wild scenes, with notable gains in consistency and realism. The dataset, code, and pretrained models are publicly available for further research. The paper closes by discussing limitations of current methods and directions for future work, including incorporating lighting conditions and handling large camera motions.