SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

19 Mar 2024 | Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvar Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas
SceneScript is a method that directly produces full scene models as a sequence of structured language commands, using an autoregressive, token-based approach. Inspired by recent successes with transformers and large language models (LLMs), SceneScript departs from traditional methods that describe scenes as meshes, voxel grids, point clouds, or radiance fields; instead, it infers structured language commands directly from encoded visual data with a scene-language encoder-decoder architecture. To train SceneScript, the authors generated Aria Synthetic Environments (ASE), a large-scale synthetic dataset of 100,000 high-quality indoor scenes with photorealistic, ground-truth-annotated renders of egocentric scene walkthroughs.

The structured language is built from commands such as make_wall, make_door, make_window, and make_bbox, which together define a full scene representation that is compact, interpretable, and extensible.
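To make the representation concrete, the sketch below shows what a SceneScript-style program might look like and how it could be parsed into scene entities. The command names (make_wall, make_door, make_window, make_bbox) are the ones named in the paper, but the parameter names and the parser are illustrative assumptions rather than the authors' exact schema or code.

```python
from dataclasses import dataclass, field

# Illustrative SceneScript-style program, one command per line.
# The command names follow the paper; the parameter names are
# assumptions chosen for readability, not the paper's exact schema.
SCENE_PROGRAM = """
make_wall, id=0, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.7
make_wall, id=1, a_x=4.0, a_y=0.0, b_x=4.0, b_y=3.0, height=2.7
make_door, id=2, wall_id=0, position_x=1.2, width=0.9, height=2.0
make_window, id=3, wall_id=1, position_x=1.5, width=1.2, height=1.1
make_bbox, id=4, class=chair, position_x=2.0, position_y=1.0, angle_z=0.0, scale_x=0.5, scale_y=0.5, scale_z=0.9
"""


@dataclass
class Command:
    """One structured-language command: a name plus typed parameters."""
    name: str
    params: dict = field(default_factory=dict)


def parse_program(text: str) -> list[Command]:
    """Parse a SceneScript-style program into a list of Command objects."""
    commands = []
    for line in text.strip().splitlines():
        name, *args = [part.strip() for part in line.split(",")]
        params = {}
        for arg in args:
            key, value = arg.split("=")
            try:
                params[key] = float(value)   # numeric parameters
            except ValueError:
                params[key] = value          # categorical parameters, e.g. class=chair
        commands.append(Command(name, params))
    return commands


if __name__ == "__main__":
    for cmd in parse_program(SCENE_PROGRAM):
        print(cmd.name, cmd.params)
```

Because a scene is just a typed command sequence, supporting a new entity amounts to adding one more command name and its parameters; the parsing and tokenization machinery stays the same. The paper exploits exactly this property, as discussed below.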
SceneScript achieves state-of-the-art results in architectural layout estimation and competitive results in 3D object detection. Although it is trained entirely on ASE, it generalizes well to real scenes, and it is evaluated on both Aria Synthetic Environments and ScanNet.

The architecture is a simple encoder-decoder that consumes a video sequence and returns SceneScript language in tokenized form: the encoder processes the egocentric video walkthrough, and the decoder autoregressively generates the structured language commands (a minimal sketch of this kind of decoding loop is given below).

A key advantage of SceneScript is how easily it adapts to new commands, which lets it represent novel scene entities. Introducing a single new command, for example, allows SceneScript to predict object parts jointly with the layout and object bounding boxes, and the same mechanism lets it reconstruct objects of multiple categories; the commands are also designed to be extended with states or other functional aspects. Beyond offline reconstruction, the method supports interactive use by streaming live reconstructions into a VR headset.

By representing 3D scenes as a compact, editable, and interpretable language, this work opens up new research directions and brings the 3D reconstruction community closer to recent advances in large language models.
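The following is a minimal sketch, in PyTorch, of the kind of encoder-decoder, token-based inference loop described above: a latent scene code conditions an autoregressive decoder that emits tokens until a stop symbol. The vocabulary size, dimensions, special tokens, and the stand-in for the visual encoder are all placeholder assumptions; this is not the authors' architecture or code.

```python
import torch
import torch.nn as nn

# Placeholder sizes and special tokens; the paper's actual vocabulary,
# dimensions, and tokenization scheme are not reproduced here.
VOCAB_SIZE = 512
START, STOP = 0, 1
D_MODEL, MAX_TOKENS = 256, 128


class SceneLanguageDecoder(nn.Module):
    """Minimal sketch: a latent scene code from a visual encoder conditions an
    autoregressive transformer decoder that emits structured-language tokens."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_TOKENS, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens, scene_code):
        # tokens: (B, T) int64; scene_code: (B, S, D_MODEL) from the encoder.
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(  # standard causal mask for autoregressive decoding
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.decoder(x, scene_code, tgt_mask=causal)
        return self.head(x)  # (B, T, VOCAB_SIZE) next-token logits


@torch.no_grad()
def greedy_decode(model, scene_code):
    """Emit tokens one at a time until STOP (greedy decoding for simplicity)."""
    tokens = torch.full((scene_code.shape[0], 1), START, dtype=torch.long)
    for _ in range(MAX_TOKENS - 1):
        logits = model(tokens, scene_code)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == STOP).all():
            break
    return tokens  # to be detokenized back into SceneScript commands


if __name__ == "__main__":
    # Stand-in for the encoder output; in the paper this would be computed
    # from the egocentric video walkthrough (or geometry derived from it).
    model = SceneLanguageDecoder().eval()
    scene_code = torch.randn(1, 64, D_MODEL)
    print(greedy_decode(model, scene_code).shape)
```

Greedy decoding is used here purely for brevity; the key point is that the scene emerges as a discrete token sequence that detokenizes directly into the structured language shown earlier.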