SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

19 Mar 2024 | Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvar Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas
SceneScript is a method that directly produces full scene models as a sequence of structured language commands, using an autoregressive, token-based approach. Inspired by recent successes with transformers and large language models (LLMs), SceneScript departs from traditional methods that describe scenes as meshes, voxel grids, point clouds, or radiance fields; instead, it infers structured language commands directly from encoded visual data with a scene-language encoder-decoder architecture. To train SceneScript, the authors generated Aria Synthetic Environments (ASE), a large-scale synthetic dataset of 100,000 high-quality indoor scenes with photorealistic, ground-truth-annotated renders of egocentric scene walkthroughs.

The structured language is built from commands such as make_wall, make_door, make_window, and make_bbox, which together define a full scene representation that is compact, interpretable, and extensible.
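To make the representation concrete, the sketch below shows what a SceneScript-style program might look like and how it could be parsed into scene entities. The command names (make_wall, make_door, make_window, make_bbox) are the ones named in the paper, but the parameter names and the parser are illustrative assumptions rather than the authors' exact schema or code.

```python
from dataclasses import dataclass, field

# Illustrative SceneScript-style program, one command per line.
# The command names follow the paper; the parameter names are
# assumptions chosen for readability, not the paper's exact schema.
SCENE_PROGRAM = """
make_wall, id=0, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.7
make_wall, id=1, a_x=4.0, a_y=0.0, b_x=4.0, b_y=3.0, height=2.7
make_door, id=2, wall_id=0, position_x=1.2, width=0.9, height=2.0
make_window, id=3, wall_id=1, position_x=1.5, width=1.2, height=1.1
make_bbox, id=4, class=chair, position_x=2.0, position_y=1.0, angle_z=0.0, scale_x=0.5, scale_y=0.5, scale_z=0.9
"""


@dataclass
class Command:
    """One structured-language command: a name plus typed parameters."""
    name: str
    params: dict = field(default_factory=dict)


def parse_program(text: str) -> list[Command]:
    """Parse a SceneScript-style program into a list of Command objects."""
    commands = []
    for line in text.strip().splitlines():
        name, *args = [part.strip() for part in line.split(",")]
        params = {}
        for arg in args:
            key, value = arg.split("=")
            try:
                params[key] = float(value)   # numeric parameters
            except ValueError:
                params[key] = value          # categorical parameters, e.g. class=chair
        commands.append(Command(name, params))
    return commands


if __name__ == "__main__":
    for cmd in parse_program(SCENE_PROGRAM):
        print(cmd.name, cmd.params)
```

Because a scene is just a typed command sequence, supporting a new entity amounts to adding one more command name and its parameters; the parsing and tokenization machinery stays the same. The paper exploits exactly this property, as discussed below.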
SceneScript achieves state-of-the-art results in architectural layout estimation and competitive results in 3D object detection. Although it is trained entirely on ASE, it generalizes well to real scenes, and it is evaluated on both Aria Synthetic Environments and ScanNet.

The architecture is a simple encoder-decoder that consumes a video sequence and returns SceneScript language in tokenized form: the encoder processes the egocentric video walkthrough, and the decoder autoregressively generates the structured language commands (a minimal sketch of this kind of decoding loop is given below).

A key advantage of SceneScript is how easily it adapts to new commands, which lets it represent novel scene entities. Introducing a single new command, for example, allows SceneScript to predict object parts jointly with the layout and object bounding boxes, and the same mechanism lets it reconstruct objects of multiple categories; the commands are also designed to be extended with states or other functional aspects. Beyond offline reconstruction, the method supports interactive use by streaming live reconstructions into a VR headset.

By representing 3D scenes as a compact, editable, and interpretable language, this work opens up new research directions and brings the 3D reconstruction community closer to recent advances in large language models.
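The following is a minimal sketch, in PyTorch, of the kind of encoder-decoder, token-based inference loop described above: a latent scene code conditions an autoregressive decoder that emits tokens until a stop symbol. The vocabulary size, dimensions, special tokens, and the stand-in for the visual encoder are all placeholder assumptions; this is not the authors' architecture or code.

```python
import torch
import torch.nn as nn

# Placeholder sizes and special tokens; the paper's actual vocabulary,
# dimensions, and tokenization scheme are not reproduced here.
VOCAB_SIZE = 512
START, STOP = 0, 1
D_MODEL, MAX_TOKENS = 256, 128


class SceneLanguageDecoder(nn.Module):
    """Minimal sketch: a latent scene code from a visual encoder conditions an
    autoregressive transformer decoder that emits structured-language tokens."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_TOKENS, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens, scene_code):
        # tokens: (B, T) int64; scene_code: (B, S, D_MODEL) from the encoder.
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(  # standard causal mask for autoregressive decoding
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.decoder(x, scene_code, tgt_mask=causal)
        return self.head(x)  # (B, T, VOCAB_SIZE) next-token logits


@torch.no_grad()
def greedy_decode(model, scene_code):
    """Emit tokens one at a time until STOP (greedy decoding for simplicity)."""
    tokens = torch.full((scene_code.shape[0], 1), START, dtype=torch.long)
    for _ in range(MAX_TOKENS - 1):
        logits = model(tokens, scene_code)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == STOP).all():
            break
    return tokens  # to be detokenized back into SceneScript commands


if __name__ == "__main__":
    # Stand-in for the encoder output; in the paper this would be computed
    # from the egocentric video walkthrough (or geometry derived from it).
    model = SceneLanguageDecoder().eval()
    scene_code = torch.randn(1, 64, D_MODEL)
    print(greedy_decode(model, scene_code).shape)
```

Greedy decoding is used here purely for brevity; the key point is that the scene emerges as a discrete token sequence that detokenizes directly into the structured language shown earlier.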