UnO: Unsupervised Occupancy Fields for Perception and Forecasting

**Abstract:** Perceiving and forecasting the world's future state is crucial for autonomous driving. Traditional methods rely on annotated object labels, which are costly and limited to predefined categories. This paper introduces UnO (Unsupervised Occupancy), a world model that learns to predict 3D occupancy over time from unlabeled LiDAR data. UnO can be easily transferred to downstream tasks such as point cloud forecasting and bird’s-eye view (BEV) semantic occupancy. The model achieves state-of-the-art performance on Argoverse 2, nuScenes, and KITTI for point cloud forecasting and outperforms fully supervised methods in BEV semantic occupancy prediction, especially with limited labeled data. UnO also achieves higher recall of objects relevant to self-driving than prior state-of-the-art methods.

**Introduction:** The paper addresses the challenges of perceiving and forecasting the environment for autonomous driving. Traditional methods use supervised learning with object labels, but these labels are expensive and limited. UnO instead learns a continuous 4D occupancy field from LiDAR data through self-supervision. The model can be queried at any continuous point in space and time, capturing geometry, dynamics, and semantics. It is designed to handle complex shapes and dynamic objects accurately.

**Related Work:** The paper reviews related work in point cloud forecasting, occupancy forecasting from sensors, and pre-training for perception. UnO differs from prior work by learning a continuous 4D occupancy field and by decoupling the training of the occupancy model from the point cloud renderer.

**Unsupervised Occupancy World Model:** UnO models 4D occupancy as a continuous field, allowing for fine-grained spatial and temporal resolution. The model uses a 4D implicit occupancy forecasting architecture to predict occupancy at any query point. Training involves generating pseudo-labels from future LiDAR data and using a binary cross-entropy loss.

**Transferring UnO to Downstream Tasks:** UnO is transferred to point cloud forecasting by learning a lightweight neural network that predicts ray depth from UnO-estimated occupancy values. For BEV semantic occupancy forecasting, UnO is fine-tuned on a small set of labeled data, leveraging semantic object annotations.

**Experiments:** UnO is evaluated on multiple datasets, showing significant improvements over state-of-the-art methods in point cloud forecasting and BEV semantic occupancy prediction. It also demonstrates superior geometric occupancy recall for relevant object classes.

**Conclusion:** UnO is an unsupervised occupancy world model that learns to predict 4D geometric occupancy from past LiDAR data. It transfers effectively to downstream tasks in autonomous driving.
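The training recipe summarized above (pseudo-labels from future LiDAR data plus a binary cross-entropy loss) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that points along a LiDAR ray before the return are labeled free, and that points in a short buffer just behind the return are labeled occupied. The function names, the buffer length `delta`, and the sample counts are hypothetical choices made for illustration.

```python
import numpy as np

def sample_pseudo_labels(origin, hit, t, n_free=4, n_occ=2, delta=0.1, rng=None):
    """Sample labeled 4D query points (x, y, z, t) along one future LiDAR ray.

    Assumption (illustrative only): space between the sensor and the return
    is free (label 0); a buffer of length `delta` behind the return is
    occupied (label 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    origin = np.asarray(origin, dtype=float)
    hit = np.asarray(hit, dtype=float)
    depth = np.linalg.norm(hit - origin)
    direction = (hit - origin) / depth

    d_free = rng.uniform(0.0, depth, size=n_free)          # before the return
    d_occ = rng.uniform(depth, depth + delta, size=n_occ)  # just behind it
    dists = np.concatenate([d_free, d_occ])

    xyz = origin + dists[:, None] * direction
    times = np.full((len(dists), 1), float(t))             # timestamp of the sweep
    points = np.hstack([xyz, times])
    labels = np.concatenate([np.zeros(n_free), np.ones(n_occ)])
    return points, labels

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted occupancy and pseudo-labels."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))
```

In a training loop, an occupancy model would be queried at `points` and its predictions compared with `labels` through `binary_cross_entropy`. The model itself is outside the scope of this sketch.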

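For point cloud forecasting, the summary describes a lightweight learned network that maps occupancy values along a ray to a depth. As a rough intuition for that mapping, the sketch below uses a closed-form stand-in rather than the learned renderer: it treats each occupancy sample as an independent probability of blocking the ray and returns the expected depth of the first hit. The function name and the choice to assign rays that never hit to the farthest sample are assumptions made for illustration.

```python
import numpy as np

def expected_ray_depth(occupancy, depths):
    """Expected first-hit depth along a ray from sampled occupancy values.

    occupancy: occupancy probabilities in [0, 1], ordered near to far.
    depths:    distances from the sensor of the corresponding samples.
    Closed-form stand-in for a learned renderer (illustrative assumption).
    """
    occupancy = np.asarray(occupancy, dtype=float)
    depths = np.asarray(depths, dtype=float)

    # Probability that the ray is still unblocked when it reaches each sample.
    free_before = np.concatenate([[1.0], np.cumprod(1.0 - occupancy)[:-1]])
    # Probability that the first hit happens exactly at each sample.
    hit_prob = free_before * occupancy
    # Leftover probability mass: the ray passes every sample without a hit.
    miss_prob = 1.0 - hit_prob.sum()

    return float(np.sum(hit_prob * depths) + miss_prob * depths[-1])
```

A learned renderer can correct for effects this formula ignores, such as sensor noise or correlated occupancy along the ray.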

12 Jun 2024 | Ben Agro*1,2, Quinlan Sykora*1,2, Sergio Casas*1,2, Thomas Gilles1, Raquel Urtasun1,2
