4D Panoptic Scene Graph Generation

2023 | Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu
This paper introduces the 4D Panoptic Scene Graph (PSG-4D), a novel representation that bridges raw visual data in a dynamic 4D world with high-level visual understanding. PSG-4D captures both spatial and temporal information: fine-grained semantics at the pixel level and relational information over time. Its nodes represent entities with accurate location and status information, while its edges capture the temporal relations between those entities.

The authors build a dataset of 3K RGB-D videos totaling 1M frames, each labeled with 4D panoptic segmentation masks and dynamic scene graphs. They also propose PSG4DFormer, a Transformer-based model that predicts panoptic segmentation masks, tracks the masks over time, and generates scene graphs. Evaluation on the new dataset demonstrates the model's effectiveness, and a real-world application shows a large language model integrated into the PSG-4D system to enable dynamic scene understanding.

The paper additionally reviews related work on scene graph generation, 3D scene graph generation, and 4D perception, and highlights the importance of depth information and temporal attention for this task. It concludes by discussing challenges and future directions, including the need for more efficient algorithms and more comprehensive datasets, notes potential societal impacts, and acknowledges the support received for this research.
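The node-and-edge structure described above (entities as nodes, time-stamped relations as edges) can be sketched as a minimal data structure. This is a hypothetical illustration, not the paper's actual API; all class and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An entity tracked through the 4D scene (hypothetical schema)."""
    node_id: int
    category: str                      # e.g. "person", "ball"
    # per-frame 3D state: frame index -> (x, y, z) centroid of its mask
    trajectory: dict = field(default_factory=dict)

@dataclass
class Edge:
    """A relation between two entities over a span of frames."""
    subject_id: int
    object_id: int
    predicate: str                     # e.g. "throwing", "next_to"
    frame_span: tuple                  # (start_frame, end_frame), inclusive

class SceneGraph4D:
    """Container for a dynamic scene graph over an RGB-D video."""
    def __init__(self):
        self.nodes: dict[int, Node] = {}
        self.edges: list[Edge] = []

    def relations_at(self, frame: int) -> list[tuple[str, str, str]]:
        """Return (subject, predicate, object) triplets active at a frame."""
        return [
            (self.nodes[e.subject_id].category, e.predicate,
             self.nodes[e.object_id].category)
            for e in self.edges
            if e.frame_span[0] <= frame <= e.frame_span[1]
        ]

# Usage: a person throws a ball during frames 10-40
g = SceneGraph4D()
g.nodes[0] = Node(0, "person")
g.nodes[1] = Node(1, "ball")
g.edges.append(Edge(0, 1, "throwing", (10, 40)))
print(g.relations_at(25))   # [('person', 'throwing', 'ball')]
```

Querying relations by frame index is what distinguishes this dynamic graph from a static scene graph: the same pair of entities can hold different predicates at different times.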