From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

24 Apr 2024 | Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He
This paper introduces an open-vocabulary scene graph generation (SGG) framework based on vision-language models (VLMs), which generates scene graphs containing both known and novel visual relation triplets from images. The framework, named Pixels to Scene Graph Generation with Generative VLM (PGSG), formulates SGG as an image-to-sequence generation task, leveraging the image-to-text generation capability of pre-trained VLMs. By first generating scene graph sequences via image-to-text generation and then constructing scene graphs from these sequences, the framework exploits the rich visual-linguistic knowledge of VLMs for relation-aware representation without requiring additional pre-training. It also introduces a scene graph prompt and a plug-and-play relationship construction module for more efficient model learning.

The framework is evaluated on three SGG benchmarks: Panoptic Scene Graph, OpenImages-V6, and Visual Genome, achieving state-of-the-art performance in the general open-vocabulary setting. It is further applied to multiple vision-language (VL) tasks, where it yields consistent improvements, highlighting the effectiveness of relational knowledge transfer. The unified design enables seamless adaptation to various VL tasks by using the fine-tuned VLM as an initialization, and the method addresses the core challenge of open-vocabulary SGG by generating scene graphs with unseen predicate concepts while enhancing downstream VL tasks through explicit relation modeling.

The framework's design allows for efficient training and inference, achieving a good trade-off between performance and computational cost. The results show that the proposed method outperforms existing approaches in both open-vocabulary and closed-vocabulary SGG settings, demonstrating its effectiveness in generating accurate and comprehensive scene graphs.
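To make the sequence-to-graph step concrete, below is a minimal Python sketch of how a generated scene-graph sequence could be parsed into relation triplets. The marker tokens (`[REL]` / `[/REL]`) and the word-level split are illustrative assumptions, not the actual scene graph prompt format defined in the PGSG paper.

```python
import re
from typing import List, Tuple

def parse_scene_graph_sequence(sequence: str) -> List[Tuple[str, str, str]]:
    """Parse a VLM-generated sequence into (subject, predicate, object) triplets.

    Assumes a hypothetical format where each relation is wrapped in marker
    tokens, e.g. "[REL] person riding horse [/REL]"; the real PGSG prompt
    tokens and decoding scheme are specified in the paper, not here.
    """
    triplets = []
    for span in re.findall(r"\[REL\](.*?)\[/REL\]", sequence):
        words = span.strip().split()
        if len(words) >= 3:
            # Naively treat the first word as subject, the last as object,
            # and everything in between as the (possibly multi-word) predicate.
            subject, predicate, obj = words[0], " ".join(words[1:-1]), words[-1]
            triplets.append((subject, predicate, obj))
    return triplets

# Example: a sequence a fine-tuned VLM might emit for an image
generated = "[REL] person riding horse [/REL] [REL] horse standing-on grass [/REL]"
print(parse_scene_graph_sequence(generated))
# [('person', 'riding', 'horse'), ('horse', 'standing-on', 'grass')]
```

In the actual framework, a relationship construction module additionally grounds the subject and object phrases to image regions; the sketch above only covers the text-side parsing.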