From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

24 Apr 2024 | Rongjie Li*, Songyang Zhang*, Dahua Lin, Kai Chen‡, Xuming He‡
The paper introduces a novel framework for open-vocabulary scene graph generation (SGG) built on vision-language models (VLMs). The framework, named Pixels to Scene Graph Generation with Generative VLM (PGSG), casts SGG as image-to-text generation: the VLM first generates a scene graph sequence, from which the scene graph is then constructed. This formulation lets the model handle novel visual relation concepts and injects explicit relational modeling that benefits downstream vision-language tasks. The framework has three main components: scene graph prompts, a relationship construction module, and a VLM-based fine-tuning strategy. Experiments on three SGG benchmarks (Panoptic Scene Graph, OpenImages-V6, and Visual Genome) show superior performance in the general open-vocabulary setting, and applying the framework to several vision-language tasks yields consistent improvements. The contributions are a novel framework for open-vocabulary SGG, efficient model learning through scene graph prompts and relation-aware conversion, and strong results on both SGG benchmarks and downstream tasks.
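To make the sequence-to-graph step concrete, here is a minimal sketch (not the authors' code) of how a VLM-generated scene graph sequence could be parsed into (subject, predicate, object) triplets. The serialization format, delimiters, and function names are assumptions chosen for illustration; PGSG additionally grounds each entity to an image region via its relationship construction module, which this sketch omits.

from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def parse_scene_graph_sequence(sequence: str) -> List[Triplet]:
    """Parse a flat scene graph sequence into relation triplets.

    Assumes relations are separated by ';' and each relation is written as
    'subject, predicate, object' -- a hypothetical serialization used only
    to illustrate the sequence-to-graph conversion.
    """
    triplets: List[Triplet] = []
    for chunk in sequence.split(";"):
        parts = [p.strip() for p in chunk.split(",")]
        if len(parts) == 3 and all(parts):
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets


if __name__ == "__main__":
    # Example of the kind of sequence a generative VLM might emit for a street scene.
    generated = "person, riding, horse; horse, standing on, grass; person, wearing, hat"
    for subj, pred, obj in parse_scene_graph_sequence(generated):
        print(f"({subj}) --[{pred}]--> ({obj})")

Because the predicate and entity names are free-form text rather than indices into a fixed label set, a parser of this kind naturally accommodates relation concepts never seen during training, which is the key property the open-vocabulary setting requires.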