The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

23 Aug 2024 | Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, and Jifeng Dai
The All-Seeing Project V2 introduces a new model and dataset for understanding object relations in images. The All-Seeing Model V2 (ASMv2) integrates text generation, object localization, and relation comprehension into a single Relation Conversation (ReC) task, enabling the model to perceive and recognize all objects in an image and grasp the intricate relation graph between them. This formulation also reduces relation hallucination in Multi-modal Large Language Models (MLLMs). The project further contributes AS-V2, the first high-quality ReC dataset aligned with standard instruction-tuning data, and the Circular-based Relation Probing Evaluation (CRPE) benchmark for assessing relation comprehension.

ASMv2 achieves an overall accuracy of 64.50 on CRPE, surpassing LLaVA-1.5 by a large margin, and performs strongly on Open-ended Scene Graph Generation. On general vision-language benchmarks it scores 74.4 on MMBench and 1621.0 on MME, and it also performs well on region-level tasks such as Referring Expression Comprehension and Region Captioning. The model is trained on a diverse set of multimodal corpora, including CC3M, LLaVA-1.5 data, and AS-V2, and demonstrates strong performance across multiple benchmarks. The project aims to advance artificial general intelligence by enabling models to understand and generate scene graphs in an open-ended manner.
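To make the "circular-based" evaluation idea concrete, below is a minimal Python sketch of how a CircularEval-style score can be computed: each multiple-choice question is posed several times with its answer options rotated, and the question only counts as correct if every rotation is answered correctly. This is an illustrative assumption about the scoring logic, not the paper's reference implementation; the record fields (`question_id`, `prediction`, `answer`) are hypothetical names chosen for the example.

```python
from collections import defaultdict

def circular_accuracy(records):
    """Score multiple-choice predictions under a circular protocol (sketch).

    Each record describes one rotated variant of a question:
      - "question_id": groups the rotated variants of the same question
      - "prediction":  the option letter the model chose (e.g. "B")
      - "answer":      the correct option letter for that rotated variant

    A question counts as correct only if *all* of its rotated variants are
    answered correctly; accuracy is reported over unique questions.
    """
    per_question = defaultdict(list)
    for rec in records:
        per_question[rec["question_id"]].append(rec["prediction"] == rec["answer"])

    correct = sum(all(hits) for hits in per_question.values())
    return correct / len(per_question) if per_question else 0.0


# Toy usage: one question asked under two option rotations.
records = [
    {"question_id": "q1", "prediction": "B", "answer": "B"},  # rotation 1: correct
    {"question_id": "q1", "prediction": "C", "answer": "D"},  # rotation 2: wrong
]
print(circular_accuracy(records))  # 0.0 -- q1 fails because one rotation is wrong
```

The design intent of this kind of scoring is to penalize models that guess or latch onto option positions: a relation must be recognized consistently regardless of how the candidate answers are ordered.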