The paper introduces the All-Seeing Project V2, a new model and dataset designed to enhance the understanding of object relations in images. The project aims to address the limitations of existing Multi-modal Large Language Models (MLLMs) in comprehending and generating scene graphs, particularly in capturing intricate object relationships. Key contributions include:
1. **All-Seeing Model V2 (ASMv2)**: A novel model that integrates text generation, object localization, and relation comprehension into a unified relation conversation (ReC) task. ASMv2 excels in perceiving and recognizing objects within images and grasping their complex relationships, reducing relation hallucination.
2. **All-Seeing Dataset V2 (AS-V2)**: A high-quality dataset for training MLLMs, consisting of over 127K samples for ReC. The dataset is built on existing caption, location, and relation annotations, enhancing the model's ability to handle various vision-language tasks.
3. **Circular-based Relation Probing Evaluation (CRPE)**: A benchmark that systematically probes the relation comprehension capabilities of MLLMs. CRPE consists of four splits: Existence, Subject, Predicate, and Object.
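The summary does not spell out the scoring protocol, but a circular-based probe is typically implemented by re-asking each multiple-choice question with its options rotated and crediting the model only if it answers correctly under every rotation. The sketch below illustrates that idea; the `ask_model` interface and the item schema are hypothetical and not taken from the CRPE release.

```python
# Hedged sketch of a circular-evaluation scoring loop for a CRPE-style benchmark.
# Assumptions (not from the paper): each item is a multiple-choice question with
# one correct option, and `ask_model` is a hypothetical callable returning the
# chosen option text. An item counts as correct only if the model is right under
# every circular shift of the candidate options.

from itertools import cycle, islice
from typing import Callable, Sequence


def rotations(options: Sequence[str]) -> list[list[str]]:
    """All circular shifts of the option list, e.g. ABCD, BCDA, CDAB, DABC."""
    n = len(options)
    return [list(islice(cycle(options), k, k + n)) for k in range(n)]


def circular_accuracy(
    questions: list[dict],                        # each: {"question", "options", "answer"}
    ask_model: Callable[[str, list[str]], str],   # hypothetical model interface
) -> float:
    correct = 0
    for item in questions:
        # Credit the item only if every rotated presentation yields the answer.
        ok = all(
            ask_model(item["question"], opts) == item["answer"]
            for opts in rotations(item["options"])
        )
        correct += ok
    return correct / max(len(questions), 1)
```

Scoring this way penalizes a model that picks an option by position rather than by actually comprehending the probed subject, predicate, or object.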
The paper also discusses the training process of ASMv2, which involves two stages: pre-training and instruction-tuning. Pre-training uses a blend of CC3M and LLaVA-1.5 datasets, while instruction-tuning employs a mixture of image-level and region-level data. The model is evaluated on various benchmarks, including multi-modal, region-level, and open-ended scene graph generation tasks, demonstrating superior performance compared to current state-of-the-art models.
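For the open-ended scene graph generation evaluation, output expressed as a relation conversation has to be converted back into (subject, predicate, object) triplets before it can be matched against ground-truth annotations. The exact markup ASMv2 emits is not reproduced in this summary, so the sketch below assumes a simple grounding-style format in which entities are wrapped in `<ref>` tags with optional `<box>` coordinates and predicates in `<pred>` tags; the tag names, regex, and linking heuristic are illustrative assumptions.

```python
# Hedged sketch: turning a relation-conversation style response into
# (subject, predicate, object) triplets for open-ended scene graph evaluation.
# The <ref>/<pred>/<box> tag names and the nearest-entity linking scheme are
# assumptions for illustration, not the paper's exact format.

import re

TOKEN = re.compile(r"<(ref|pred)>(.*?)</\1>\s*((?:<box>\[.*?\]</box>\s*)*)")


def parse_triplets(text: str) -> list[tuple[str, str, str]]:
    """Extract subject-predicate-object triplets from tagged text.

    Assumes entities and predicates appear in reading order, so each predicate
    links the entity mentioned before it to the one mentioned after it.
    """
    spans = [(kind, phrase.strip()) for kind, phrase, _ in TOKEN.findall(text)]
    triplets = []
    for i, (kind, phrase) in enumerate(spans):
        if kind != "pred":
            continue
        subj = next((p for k, p in reversed(spans[:i]) if k == "ref"), None)
        obj = next((p for k, p in spans[i + 1:] if k == "ref"), None)
        if subj and obj:
            triplets.append((subj, phrase, obj))
    return triplets


print(parse_triplets(
    "The <ref>man</ref><box>[10,20,200,300]</box> is "
    "<pred>holding</pred> a <ref>cup</ref><box>[150,120,180,160]</box>."
))
# -> [('man', 'holding', 'cup')]
```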
The authors hope that their work will inspire further research and contribute to the development of artificial general intelligence, equipping AI systems with a deeper understanding of the world.