18 Jun 2024 | Jiaxing Chen, Yuxuan Liu, Dehu Li, Xiang An, Weimo Deng, Ziyong Feng, Yongle Zhao, Yin Xie
This paper introduces P²G, a novel framework for plug-and-play grounding of reasoning in Multimodal Large Language Models (MLLMs). P²G addresses the limitations of existing MLLMs in visual reasoning by leveraging external agents to obtain critical textual and visual clues. Through multimodal prompting, the framework enables MLLMs to generate grounded reasoning. The authors also introduce P²GB, a new benchmark of challenging visual reasoning tasks that assesses MLLMs' ability to reason over high-resolution, text-rich images. With only a 7B backbone, P²G achieves performance comparable to GPT-4V on P²GB, demonstrating its effectiveness in enhancing visual reasoning capabilities.
The paper also discusses limitations of the current work, including noise introduced by the agents, token count, and modality-interleaved reasoning. The authors conclude that P²G offers a promising alternative to model scaling for enhancing MLLM reasoning capabilities through plug-and-play grounding.