Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

18 Jun 2024 | Jiaxing Chen, Yuxuan Liu, Dehu Li, Xiang An, Weimo Deng, Ziyong Feng, Yongle Zhao, Yin Xie
The paper introduces P²G, a novel framework for plug-and-play grounding in Multimodal Large Language Models (MLLMs), which enhances their visual reasoning capabilities. P²G leverages external agents, such as OCR and visual grounding agents, to provide critical textual and visual clues, enabling MLLMs to perform more accurate and grounded reasoning. The authors also develop P²GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in high-resolution images. Extensive experiments demonstrate that P²G outperforms existing methods, achieving comparable performance to GPT-4V on P²GB with a 7B backbone. The work highlights the potential of using external agents for grounding reasoning in MLLMs, offering a promising alternative to model scaling.
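To make the plug-and-play idea concrete, below is a minimal, illustrative sketch of an agent-assisted grounding loop: a multimodal model first decides whether it needs external clues, and if so, OCR and visual-grounding agents supply textual and regional evidence that is folded back into the prompt. All class and method names (`MLLM`, `OCRAgent`, `GroundingAgent`, `plug_and_play_answer`, etc.) are hypothetical placeholders chosen for this sketch; they are not the paper's actual implementation or API.

```python
# Hypothetical sketch of a plug-and-play grounding loop in the spirit of P²G.
# Every name here is an illustrative placeholder, not the authors' code.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clue:
    """A textual or visual clue returned by an external agent."""
    kind: str      # e.g. "ocr" or "grounding"
    content: str   # textual description injected back into the prompt


class OCRAgent:
    def extract_text(self, image) -> List[Clue]:
        # Placeholder: a real agent would run an OCR model on the image.
        return [Clue(kind="ocr", content="<recognized text regions>")]


class GroundingAgent:
    def locate(self, image, query: str) -> List[Clue]:
        # Placeholder: a real agent would return boxes/crops for the query.
        return [Clue(kind="grounding", content=f"<region relevant to '{query}'>")]


class MLLM:
    def needs_grounding(self, image, question: str) -> bool:
        # Placeholder: the model (or a routing prompt) decides whether
        # external clues are required to answer reliably.
        return True

    def answer(self, image, question: str,
               clues: Optional[List[Clue]] = None) -> str:
        clue_text = "\n".join(c.content for c in clues or [])
        # Placeholder: a real MLLM would condition on the image, the question,
        # and the injected clues.
        return f"answer conditioned on: {question}\n{clue_text}"


def plug_and_play_answer(mllm: MLLM, image, question: str) -> str:
    """Answer a visual question, deferring to external agents when needed."""
    if not mllm.needs_grounding(image, question):
        return mllm.answer(image, question)
    clues = (OCRAgent().extract_text(image)
             + GroundingAgent().locate(image, question))
    return mllm.answer(image, question, clues=clues)
```

The key design point this sketch tries to capture is that the agents are external and swappable: the backbone MLLM is not retrained to perform OCR or localization, it only learns (or is prompted) to request and consume their outputs.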