Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study

30 May 2024 | Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
This paper presents an empirical study on enhancing Multimodal Large Language Models (MLLMs) with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination in responses. The authors investigate the embedding-based infusion of textual detection information, the impact of such infusion on MLLMs' original abilities, and the interchangeability of detection models. They conduct systematic and extensive experiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO, showing that their simple yet general approach not only refines MLLMs' performance on fine-grained visual tasks but also preserves their original strengths. Notably, the enhanced LLaVA-1.5 outperforms the original 7B/13B models on all 10 benchmarks, achieving an improvement of up to 12.5% on the normalized average score. The authors release their code to facilitate further exploration into the fine-grained multimodal capabilities of MLLMs.

The paper contributes to the field by providing a thorough empirical investigation of integrating object detection and OCR models into MLLMs, developing a simple yet general approach for fine-grained image understanding, and demonstrating significant performance improvements across various benchmarks.
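The core technique can be illustrated with a short sketch: the outputs of an object detector and an OCR model are serialized into plain text, and that text is supplied to the MLLM together with the user question and the image features from the vision encoder. The sketch below is an illustrative assumption, not the authors' released code; the Detection structure, the detections_to_text helper, and the 0.5 confidence threshold are hypothetical choices.

```python
# A minimal sketch (not the authors' released code) of textual detection
# infusion: detector / OCR outputs are serialized into plain text and
# combined with the user question before the MLLM consumes them.
# Detection, detections_to_text, and the threshold are illustrative.

from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str           # object class name, or the recognized string for OCR
    bbox: List[float]    # [x1, y1, x2, y2], normalized to [0, 1]
    score: float         # detector / recognizer confidence


def detections_to_text(objects: List[Detection],
                       ocr: List[Detection],
                       score_thresh: float = 0.5) -> str:
    """Serialize object-detection and OCR outputs into a textual block
    that can be placed next to the user prompt before it reaches the MLLM."""
    lines = ["Detected objects:"]
    for d in objects:
        if d.score >= score_thresh:
            coords = ", ".join(f"{v:.2f}" for v in d.bbox)
            lines.append(f"- {d.label} at [{coords}] (confidence {d.score:.2f})")
    if ocr:
        lines.append("Recognized text:")
        for t in ocr:
            coords = ", ".join(f"{v:.2f}" for v in t.bbox)
            lines.append(f'- "{t.label}" at [{coords}]')
    return "\n".join(lines)


if __name__ == "__main__":
    # Mock predictions standing in for DINO / PaddleOCRv2 outputs.
    objects = [
        Detection("traffic light", [0.62, 0.10, 0.68, 0.24], 0.91),
        Detection("person", [0.30, 0.45, 0.42, 0.95], 0.87),
    ]
    ocr = [Detection("STOP", [0.10, 0.12, 0.22, 0.20], 0.95)]

    detection_block = detections_to_text(objects, ocr)
    question = "Is the pedestrian allowed to cross right now?"

    # This combined prompt, together with the image features from the vision
    # encoder, is what an MLLM such as LLaVA-1.5 would consume.
    prompt = f"{detection_block}\n\nQuestion: {question}"
    print(prompt)
```

In practice the detection text can be handled either in a training-free way (inserted directly into the prompt) or used during fine-tuning so the model learns to exploit it; the sketch above only shows the prompt-assembly step.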