This paper presents an empirical study on enhancing Multimodal Large Language Models (MLLMs) with state-of-the-art object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination. The study investigates embedding-based infusion of textual detection information, the impact of such infusion on the MLLM's original abilities, and the interchangeability of the detection models. Through systematic and extensive experiments with LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO, the authors show that their simple yet general approach improves the MLLM's performance on fine-grained visual tasks while preserving its original strengths.

The enhanced LLaVA-1.5 outperforms the original 7B and 13B models on all 10 benchmarks, with overall improvements on the normalized average score of 12.5% for the 7B model and 11.5% for the 13B model. It also reduces hallucination, performs markedly better on counting and localization, and produces more accurate responses on OCR tasks. Replacing the detection models leaves the enhanced MLLM functional and effective, which leaves room for stronger future detectors; in particular, swapping in an open-set detector (Grounding DINO) enables question-driven detection. Among the training strategies examined, fine-tuning the original MLLM for one additional epoch while simultaneously infusing detection information proves more effective than either a training-free strategy or full retraining.

The authors' contributions are a thorough empirical investigation of integrating object detection and OCR models into an MLLM, an exploration of replacing the closed-set detector with an open-set one to enable question-driven detection, and a simple yet general approach for fine-grained image understanding. The study offers a series of progressive, practical insights on incorporating vision detection models into MLLMs and yields models with strong performance in counting, object localization, and text recognition. Limitations and future work, including refining the detection models and handling the extended token sequences, are discussed, and the code is publicly released to facilitate further research.
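To make the embedding-based infusion concrete, the sketch below shows one plausible way to serialize detector and OCR outputs into plain text that is concatenated with the user question before tokenization. The prompt template, field names, and coordinate format here are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch (assumed format, not the authors' exact pipeline):
# serialize detection and OCR outputs into a textual block that is
# prepended to the user question before it reaches the MLLM.

def format_detections(detections, ocr_results):
    """Turn raw detector/OCR outputs into a plain-text description.

    detections:  list of (label, confidence, (x1, y1, x2, y2)) tuples
    ocr_results: list of (text, confidence) tuples
    """
    lines = []
    if detections:
        lines.append("Detected objects:")
        for label, conf, (x1, y1, x2, y2) in detections:
            lines.append(f"- {label} (conf {conf:.2f}) at [{x1}, {y1}, {x2}, {y2}]")
    if ocr_results:
        lines.append("Recognized text:")
        for text, conf in ocr_results:
            lines.append(f'- "{text}" (conf {conf:.2f})')
    return "\n".join(lines)


def build_prompt(question, detections, ocr_results):
    """Infuse the textual detection information into the user prompt."""
    context = format_detections(detections, ocr_results)
    return f"{context}\n\n{question}" if context else question


# Example usage with made-up detector outputs:
dets = [("dog", 0.93, (34, 50, 210, 300)), ("frisbee", 0.88, (180, 40, 230, 90))]
ocr = [("STOP", 0.99)]
print(build_prompt("How many dogs are in the image?", dets, ocr))
```

The resulting text is then handled like any other prompt text by the MLLM's embedding layer, which is what makes the infusion model-agnostic.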
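The switch to an open-set detector is what enables question-driven detection: rather than detecting a fixed label set, the detector is prompted with phrases drawn from the user's question. The sketch below illustrates that idea with a hypothetical `detect` callable and a deliberately crude keyword extractor; neither reflects the authors' implementation or Grounding DINO's actual API.

```python
# Minimal sketch of question-driven detection, assuming a generic open-set
# detector interface. `detect` is a hypothetical callable; the real model's
# API and the authors' phrase-extraction logic may differ.

import re
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

STOPWORDS = {"the", "a", "an", "is", "are", "how", "many", "what", "where",
             "which", "in", "on", "of", "image", "picture", "there"}


def extract_query_phrases(question: str) -> List[str]:
    """Crude keyword extraction standing in for proper phrase grounding."""
    words = re.findall(r"[a-zA-Z]+", question.lower())
    return [w for w in words if w not in STOPWORDS]


def question_driven_detect(
    image,
    question: str,
    detect: Callable[[object, str], List[Tuple[str, float, Box]]],
) -> List[Tuple[str, float, Box]]:
    """Run the open-set detector only on phrases mentioned in the question."""
    phrases = extract_query_phrases(question)
    if not phrases:
        return []
    # Open-set detectors typically take a free-form text prompt; joining
    # phrases with " . " mimics a common convention (an assumption here).
    text_prompt = " . ".join(phrases)
    return detect(image, text_prompt)


# Dummy detector standing in for a real open-set model, just to show the flow.
dummy = lambda image, prompt: [(p, 0.9, (0.0, 0.0, 1.0, 1.0))
                               for p in prompt.split(" . ")]
print(question_driven_detect(None, "How many dogs are there?", dummy))
```

Restricting detection to what the question asks about also keeps the infused text short, which is relevant to the extended-token-sequence limitation the authors note.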