**TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models**
**Authors:** Ya-Qi Yu
**Abstract:**
Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks, but they are not well-suited for document-oriented tasks, which require fine-grained image perception and information compression. This paper presents TextHawk, an MLLM specifically designed for document-oriented tasks while preserving the general capabilities of MLLMs. TextHawk introduces four dedicated components to explore efficient fine-grained perception: a ReSampling and ReArrangement (ReSA) module to reduce redundancy and computational cost, Scalable Positional Embeddings (SPEs) to encode positions of local features, a Query Proposal Network (QPN) to dynamically initialize queries among sub-images, and a Multi-Level Cross-Attention (MLCA) mechanism to capture hierarchical structure and semantic relations. Additionally, a new instruction-tuning dataset, DocGemini, is created to enrich multimodal document data. Extensive experiments on general and document-oriented benchmarks demonstrate that TextHawk outperforms state-of-the-art methods, showcasing its effectiveness and superiority in fine-grained document perception and general vision-language abilities.
**Keywords:** Multimodal Large Language Models, Document Understanding, Visual Question Answering
**Introduction:**
Document-oriented MLLMs face unique challenges due to the high resolution and information density of document images. Previous works have attempted to address these challenges by increasing input resolution, using shape-adaptive cropping, and employing visual abstractors. However, there is still room for improvement in fine-grained visual perception and information compression. TextHawk aims to overcome these challenges by designing innovative components that enhance fine-grained visual perception and information compression.
**Method:**
TextHawk's architecture consists of a frozen visual encoder, a resampler, and a large language model equipped with LoRA and a detection head. The resampler, ReSA, combines resampling and rearrangement to compress visual tokens. MLCA captures hierarchical structure and semantic relations across encoder levels, SPEs extend positional embeddings to sub-images of arbitrary shapes, QPN dynamically generates queries for sub-images, and the detection head improves the efficiency of coordinate representation.
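To make the compression idea concrete, the following is a minimal PyTorch sketch of a resample-then-rearrange module in the spirit of ReSA, not TextHawk's actual implementation. It assumes the resampling stage is a Perceiver-style cross-attention with learnable queries and that the rearrangement stage fuses every four consecutive tokens by channel-wise concatenation followed by a linear projection; the class name `ReSASketch` and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class ReSASketch(nn.Module):
    """Illustrative resample-then-rearrange compressor (not the paper's code).

    Resampling: a fixed set of learnable queries cross-attends to the visual
    encoder tokens, cutting the sequence down to `num_queries` tokens.
    Rearrangement: every `merge` consecutive resampled tokens are concatenated
    along the channel axis and projected back, compressing the sequence by
    another factor of `merge`.
    """

    def __init__(self, vis_dim=1024, lm_dim=2048, num_queries=256, merge=4, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, lm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(
            lm_dim, heads, kdim=vis_dim, vdim=vis_dim, batch_first=True
        )
        self.merge = merge
        self.proj = nn.Linear(lm_dim * merge, lm_dim)

    def forward(self, vis_tokens):  # vis_tokens: (B, N, vis_dim)
        b = vis_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, Q, lm_dim)
        x, _ = self.attn(q, vis_tokens, vis_tokens)       # (B, Q, lm_dim)
        # Rearrangement: fuse `merge` neighbouring tokens into one.
        x = x.reshape(b, -1, self.merge * x.size(-1))     # (B, Q/merge, merge*lm_dim)
        return self.proj(x)                               # (B, Q/merge, lm_dim)

# Example: 4,096 patch tokens are compressed to 256 by resampling,
# then to 64 by rearrangement.
compressor = ReSASketch()
print(compressor(torch.randn(2, 4096, 1024)).shape)  # torch.Size([2, 64, 2048])
```

The sketch only illustrates the two-stage token compression; the full method additionally involves SPEs for sub-image positions, QPN-generated queries, and multi-level features consumed by MLCA, all omitted here.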
**Experiments:**
TextHawk is evaluated on various benchmarks, including general and document-oriented tasks. Results show that TextHawk outperforms state-of-the-art methods, demonstrating its superior fine-grained document perception and general vision-language abilities. Ablation studies validate the effectiveness of each component.
**Conclusion:**
TextHawk is a novel MLLM designed to address the unique challenges of document-oriented tasks. It introduces innovative components that enhance fine-grained visual perception and information compression, achieving superior performance on both document-oriented and general benchmarks.