TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

2024 | Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng
TextHawk is a Multimodal Large Language Model (MLLM) designed for document-oriented tasks while retaining general vision-language capabilities. It introduces four key components: ReSampling and ReArrangement (ReSA) to reduce redundancy and compress visual tokens, Scalable Positional Embeddings (SPEs) to encode the positions of sub-images, a Query Proposal Network (QPN) to dynamically initialize resampler queries, and a Multi-Level Cross-Attention (MLCA) mechanism to enhance fine-grained visual perception. TextHawk is also trained with a new instruction-tuning dataset enriched with Gemini Pro.

Architecturally, the model combines a frozen visual encoder, a resampler, and a large language model equipped with LoRA and a detection head. The resampler compresses and rearranges visual information, SPEs and QPN improve positional encoding and query initialization, and MLCA exploits the hierarchical structure and semantic relations of the encoder's feature maps to sharpen fine-grained perception.

Extensive experiments on both general and document-oriented benchmarks show that TextHawk outperforms state-of-the-art methods, demonstrating superior fine-grained document perception alongside strong general vision-language abilities, including on high-resolution document images.
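The summary above describes the resampler only at a high level. Below is a minimal PyTorch sketch of the general idea behind query-based resampling: a small set of learnable queries cross-attends to frozen visual features, compressing many patch tokens into far fewer visual tokens. The class name, hyperparameters (64 queries, 1024-dimensional features), and single-layer structure are illustrative assumptions, not the authors' implementation, which additionally rearranges the compressed tokens (ReSA) and initializes queries dynamically via the QPN rather than using a fixed set.

import torch
import torch.nn as nn

class ResamplerSketch(nn.Module):
    """Illustrative query-based resampler (hypothetical, not the paper's code):
    learnable queries cross-attend to frozen visual features, compressing
    many patch tokens into a small, fixed number of visual tokens."""

    def __init__(self, num_queries=64, dim=1024, num_heads=8):
        super().__init__()
        # Fixed learnable queries; TextHawk instead proposes queries via a QPN.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, num_patches, dim), e.g. 576 patch tokens
        # from a frozen ViT-style encoder.
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.cross_attn(q, visual_feats, visual_feats)
        return self.norm(out)  # (batch, num_queries, dim): compressed tokens

feats = torch.randn(2, 576, 1024)     # dummy encoder features
tokens = ResamplerSketch()(feats)
print(tokens.shape)                    # torch.Size([2, 64, 1024])

The motivation for this kind of compression is sequence length: when a high-resolution document image is split into many sub-images, passing every patch token to the LLM would be prohibitively long, so each sub-image is first reduced to a small number of visual tokens.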