TextHawk is a Multimodal Large Language Model (MLLM) designed for document-oriented tasks while retaining general vision-language capabilities. It introduces four key components: ReSampling and ReArrangement (ReSA) to reduce redundancy and compress visual tokens, Scalable Positional Embeddings (SPEs) to encode sub-image positions, a Query Proposal Network (QPN) to initialize resampler queries dynamically, and a Multi-Level Cross-Attention (MLCA) mechanism to enhance fine-grained visual perception.

Architecturally, TextHawk combines a frozen visual encoder, a resampler, and a large language model equipped with LoRA and a detection head. The resampler compresses and rearranges visual information, SPEs and QPN improve positional encoding and query initialization, and MLCA exploits the hierarchical structure and semantic relations of the visual features to sharpen perception.

TextHawk is also trained on a new instruction-tuning dataset enriched with Gemini Pro. Extensive experiments on general and document-oriented benchmarks show that it outperforms prior state-of-the-art methods, demonstrating strong fine-grained document perception, effective handling of high-resolution document images, and solid general vision-language abilities.
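To make the token-compression pipeline concrete, here is a minimal sketch of a cross-attention resampler whose queries are proposed dynamically from the visual features, in the spirit of the ReSA resampling step and the QPN described above. All class names, the pooling-based query proposal, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: Q-Former-style resampler with dynamic query proposal.
# Names (Resampler, QueryProposalNetwork) and all sizes are assumptions.
import torch
import torch.nn as nn

class QueryProposalNetwork(nn.Module):
    """Proposes resampler queries from the visual features themselves
    (assumed design: global-pool the patch features, project to queries)."""
    def __init__(self, vis_dim: int, query_dim: int, num_queries: int):
        super().__init__()
        self.num_queries = num_queries
        self.query_dim = query_dim
        self.proj = nn.Linear(vis_dim, num_queries * query_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N_patches, vis_dim) -> (B, num_queries, query_dim)
        pooled = vis_feats.mean(dim=1)  # global average pooling
        return self.proj(pooled).view(-1, self.num_queries, self.query_dim)

class Resampler(nn.Module):
    """Compresses N visual tokens into a smaller fixed set via
    cross-attention (the resampling half of ReSA, per the summary)."""
    def __init__(self, vis_dim: int, query_dim: int, num_queries: int,
                 num_heads: int = 8):
        super().__init__()
        self.qpn = QueryProposalNetwork(vis_dim, query_dim, num_queries)
        self.kv_proj = nn.Linear(vis_dim, query_dim)
        self.attn = nn.MultiheadAttention(query_dim, num_heads,
                                          batch_first=True)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        q = self.qpn(vis_feats)          # dynamic queries (QPN)
        kv = self.kv_proj(vis_feats)     # keys/values from visual tokens
        out, _ = self.attn(q, kv, kv)    # (B, num_queries, query_dim)
        return out

# Usage: compress 1024 ViT patch tokens into 64 tokens for the LLM.
feats = torch.randn(2, 1024, 768)
tokens = Resampler(vis_dim=768, query_dim=512, num_queries=64)(feats)
print(tokens.shape)  # torch.Size([2, 64, 512])
```

The compression ratio here (1024 to 64 tokens) is only an example; the point is that query count, not patch count, fixes the number of visual tokens the language model sees.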
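The SPE idea of encoding sub-image positions for variable crop layouts can likewise be sketched as a small learned grid of position embeddings rescaled to the current sub-image arrangement. The bilinear interpolation scheme and the base grid size below are assumptions for illustration, not the paper's exact method.

```python
# Hypothetical sketch: scalable positional embeddings for sub-image grids.
# The interpolation scheme and base_grid size are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalablePositionalEmbedding(nn.Module):
    def __init__(self, dim: int, base_grid: int = 4):
        super().__init__()
        # Learned embeddings for a fixed base_grid x base_grid layout.
        self.base = nn.Parameter(torch.randn(1, dim, base_grid, base_grid) * 0.02)

    def forward(self, rows: int, cols: int) -> torch.Tensor:
        # Bilinearly rescale to the actual (rows x cols) sub-image layout,
        # so any crop grid gets a consistent set of position embeddings.
        spe = F.interpolate(self.base, size=(rows, cols),
                            mode="bilinear", align_corners=False)
        return spe.flatten(2).transpose(1, 2)  # (1, rows*cols, dim)

# Usage: one embedding per sub-image in a 3x5 crop grid of a document page.
spe = ScalablePositionalEmbedding(dim=512)
print(spe(3, 5).shape)  # torch.Size([1, 15, 512])
```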