TextMonkey is a large multimodal model designed for text-centric tasks. It enhances cross-window connectivity through Shifted Window Attention with zero initialization, reduces token redundancy by filtering for significant tokens, and improves performance through text spotting and grounding. By incorporating positional information into its responses, the model improves interpretability and, with fine-tuning, can also handle screenshot tasks.

Evaluation on 12 benchmarks shows notable gains: 5.2% on scene text tasks, 6.9% on document-oriented tasks, and 2.8% on key information extraction. TextMonkey surpasses previous models on OCRBench with a score of 561 and improves scene text spotting by 10.9%.

TextMonkey's methodology comprises a Split Module for processing high-resolution images, Shifted Window Attention for maintaining cross-window connections, and a Token Resampler that compresses redundant tokens while retaining important information. Together these components let the model process high-resolution inputs efficiently and leverage positional cues across a range of text-related tasks, demonstrating strong performance in document analysis and understanding.
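The pipeline described above (Split Module → window processing → Token Resampler) can be sketched roughly as follows. This is a minimal NumPy illustration, not TextMonkey's implementation: the window size, the number of kept tokens, and the choice of inverse mean cosine similarity as the "significance" score are all assumptions made for the example, and the cross-attention here is a single unparameterized head.

```python
import numpy as np

def split_into_windows(image, window=448):
    """Hypothetical Split Module: cut a high-resolution image (H, W, C)
    into non-overlapping window-sized sub-images so each fits the
    vision encoder's native input resolution."""
    H, W, C = image.shape
    return [image[y:y + window, x:x + window]
            for y in range(0, H - window + 1, window)
            for x in range(0, W - window + 1, window)]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_resampler(tokens, keep=64):
    """Sketch of similarity-based token filtering plus resampling.

    Assumption: a token whose mean cosine similarity to all other
    tokens is low is treated as "significant" (less redundant) and
    retained; the retained tokens then cross-attend over the full
    set so information from dropped tokens is aggregated, not lost.
    """
    n, d = tokens.shape
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = t @ t.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)         # ignore self-similarity
    redundancy = sim.mean(axis=1)      # high = similar to many others
    idx = np.argsort(redundancy)[:keep]
    queries = tokens[idx]              # significant tokens as queries
    attn = softmax(queries @ tokens.T / np.sqrt(d))
    return attn @ tokens               # (keep, d) compressed tokens

# Usage: a 896x896 image splits into four 448x448 windows; 256 image
# tokens are compressed down to 64.
image = np.zeros((896, 896, 3))
windows = split_into_windows(image)
rng = np.random.default_rng(0)
compressed = token_resampler(rng.normal(size=(256, 32)), keep=64)
```

In this sketch the resampler shrinks the token sequence by 4x while the attention step keeps every output token a convex combination of all inputs, which is the general idea behind compressing redundant tokens without simply discarding them.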