TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document


15 Mar 2024 | Yuliang Liu, IEEE Member, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai*, IEEE Senior Member
TextMonkey is a large multimodal model designed for text-centric tasks. It enhances cross-window connectivity through Shifted Window Attention with zero-initialization, reduces token redundancy by filtering significant tokens, and improves performance through text spotting and grounding. The model also incorporates positional information into its responses, which enhances interpretability and enables screenshot tasks via fine-tuning.

Evaluation on 12 benchmarks shows significant improvements: 5.2% on scene text tasks, 6.9% on document-oriented tasks, and 2.8% on key information extraction. TextMonkey outperforms previous models on OCRBench, achieving a score of 561, and excels in scene text spotting with a 10.9% improvement.

TextMonkey's methodology includes a Split Module for high-resolution image processing, Shifted Window Attention for cross-window connections, and a Token Resampler to compress redundant tokens. The model processes high-resolution images efficiently, maintains cross-window relationships, and reduces token length, supporting a variety of text-related tasks by leveraging positional cues. It achieves strong performance across multiple benchmarks and demonstrates effectiveness in document analysis and understanding.
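To illustrate what a split module for high-resolution inputs might do, the sketch below divides an image into fixed-size sub-images. This is a minimal illustration only: the 448-pixel window size, zero padding, and function name are assumptions for the example, not TextMonkey's exact configuration.

```python
import numpy as np

def split_image(image: np.ndarray, window: int = 448):
    """Split a high-resolution image (H, W, C) into non-overlapping
    window x window sub-images, zero-padding the borders as needed.

    Illustrative sketch of a split module; the window size and padding
    scheme are assumptions, not the paper's implementation.
    """
    h, w, c = image.shape
    ph = (window - h % window) % window   # rows of padding needed
    pw = (window - w % window) % window   # cols of padding needed
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)))
    rows, cols = padded.shape[0] // window, padded.shape[1] // window
    tiles = (padded
             .reshape(rows, window, cols, window, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(rows * cols, window, window, c))
    return tiles, (rows, cols)

# Example: a 1000x600 image becomes a 3x2 grid of 448x448 tiles.
img = np.zeros((1000, 600, 3))
tiles, grid = split_image(img)
```

Each tile can then be encoded independently, which is why a mechanism for cross-window connections becomes necessary afterward.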
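A common way to add a new pathway (such as cross-window attention) to a pretrained model without disturbing it at the start of training is to gate the new branch's output through a parameter initialized to zero. The sketch below shows only that gating idea; the class name and scalar gate are assumptions, not the paper's shifted-window implementation.

```python
import numpy as np

class ZeroInitGate:
    """Add a new branch's output through a scalar that starts at zero,
    so training begins with the pretrained pathway unchanged.

    Illustrative only: TextMonkey's zero-initialization applies to its
    shifted-window attention layers, whose details are not shown here.
    """

    def __init__(self):
        self.alpha = 0.0  # learnable in a real model; zero at init

    def __call__(self, base: np.ndarray, branch: np.ndarray) -> np.ndarray:
        return base + self.alpha * branch

gate = ZeroInitGate()
base = np.array([1.0, 2.0, 3.0])        # pretrained pathway output
branch = np.array([10.0, 10.0, 10.0])   # new cross-window branch
out = gate(base, branch)                # identical to base at init
```

Because the gate starts at zero, the model's initial behavior matches the pretrained backbone, and the new connectivity is learned gradually.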
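The token-redundancy idea can be illustrated with a greedy filter that keeps mutually dissimilar tokens. This is a hedged stand-in for the Token Resampler: the function name, the choice of seeding with the first token, and the cosine-similarity criterion are all assumptions for the example, not the paper's algorithm.

```python
import numpy as np

def filter_significant_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Greedily select `keep` tokens that are least redundant.

    At each step, keep the candidate whose maximum cosine similarity to
    the already-selected set is smallest. Illustrative stand-in for a
    token resampler; not TextMonkey's actual method.
    """
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                      # pairwise cosine similarity
    selected = [0]                               # seed with the first token
    while len(selected) < keep:
        max_sim = sim[:, selected].max(axis=1)   # redundancy w.r.t. selection
        max_sim[selected] = np.inf               # never re-pick a token
        selected.append(int(np.argmin(max_sim)))
    return np.array(sorted(selected))

# Four tokens where indices 0, 1, and 3 are duplicates of each other:
toks = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
kept = filter_significant_tokens(toks, keep=2)
```

The duplicated tokens are filtered out and only two distinct ones survive, shrinking the sequence the language model must attend over.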