TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document


15 Mar 2024 | Yuliang Liu, IEEE Member, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai*, IEEE Senior Member
TextMonkey is a large multimodal model (LMM) designed for text-centric tasks, particularly document analysis and scene text understanding. The model introduces several enhancements to improve performance and interpretability:

1. **Shifted Window Attention with Zero Initialization**: Enables cross-window connectivity at higher input resolutions, with zero-initialized layers that stabilize early training and improve the model's handling of high-resolution images (see the first sketch after this summary).
2. **Token Compression**: Uses similarity to filter out redundant tokens, reducing token length while maintaining or even improving performance (see the second sketch below).
3. **Text Grounding**: Supports text grounding tasks, improving the model's interpretability and reliability.
4. **Finetuning for Screenshot Tasks**: TextMonkey can be fine-tuned to understand and respond to screenshot-clicking commands.

Evaluation on 12 benchmarks shows notable average improvements:

- 5.2% on scene-text-centric tasks (e.g., STVQA, TextVQA, OCRVQA);
- 6.9% on document-oriented tasks (e.g., DocVQA, InfoVQA, ChartQA);
- 2.8% on key information extraction tasks (e.g., FUNSD, SROIE, POIE).

TextMonkey also sets a new standard on OCRBench, a comprehensive benchmark comprising 29 OCR-related assessments, achieving a score of 561 and surpassing previous large multimodal models for document understanding. The code is available at <https://github.com/Yuliang-Liu/Monkey>.
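To make the first enhancement concrete, below is a minimal PyTorch sketch of a shifted-window attention block whose output projection is zero-initialized, so the new cross-window path starts as an identity residual and is learned gradually. This is an illustrative reconstruction, not TextMonkey's actual code: the class name, tensor layout, and the use of `nn.MultiheadAttention` are assumptions.

```python
import torch
import torch.nn as nn

class ZeroInitShiftedWindowAttention(nn.Module):
    """Sketch: windowed self-attention over a feature map, with the window
    grid cyclically shifted for cross-window connectivity and the output
    projection zero-initialized so the block is a no-op at the start of
    training. Names and shapes are illustrative, not TextMonkey's code."""

    def __init__(self, dim: int, num_heads: int, window: int, shift: int):
        super().__init__()
        self.window, self.shift = window, shift
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero init: the block contributes
        nn.init.zeros_(self.proj.bias)    # nothing until training updates it

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W assumed divisible by window.
        B, H, W, C = x.shape
        w = self.window
        # Cyclic shift so windows straddle the original window borders.
        x_s = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        # Partition into (B * num_windows, w*w, C) token groups.
        t = x_s.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        t = t.reshape(-1, w * w, C)
        out, _ = self.attn(t, t, t)       # attention within each window
        out = self.proj(out)              # zero-initialized projection
        # Undo the window partition and the cyclic shift.
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        out = out.reshape(B, H, W, C)
        out = torch.roll(out, shifts=(self.shift, self.shift), dims=(1, 2))
        return x + out                    # residual: exact identity at init
```

Because `proj` is zero-initialized, the block initially returns its input unchanged, so adding the cross-window path at a higher resolution does not perturb the pretrained backbone early in training.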
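The token-compression step can be sketched in the same spirit: score each visual token by its maximum cosine similarity to the other tokens and drop the most redundant ones. The cosine criterion and the top-k selection here are assumptions for illustration; TextMonkey's actual resampler may select and recombine tokens differently.

```python
import torch

def filter_redundant_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Hedged sketch of similarity-based token compression.

    tokens: (N, C) visual tokens; keep: number of tokens to retain.
    Each token is scored by its maximum cosine similarity to any *other*
    token; the `keep` least-redundant tokens are retained in their
    original order. The scoring rule is an illustrative assumption.
    """
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = normed @ normed.T                 # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))       # ignore self-similarity
    redundancy = sim.max(dim=-1).values     # how duplicated each token is
    idx = redundancy.topk(keep, largest=False).indices
    return tokens[idx.sort().values]        # keep original token order

# Example (hypothetical sizes): compress 1024 visual tokens down to 256.
feats = torch.randn(1024, 768)
compressed = filter_redundant_tokens(feats, keep=256)
print(compressed.shape)  # torch.Size([256, 768])
```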