TextMonkey is a large multimodal model (LMM) designed for text-centric tasks, particularly in document analysis and scene text understanding. The model introduces several enhancements to improve performance and interpretability:
1. **Shifted Window Attention with Zero Initialization**: Shifted windows provide cross-window connectivity at higher input resolutions, while zero-initializing the newly added attention layers stabilizes early training; together these let the model handle high-resolution images (a minimal sketch follows this list).
2. **Token Compression**: Using inter-token similarity to filter out redundant tokens, the model shortens its token sequence while maintaining or even improving performance (see the second sketch below).
3. **Text Grounding**: The model can ground its answers to positions in the image, making responses verifiable and enhancing interpretability and reliability (a parsing example follows the list).
4. **Fine-tuning for Screenshot Tasks**: TextMonkey can be fine-tuned to follow natural-language commands and predict click positions in app screenshots.
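To make the first enhancement concrete, here is a minimal PyTorch sketch of shifted window attention with a zero-initialized output projection. It is an illustrative reconstruction, not TextMonkey's actual code: the window size, shift, and head count are assumed values, and the module is meant to sit on a residual branch, where the zero-initialized projection makes it a no-op at the start of training.

```python
import torch
import torch.nn as nn

class ShiftedWindowAttention(nn.Module):
    """Sketch: window attention with a cyclic shift for cross-window
    connectivity; the output projection is zero-initialized so the
    residual branch contributes nothing at the start of training."""

    def __init__(self, dim, window=8, shift=4, heads=8):
        super().__init__()
        self.window, self.shift = window, shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero init stabilizes early training
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # cyclic shift so adjacent windows exchange information
        x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        # partition into non-overlapping w x w windows
        xw = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, C)
        out, _ = self.attn(xw, xw, xw)
        out = self.proj(out)  # exactly zero at initialization
        # undo the window partition and the cyclic shift
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        out = out.reshape(B, H, W, C)
        return torch.roll(out, shifts=(self.shift, self.shift), dims=(1, 2))

# usage on a residual branch: identity at step 0, learned mixing later
x = torch.randn(1, 32, 32, 768)
y = x + ShiftedWindowAttention(768)(x)
```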
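Similarity-based token filtering can likewise be sketched generically. The function below is a simple heuristic under our own assumptions (rank tokens by their maximum cosine similarity to the rest and drop the most redundant fraction); the paper's Token Resampler is more involved, and this sketch only illustrates the filtering idea.

```python
import torch
import torch.nn.functional as F

def compress_tokens(tokens, keep_ratio=0.5):
    """Heuristic sketch of similarity-based token compression
    (not TextMonkey's exact Token Resampler): a token whose nearest
    neighbor is very similar carries little new information, so we
    keep the least redundant fraction of tokens."""
    # tokens: (N, C)
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.t()                         # (N, N) cosine similarities
    sim.fill_diagonal_(-1.0)                # ignore self-similarity
    redundancy = sim.max(dim=-1).values     # similarity to nearest neighbor
    k = max(1, int(tokens.size(0) * keep_ratio))
    keep = redundancy.topk(k, largest=False).indices.sort().values
    return tokens[keep], keep               # preserve original token order

# usage: compress 256 vision tokens down to 128
feats = torch.randn(256, 1024)
kept, idx = compress_tokens(feats, keep_ratio=0.5)
```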
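For the grounding and screenshot-clicking items, the model's answers carry coordinates embedded in text, so a small parser is handy downstream. The snippet assumes a Qwen-VL-style markup (`<ref>…</ref><box>(x1,y1),(x2,y2)</box>` with coordinates normalized to 0-1000), the convention Monkey-family models inherit; the answer string itself is hypothetical.

```python
import re

# Hypothetical grounded answer in the assumed Qwen-VL-style markup,
# with coordinates normalized to the 0-1000 range.
answer = "<ref>Total: $42.10</ref><box>(112,604),(388,637)</box>"

BOX = re.compile(r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_grounding(text, img_w, img_h):
    """Extract (label, pixel-box) pairs from a grounded response."""
    out = []
    for m in BOX.finditer(text):
        label = m.group(1)
        x1, y1, x2, y2 = (int(v) for v in m.groups()[1:])
        # rescale from the 0-1000 normalized space to pixel coordinates
        out.append((label, (x1 * img_w // 1000, y1 * img_h // 1000,
                            x2 * img_w // 1000, y2 * img_h // 1000)))
    return out

print(parse_grounding(answer, img_w=1000, img_h=1000))
```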
Evaluation on 12 benchmarks shows significant improvements:
- 5.2% in Scene Text-Centric tasks (e.g., STVQA, TextVQA, OCRVQA).
- 6.9% in Document-Oriented tasks (e.g., DocVQA, InfoVQA, ChartQA).
- 2.8% in Key Information Extraction tasks (e.g., FUNSD, SROIE, POIE).
TextMonkey also sets a new standard on OCRBench, a comprehensive benchmark with 29 OCR-related assessments, achieving a score of 561, surpassing previous large multimodal models for document understanding. The code for TextMonkey is available at <https://github.com/Yuliang-Liu/Monkey>.