mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

19 Mar 2024 | Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
The paper "mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding" addresses the critical role of structure information in understanding text-rich images, such as documents, tables, and charts. It proposes a Unified Structure Learning approach to enhance the performance of Multimodal Large Language Models (MLLMs) in visual document understanding. The approach includes structure-aware parsing tasks and multi-grained text localization tasks across five domains: document, webpage, table, chart, and natural image. To better encode structure information, the authors design the H-Reducer, a simple and effective vision-to-text module that maintains layout information and reduces visual feature length. They also construct a comprehensive training set, DocStruct4M, and a high-quality reasoning tuning dataset, DocReason25K, to support unified structure learning and trigger detailed explanation abilities. The proposed model, DocOwl 1.5, achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the performance of similar-sized MLLMs by more than 10 points in five out of ten benchmarks. The contributions of the work include the proposal of Unified Structure Learning, the design of the H-Reducer, the construction of comprehensive datasets, and the superior performance of DocOwl 1.5 in OCR-free visual document understanding tasks.The paper "mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding" addresses the critical role of structure information in understanding text-rich images, such as documents, tables, and charts. It proposes a Unified Structure Learning approach to enhance the performance of Multimodal Large Language Models (MLLMs) in visual document understanding. The approach includes structure-aware parsing tasks and multi-grained text localization tasks across five domains: document, webpage, table, chart, and natural image. To better encode structure information, the authors design the H-Reducer, a simple and effective vision-to-text module that maintains layout information and reduces visual feature length. They also construct a comprehensive training set, DocStruct4M, and a high-quality reasoning tuning dataset, DocReason25K, to support unified structure learning and trigger detailed explanation abilities. The proposed model, DocOwl 1.5, achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the performance of similar-sized MLLMs by more than 10 points in five out of ten benchmarks. The contributions of the work include the proposal of Unified Structure Learning, the design of the H-Reducer, the construction of comprehensive datasets, and the superior performance of DocOwl 1.5 in OCR-free visual document understanding tasks.