Fox is a novel approach to fine-grained multi-page document understanding with Large Vision-Language Models (LVLMs). The paper proposes a pipeline, hybrid data, and a tuning strategy that enable LVLMs to focus on any region of single- or multi-page documents. The key contributions are: a novel task that enhances document understanding by focusing on document-level regions; a method for synthesizing cross-vocabulary vision data to activate multiple visual vocabularies; and a solution that supports multi-column formats and multiple pages. Fox can be tuned efficiently on multi-page documents without modifying the weights of the multiple vision vocabularies. The paper also introduces a benchmark with 9 fine-grained sub-tasks to promote document analysis.

Experimental results show that Fox outperforms other LVLMs on multi-page document understanding tasks. The model handles region-level OCR, line-level OCR, color-guided OCR, paragraph summarization, paragraph translation, and document layout analysis. Fox also supports cross-page region-of-interest (RoI) understanding and can return OCR results for multiple pages in a single-turn conversation. It is robust to different document formats and can handle complex multi-column layouts.

The paper also reviews related work in visual document understanding, large language models, and large vision-language models. Fox is designed to be user-friendly and efficient, with a focus on fine-grained document understanding. Evaluated on various datasets, the model shows strong performance on tasks such as dense text recognition, in-document figure captioning, and cross-page VQA. The results demonstrate that Fox achieves high accuracy on these tasks, making it a promising approach for multi-page document understanding.
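To make the region-level focusing concrete, the sketch below shows one way a fine-grained, region-guided OCR instruction could be composed for an LVLM of this kind. The prompt template, tag names, and the [0, 1000] coordinate normalization are illustrative assumptions common to several vision-language models, not the paper's actual interface.

```python
# Hypothetical sketch: building a single-turn, region-level OCR prompt.
# The template and normalization scheme are assumptions for illustration,
# not Fox's documented API.

def normalize_box(box, page_w, page_h):
    """Scale a pixel-space box (x1, y1, x2, y2) to [0, 1000] integers,
    a common convention for passing coordinates to vision-language models."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / page_w * 1000),
        round(y1 / page_h * 1000),
        round(x2 / page_w * 1000),
        round(y2 / page_h * 1000),
    )

def region_ocr_prompt(page_index, box, page_w, page_h):
    """Build an instruction asking the model to transcribe only the text
    inside the given region of the given page."""
    nx1, ny1, nx2, ny2 = normalize_box(box, page_w, page_h)
    return (
        f"<page {page_index}> OCR the text within the region "
        f"[{nx1}, {ny1}, {nx2}, {ny2}] and return it verbatim."
    )

# Example: a region on page 2 of a US-letter page rendered at 612x792 px.
prompt = region_ocr_prompt(2, (100, 250, 550, 400), page_w=612, page_h=792)
print(prompt)
```

A cross-page query could reuse the same template with multiple `<page …>` tags in one turn, which matches the paper's claim that Fox answers multi-page requests in a single-turn conversation.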