Fox is a novel approach to fine-grained multi-page document understanding with Large Vision-Language Models (LVLMs). The paper proposes a pipeline, hybrid data, and a tuning strategy that enable LVLMs to focus on any region of single- or multi-page documents. The key contributions are: a novel task that enhances document understanding by focusing on document-level regions; a method for synthesizing cross-vocabulary vision data to activate multiple visual vocabularies; and a solution that supports multi-column formats and multiple pages. Fox can be tuned efficiently on multi-page documents without modifying the weights of the multiple vision vocabularies. The paper also introduces a benchmark with 9 fine-grained sub-tasks to promote document analysis.

Experimental results show that Fox outperforms other LVLMs on multi-page document understanding tasks. The model handles region-level OCR, line-level OCR, color-guided OCR, paragraph summarization, paragraph translation, and document layout analysis. Fox also supports cross-page region-of-interest (RoI) understanding and can return OCR results for multiple pages in a single-turn conversation. It is robust to different document formats and can handle complex multi-column layouts.

The paper also reviews related work in visual document understanding, large language models, and large vision-language models. Fox is designed to be user-friendly and efficient, with a focus on fine-grained document understanding. Evaluated on various datasets, the model shows strong performance on tasks such as dense text recognition, in-document figure captioning, and cross-page VQA. The results demonstrate that Fox achieves high accuracy on these tasks, making it a promising approach for multi-page document understanding.
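To make the region-level focusing concrete, the sketch below shows one way a fine-grained, region-guided OCR instruction could be composed for an LVLM of this kind. The prompt template, tag names, and the [0, 1000] coordinate normalization are illustrative assumptions common to several vision-language models, not the paper's actual interface.

```python
# Hypothetical sketch: building a single-turn, region-level OCR prompt.
# The template and normalization scheme are assumptions for illustration,
# not Fox's documented API.

def normalize_box(box, page_w, page_h):
    """Scale a pixel-space box (x1, y1, x2, y2) to [0, 1000] integers,
    a common convention for passing coordinates to vision-language models."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / page_w * 1000),
        round(y1 / page_h * 1000),
        round(x2 / page_w * 1000),
        round(y2 / page_h * 1000),
    )

def region_ocr_prompt(page_index, box, page_w, page_h):
    """Build an instruction asking the model to transcribe only the text
    inside the given region of the given page."""
    nx1, ny1, nx2, ny2 = normalize_box(box, page_w, page_h)
    return (
        f"<page {page_index}> OCR the text within the region "
        f"[{nx1}, {ny1}, {nx2}, {ny2}] and return it verbatim."
    )

# Example: a region on page 2 of a US-letter page rendered at 612x792 px.
prompt = region_ocr_prompt(2, (100, 250, 550, 400), page_w=612, page_h=792)
print(prompt)
```

A cross-page query could reuse the same template with multiple `<page …>` tags in one turn, which matches the paper's claim that Fox answers multi-page requests in a single-turn conversation.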