GRAM: Global Reasoning for Multi-Page VQA

18 Mar 2024 | Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman
GRAM is a method that extends pretrained single-page models to the multi-page setting without requiring computationally heavy pretraining. It leverages a single-page encoder for local, page-level understanding and enhances it with designated document-level layers and learnable tokens that let information flow across pages for global reasoning. To ensure the model makes use of the newly introduced document tokens, a tailored bias-adaptation method is proposed. For computational savings during decoding, an optional compression stage based on a compression transformer (C-Former) reduces the encoded sequence length, allowing a trade-off between quality and latency. Extensive experiments show that GRAM achieves state-of-the-art performance on multi-page DocVQA benchmarks.
GRAM introduces learnable document tokens and bias adaptation to enable effective communication between individual pages, supporting reasoning over multi-page documents. The C-Former module distills the multi-page sequence into a more compact representation, providing a trade-off between accuracy and compute. GRAM achieves state-of-the-art results on the MPDocVQA and DUDE datasets, and extensive ablations validate each component of the method.

GRAM's architecture centers on a bi-level global-local encoder that allows information to flow between pages. The encoder processes each page separately, together with a set of learnable doc tokens, and lets the doc tokens from all pages communicate with one another; a sketch of one such block follows below. After M such blocks, the encoded features from all pages are fed into the decoder to produce the final output. This global-local reasoning mechanism enables the model to understand the content of each page and to combine information across the pages of the document.

To address the problem of long sequences, GRAM segments the document into pages, its semantically logical parts. Cross-page interaction is restricted to the doc tokens alone, avoiding attention whose cost grows quadratically with the page count. The optional C-Former compression stage further reduces the encoded sequence length, allowing a trade-off between quality and latency.

GRAM outperforms existing methods on the MPDocVQA and DUDE datasets, achieving state-of-the-art results, with particularly strong performance on questions that require reasoning across multiple pages. The method balances accuracy and computational efficiency.
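To make the bi-level block concrete, here is a minimal PyTorch sketch of the global-local pattern described above: local self-attention within each page over its tokens plus that page's doc tokens, followed by global self-attention among the doc tokens of all pages. The names (GlobalLocalBlock, num_doc_tokens) are illustrative, and standard transformer layers stand in for the paper's actual encoder layers.

```python
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    """One of the M blocks: local attention within each page (page tokens +
    that page's doc tokens), then global attention among doc tokens only."""

    def __init__(self, d_model=512, n_heads=8, num_doc_tokens=16):
        super().__init__()
        self.num_doc_tokens = num_doc_tokens
        self.local_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.global_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, pages, doc_tokens):
        # pages:      (num_pages, page_len, d_model) -- one row per page
        # doc_tokens: (num_pages, num_doc_tokens, d_model)
        # 1) Local: each page attends over [doc tokens; page tokens].
        x = torch.cat([doc_tokens, pages], dim=1)
        x = self.local_layer(x)
        doc_tokens, pages = x[:, :self.num_doc_tokens], x[:, self.num_doc_tokens:]
        # 2) Global: doc tokens from all pages attend to one another, so the
        #    cross-page cost depends on pages * num_doc_tokens rather than
        #    growing quadratically with the full document length.
        P, T, D = doc_tokens.shape
        flat = self.global_layer(doc_tokens.reshape(1, P * T, D))
        return pages, flat.reshape(P, T, D)

# Usage: stack M such blocks, then feed all encoded page features to the decoder.
block = GlobalLocalBlock()
pages = torch.randn(3, 128, 512)   # 3 pages, 128 tokens each
doc = torch.randn(3, 16, 512)      # 16 learnable doc tokens per page
pages, doc = block(pages, doc)
```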
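The bias-adaptation step is only described at a high level here; the sketch below shows one plausible reading, in which a learnable per-head additive bias is applied to the attention logits wherever the key position is a doc token, nudging the pretrained attention toward the newly introduced tokens. The class name DocTokenBias and this exact formulation are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DocTokenBias(nn.Module):
    """Hypothetical bias adaptation: a learned per-head offset added to
    attention logits at doc-token key positions, so the pretrained model
    does not simply ignore the new tokens."""

    def __init__(self, n_heads=8):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_logits, doc_token_mask):
        # attn_logits:    (batch, heads, query_len, key_len)
        # doc_token_mask: (key_len,) bool, True where the key is a doc token
        return attn_logits + self.bias[None, :, None, None] * doc_token_mask

logits = torch.randn(1, 8, 144, 144)
mask = torch.zeros(144, dtype=torch.bool)
mask[:16] = True                     # first 16 keys are doc tokens
adapted = DocTokenBias()(logits, mask)
```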
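Finally, a minimal sketch of the optional compression stage, assuming a design in which a fixed number of learnable query vectors cross-attend to the full multi-page encoding and only those compressed vectors reach the decoder. The name CFormer matches the text, but the internals shown (a small nn.TransformerDecoder over learnable queries) are an assumption.

```python
import torch
import torch.nn as nn

class CFormer(nn.Module):
    """Compresses the concatenated multi-page encoding into num_queries
    vectors via cross-attention from learnable queries."""

    def __init__(self, d_model=512, n_heads=8, num_queries=256, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.compressor = nn.TransformerDecoder(layer, num_layers)

    def forward(self, encoded):
        # encoded: (batch, total_len, d_model), all per-page features concatenated
        q = self.queries.unsqueeze(0).expand(encoded.size(0), -1, -1)
        # Cross-attention distills the long sequence into num_queries vectors;
        # decoder cost then scales with num_queries instead of total_len.
        return self.compressor(q, encoded)

encoded = torch.randn(1, 20 * 700, 512)   # e.g. 20 pages x 700 tokens each
compressed = CFormer()(encoded)           # (1, 256, 512)
```

Fewer queries mean lower decoding latency at the cost of more aggressive compression, which is the quality/latency trade-off described above.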