18 Mar 2024 | Tsachi Blau*, Sharon Fogel, Roi Ronen*, Alona Golts†, Roy Ganz*, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman†
The paper introduces GRAM (Global Reasoning for Multi-Page VQA), a method that extends pre-trained single-page document understanding models to multi-page documents without requiring additional, computationally intensive pre-training. GRAM uses a single-page encoder for local, page-level understanding and augments it with designated document-level layers and learnable tokens that let information flow across pages for global reasoning. To enforce the use of these document tokens, a tailored bias-adaptation method is proposed. An optional compression stage, the C-Former, reduces the encoded sequence length, trading off quality against latency during decoding. Extensive experiments show that GRAM achieves state-of-the-art results on multi-page DocVQA benchmarks, demonstrating its effectiveness on multi-page documents while preserving single-page performance.
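To make the global-local design concrete, below is a minimal PyTorch sketch of one interleaved encoder block: a page-level layer processes each page independently, then a document-level layer attends only over learnable document tokens gathered from all pages. This is an illustrative reading of the abstract, not the authors' implementation; the class and parameter names (GRAMBlock, n_doc_tokens) are assumptions, and the paper's bias adaptation and C-Former compression are omitted.

```python
import torch
import torch.nn as nn

class GRAMBlock(nn.Module):
    """Illustrative global-local encoder block (assumed structure).

    Each page's sequence starts with n_doc_tokens learnable document
    tokens followed by its page tokens. The page layer runs per page;
    the document layer mixes the document tokens across all pages.
    """

    def __init__(self, d_model=768, n_heads=12, n_doc_tokens=8):
        super().__init__()
        # Local layer: applied to each page independently (pages act as the batch).
        self.page_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Global layer: operates only on the concatenated document tokens.
        self.doc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_doc_tokens = n_doc_tokens

    def forward(self, pages):
        # pages: (num_pages, n_doc_tokens + page_seq_len, d_model)
        pages = self.page_layer(pages)                      # local, page-level pass
        doc = pages[:, :self.n_doc_tokens, :]               # (P, T, D) doc tokens
        doc = doc.reshape(1, -1, doc.size(-1))              # concat tokens across pages
        doc = self.doc_layer(doc)                           # global, cross-page reasoning
        doc = doc.reshape(pages.size(0), self.n_doc_tokens, -1)
        # Write the updated document tokens back in front of each page's tokens.
        return torch.cat([doc, pages[:, self.n_doc_tokens:, :]], dim=1)

if __name__ == "__main__":
    blk = GRAMBlock()
    x = torch.randn(3, 8 + 196, 768)  # 3 pages, 8 doc tokens + 196 page tokens each
    print(blk(x).shape)               # torch.Size([3, 204, 768])
```

Stacking several such blocks lets page-level features and cross-page document tokens refine each other layer by layer, which is the intuition behind reusing a single-page encoder while still supporting global reasoning.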