29 Feb 2024 | Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner
DocFinQA is a long-document financial reasoning dataset that extends the FinQA dataset by incorporating full SEC filings, significantly increasing the average context length from under 700 words to 123,000 words. The dataset includes 5,735 training, 780 development, and 922 test samples derived from 801 unique SEC filings. It provides a more realistic evaluation of financial reasoning capabilities by requiring models to process long documents and generate Python code to answer questions. The dataset is publicly available at github.com/anonymous.
The dataset consists of multi-page documents rich with numeric data and tables, and each question is paired with a Python program that generates the answer. This enables the training and evaluation of program-synthesis models for financial workflows. Because DocFinQA extends FinQA's question-answer pairs with the full SEC filings they were drawn from, its contexts are more than a hundred times longer than FinQA's.
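As a rough illustration of this format, the sketch below shows what a sample and its paired program might look like. The field names and values are hypothetical assumptions for illustration, not the released dataset's actual schema:

```python
# Hypothetical DocFinQA-style sample; field names and values are
# illustrative assumptions, not the released schema.
sample = {
    "context": "...full SEC filing text, ~123,000 words on average...",
    "question": "What was the percentage change in revenue from 2008 to 2009?",
    # Each question is paired with a Python program that computes the answer.
    "program": (
        "revenue_2008 = 9362.0\n"
        "revenue_2009 = 9244.0\n"
        "answer = (revenue_2009 - revenue_2008) / revenue_2008 * 100"
    ),
    "answer": -1.26,
}

def execute_program(program: str) -> float:
    """Run a gold or model-generated program and return its `answer` variable."""
    namespace: dict = {}
    exec(program, namespace)  # fine for trusted dataset code; sandbox model output
    return namespace["answer"]

print(round(execute_program(sample["program"]), 2))  # -1.26
```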
The paper evaluates both retrieval-based pipelines and long-context language models on DocFinQA. It finds that even state-of-the-art systems struggle with the dataset, highlighting the need for further study of financial-domain nuances. Retrieval-free approaches are also evaluated, in which long-context LLMs such as GPT-3.5 and Mistral process the long documents directly and perform comparatively well.
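For intuition, here is a minimal sketch of how such a retrieval-based baseline could work: chunk the filing, score chunks against the question, and keep only the top-k as model context. TF-IDF similarity is used here as a stand-in scorer; the paper's actual retriever and chunking parameters may differ:

```python
# Hedged sketch of a chunk-and-retrieve baseline for long SEC filings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_chunks(document: str, question: str,
                    chunk_words: int = 300, k: int = 3) -> list[str]:
    """Split a long document into word-level chunks and return the k
    most question-relevant ones, in original document order."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    vectorizer = TfidfVectorizer().fit(chunks + [question])
    chunk_vecs = vectorizer.transform(chunks)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, chunk_vecs)[0]
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]

# The retrieved chunks would then be concatenated into the LLM prompt in
# place of the full ~123,000-word filing.
```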
The paper also presents a case study on the longest documents in DocFinQA, finding that models struggle most on these, which confirms the dataset as an effective and difficult test for long-document QA with substantial room for improvement. The evaluated models include Falcon, MPT, Llama 2, Code Llama, Mistral, and GPT-3.5. The results show that larger models outperform smaller ones, and that models trained on code yield higher accuracy than non-code models.
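Accuracy in this setting is naturally measured by executing the generated program and comparing its output to the gold answer. The sketch below shows one plausible way to score a prediction, assuming numeric answers and a relative tolerance; both are assumptions, not the paper's exact matching protocol:

```python
# Hedged sketch of execution-based scoring for generated programs.
import math

def is_correct(generated_program: str, gold_answer: float,
               rel_tol: float = 1e-2) -> bool:
    """Execute a generated program and compare its `answer` to the gold value."""
    namespace: dict = {}
    try:
        exec(generated_program, namespace)  # sandbox untrusted model code in practice
        predicted = float(namespace["answer"])
    except Exception:
        return False  # code that fails to execute counts as incorrect
    return math.isclose(predicted, gold_answer, rel_tol=rel_tol)

print(is_correct("answer = (9244.0 - 9362.0) / 9362.0 * 100", -1.26))  # True
```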
The paper also discusses the dataset's limitations, including the limited availability of complete SEC filings and the potential for false positives and false negatives in the generated code. It addresses broader impact and ethical considerations as well, with the authors noting that the work is not intended as criticism of any particular LLM; rather, the dataset is offered as a resource underscoring the need for more long-context benchmarks both within and beyond the financial domain.