29 Feb 2024 | Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, Chris Tanner
**DocFinQA: A Long-Context Financial Reasoning Dataset**
This paper introduces DocFinQA, a dataset for financial question answering that extends the existing FinQA dataset by incorporating full Securities and Exchange Commission (SEC) reports. The average context length grows from under 700 words in FinQA to roughly 123,000 words in DocFinQA, making the task far more realistic and challenging for large language models (LLMs). The dataset includes 5,735 training, 780 development, and 922 test samples, derived from 801 unique SEC filings.
The authors conduct extensive experiments on retrieval-based QA pipelines and long-context LLMs, finding that even state-of-the-art systems struggle with the long-document context. They also perform a case study on the longest documents in DocFinQA, highlighting the difficulties models face in handling such extensive contexts.
The paper discusses the challenges that financial analysis poses for LLMs, emphasizing the need for realistic datasets and tasks. It evaluates both retrieval-based and long-context LLM systems, using ColBERT, Sentence-BERT, and OpenAI's Ada embeddings for retrieval, and Falcon, MPT, Llama 2, CodeLlama, and Mistral for question answering. The results show that retrieval-based pipelines outperform retrieval-free approaches, and that fine-tuning and instruction tuning both improve performance.
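The retrieval-based pipelines described above share a common shape: split the long filing into chunks, score each chunk against the question with an embedding model, and pass only the top-ranked chunks to the QA model. The sketch below illustrates that shape in plain Python; the chunk sizes are illustrative, and the bag-of-words cosine similarity is a deliberately simple stand-in for learned retrievers like ColBERT or Sentence-BERT, not the paper's actual setup.

```python
import math
from collections import Counter


def chunk_document(text, chunk_size=50, overlap=10):
    """Split a long filing into overlapping word chunks (sizes are illustrative)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks


def cosine_sim(question, chunk):
    """Cosine similarity over bag-of-words counts — a toy stand-in for a
    learned embedding model such as Sentence-BERT or Ada."""
    va, vb = Counter(question.lower().split()), Counter(chunk.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question, chunks, k=3):
    """Return the top-k chunks most relevant to the question; in the full
    pipeline these would be concatenated into the QA model's prompt."""
    ranked = sorted(chunks, key=lambda c: cosine_sim(question, c), reverse=True)
    return ranked[:k]
```

In a complete system, the retrieved chunks would be formatted into a prompt and handed to one of the QA models (e.g. Mistral or Llama 2); the retriever's job is only to shrink a ~123,000-word document down to a context the model can actually attend to.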
The paper concludes by highlighting the significance of DocFinQA in advancing research on quantitative financial question answering and the need for more long-context benchmarks. It also addresses limitations, such as the fact that the training and development sets were not manually validated, and ethical considerations, emphasizing transparency and the lack of significant risks associated with the dataset.