**FinTextQA: A Dataset for Long-form Financial Question Answering**
This paper introduces *FinTextQA*, a novel dataset designed for long-form question answering (LFQA) in the finance domain. The dataset comprises 1,262 high-quality, source-attributed QA pairs extracted from finance textbooks and government agency websites, covering six question types with an average text length of 19.7k words. *FinTextQA* is the first LFQA dataset specifically tailored for finance, addressing the lack of scope diversity and question complexity in existing financial QA datasets.
The authors also develop a Retrieval-Augmented Generation (RAG)-based LFQA system consisting of an embedder, retriever, reranker, and generator (a minimal sketch of this pipeline follows the findings below). They conduct a comprehensive evaluation using human ranking, automatic metrics, and GPT-4 scoring to benchmark different system configurations under noisy conditions. Key findings include:
1. **Model Performance**: Baichuan2-7B competes closely with GPT-3.5-turbo in accuracy.
2. **Best System Configuration**: The most effective configuration involves Ada2 as the embedder, Automated Merged Retrieval as the retriever, Bge-Reranker-Base as the reranker, and Baichuan2-7B as the generator.
3. **Noise Resistance**: Models become less susceptible to noise once the context length exceeds a certain threshold.
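To make the embed → retrieve → rerank → generate flow concrete, here is a minimal, illustrative Python sketch. The four callables are hypothetical placeholders standing in for the kinds of components the paper benchmarks (e.g. an Ada2-style embedder, a retriever such as Automated Merged Retrieval, a Bge-Reranker-Base-style reranker, and a Baichuan2-7B-style generator); this is not the authors' implementation nor any specific library's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RAGPipeline:
    """Sketch of a RAG-based LFQA system: embed, retrieve, rerank, generate.

    All four components are injected as plain callables so the flow itself
    stays independent of any particular model or retrieval backend.
    """
    embed: Callable[[str], List[float]]              # question -> embedding vector
    search: Callable[[List[float], int], List[str]]  # embedding, top_k -> candidate passages
    rerank: Callable[[str, List[str]], List[str]]    # question, candidates -> re-ordered passages
    generate: Callable[[str, List[str]], str]        # question, context -> long-form answer

    def answer(self, question: str, top_k: int = 20, keep: int = 5) -> str:
        # 1. Embed the question and retrieve candidate passages from the corpus.
        query_vec = self.embed(question)
        candidates = self.search(query_vec, top_k)
        # 2. Rerank the candidates and keep only the most relevant context,
        #    which helps under the noisy-context conditions the paper studies.
        context = self.rerank(question, candidates)[:keep]
        # 3. Generate a long-form answer grounded in the retained context.
        return self.generate(question, context)
```

Keeping each stage as a swappable callable mirrors how the paper compares configurations: any embedder, retriever, reranker, or generator can be substituted without changing the surrounding pipeline.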
The paper highlights the importance of integrating financial regulations and policies into LFQA tasks and provides a rich framework for building and assessing general finance LFQA systems. The experimental analysis underscores the need to enhance current methodologies to improve both the precision and explainability of financial question-answering systems.