FinTextQA is a new dataset for long-form financial question answering (LFQA), containing 1,262 high-quality, source-attributed question-answer pairs extracted from finance textbooks and government agency websites. The dataset includes six question types with an average text length of 19.7k words, curated through five rounds of human screening. It is the first dataset to integrate financial regulations and policies into LFQA, challenging models with more demanding content.
The dataset is designed to assess QA models on both general-finance questions and questions about financial regulations and policies. It includes 1,022 pairs from finance textbooks (80.98% of the dataset) and 240 pairs from policies and regulations (19.02%). The data is split into training, validation, and test sets in a 7:1:2 ratio for model fine-tuning and evaluation.
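As a rough illustration of that 7:1:2 split, the sketch below partitions a list of QA pairs with a fixed random seed. The file name, field schema, and function are hypothetical, not the dataset's actual packaging; with 1,262 pairs it yields roughly 883/126/253 examples.

```python
import json
import random

def split_dataset(path: str, seed: int = 42):
    """Shuffle QA pairs and split them 7:1:2 into train/validation/test."""
    with open(path, encoding="utf-8") as f:
        pairs = json.load(f)  # assumed: a list of QA-pair dicts

    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)

    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

if __name__ == "__main__":
    train, val, test = split_dataset("fintextqa.json")  # placeholder file name
    print(len(train), len(val), len(test))  # ~883 / 126 / 253 for 1,262 pairs
```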
A Retrieval-Augmented Generation (RAG)-based LFQA system is developed, consisting of an embedder, retriever, reranker, and generator. The system is evaluated using human ranking, automatic metrics, and GPT-4 scoring. The results indicate that the most effective system configuration uses Ada2 as the embedder, Automated Merged Retrieval (AMR) as the retriever, Bge-Reranker-Base as the reranker, and Baichuan2-7B as the generator. Models also become less susceptible to noise once context length exceeds a certain threshold.
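The skeleton below shows how the four stages of such a pipeline fit together. It is a minimal sketch, not the paper's implementation: the interfaces are placeholders behind which the named components (Ada2, Automated Merged Retrieval, Bge-Reranker-Base, Baichuan2-7B) would sit, and the prompt format, top_k, and keep values are assumptions.

```python
from dataclasses import dataclass
from typing import List, Protocol

class Embedder(Protocol):
    def embed(self, text: str) -> List[float]: ...

class Retriever(Protocol):
    def retrieve(self, query_vec: List[float], top_k: int) -> List[str]: ...

class Reranker(Protocol):
    def rerank(self, query: str, passages: List[str]) -> List[str]: ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

@dataclass
class RAGPipeline:
    embedder: Embedder
    retriever: Retriever
    reranker: Reranker
    generator: Generator
    top_k: int = 10   # candidates fetched by the retriever (assumed value)
    keep: int = 3     # passages kept after reranking (assumed value)

    def answer(self, question: str) -> str:
        # 1. Embed the question.
        query_vec = self.embedder.embed(question)
        # 2. Retrieve candidate passages from the indexed documents.
        candidates = self.retriever.retrieve(query_vec, self.top_k)
        # 3. Rerank candidates and keep the most relevant ones.
        context = self.reranker.rerank(question, candidates)[: self.keep]
        # 4. Generate a long-form answer conditioned on the retained context.
        prompt = "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"
        return self.generator.generate(prompt)
```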
Compared with existing LFQA and finance QA datasets, FinTextQA offers more complex questions, substantially longer answers, and the widest topical scope.
The RAG-based LFQA system is benchmarked across module configurations; the best-performing configurations use GPT-3.5-turbo and Baichuan2-7B as generators, and in multi-document settings the strongest combination is AMR, Ada2, and Bge-Reranker-Base.
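A benchmarking sweep of this kind can be expressed as a grid over module choices, as in the sketch below. Only the components named in this summary are listed; other candidates would simply extend the lists, and `build_pipeline`, `eval_set`, and `score_fn` are hypothetical hooks (the paper's scoring combines human ranking, automatic metrics, and GPT-4 evaluation).

```python
from itertools import product

# Illustrative candidate lists; extend with any other modules under comparison.
EMBEDDERS = ["Ada2"]
RETRIEVERS = ["Automated Merged Retrieval"]
RERANKERS = ["Bge-Reranker-Base"]
GENERATORS = ["GPT-3.5-turbo", "Baichuan2-7B"]

def benchmark(build_pipeline, eval_set, score_fn):
    """Score every module combination and return them ranked best-first."""
    results = []
    for emb, ret, rer, gen in product(EMBEDDERS, RETRIEVERS, RERANKERS, GENERATORS):
        pipeline = build_pipeline(emb, ret, rer, gen)
        score = score_fn(pipeline, eval_set)
        results.append(((emb, ret, rer, gen), score))
    return sorted(results, key=lambda item: item[1], reverse=True)
```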
The study concludes that FinTextQA is a valuable resource for further research on and evaluation of RAG modules and large language models. The dataset is comprehensive, covering complex financial questions and including queries on financial regulations and policies. The study also introduces a robust evaluation framework that combines human ranking, automatic metrics, and GPT-4 scoring to assess multiple facets of model performance. The results suggest that the most effective combination of modules for finance-related LFQA is Ada2, AMR, Bge-Reranker-Base, and Baichuan2-7B.