FinBen: An Holistic Financial Benchmark for Large Language Models

19 Jun 2024 | Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyuan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang
This paper introduces FinBen, a comprehensive open-source evaluation benchmark designed to assess the capabilities of large language models (LLMs) in the financial domain. FinBen includes 36 datasets covering 24 financial tasks across seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. The benchmark aims to address two obstacles that have hindered the exploration of LLMs' potential in finance: the lack of comprehensive evaluation benchmarks and the complexity of financial tasks.

Key innovations of FinBen include:

1. **Broader Task and Dataset Range**: FinBen covers a wider range of tasks and datasets than existing benchmarks, making it the most holistic evaluation benchmark for financial LLMs.
2. **First Evaluation of Stock Trading**: FinBen introduces the first evaluation of stock trading, a fundamental task in the financial sector.
3. **New Evaluation Strategies**: It includes agent-based evaluation and Retrieval-Augmented Generation (RAG) evaluation, providing a more dynamic and realistic assessment of LLMs.
4. **Novel Datasets**: FinBen proposes three new open-source datasets for text summarization, QA, and stock trading tasks.

The evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and Gemini, reveals several key findings:

1. **Superior Capabilities with Limitations**: LLMs excel in IE and textual analysis but struggle with advanced reasoning and complex tasks like text generation and forecasting.
2. **Potential in Stock Trading**: State-of-the-art (SOTA) LLMs show promise in stock trading but have significant room for improvement in reasoning and comprehensive forecasting.
3. **Closed-Source Superiority**: Closed-source commercial LLMs like GPT-4 and Gemini lead in performance within the financial domain.
4. **Open-Source Improvements and Limitations**: Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks.

FinBen has been used to host the first shared task on financial LLMs, held at the FinNLP-AgentScen workshop during IJCAI-2024, which attracted 12 teams. Their novel solutions outperformed GPT-4, demonstrating the benchmark's potential to drive innovation in financial LLMs. All datasets, results, and code are released for the research community.