FinBen: A Holistic Financial Benchmark for Large Language Models

FinBen is an open-source financial benchmark for large language models (LLMs), consisting of 36 datasets across 24 financial tasks and covering seven critical areas: information extraction, textual analysis, question answering, text generation, risk management, forecasting, and decision-making. It introduces several innovations, including a broader range of tasks, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluations, and three new open-source datasets for text summarization, question answering, and stock trading. The benchmark was used to host the first financial LLM shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their solutions outperformed GPT-4, demonstrating FinBen's potential to drive innovation in financial LLMs.

The evaluation of 15 LLMs, including GPT-4, ChatGPT, and Gemini, revealed that while LLMs excel in information extraction and textual analysis, they struggle with advanced reasoning and with complex tasks such as text generation and forecasting. GPT-4 excels in information extraction and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA.

FinBen provides a comprehensive assessment of LLM capabilities in financial contexts, highlighting their strengths and limitations. The benchmark draws its datasets from three sources: open-source datasets, existing evaluation benchmarks, and novel datasets. It covers tasks such as information extraction, textual analysis, question answering, text generation, forecasting, and risk management. The evaluation results show that closed-source commercial LLMs like GPT-4 perform well in financial tasks, while open-source models show improvements in some areas but face challenges in complex tasks.
The benchmark also highlights the importance of financial decision-making, with GPT-4 achieving the highest Sharpe Ratio and minimal Maximum Drawdown, indicating superior investment performance. FinBen aims to continuously evolve, incorporating additional languages and expanding the range of financial tasks to enhance its applicability and impact. The authors acknowledge limitations, including dataset size, computational constraints, and the focus on American market data. They emphasize responsible usage and ethical guidelines to prevent potential misuse of the benchmark. The study underscores the need for further research to improve LLMs' capabilities in financial tasks, particularly in forecasting and decision-making.
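
For readers unfamiliar with the two trading metrics named above, the following is a minimal sketch of how a Sharpe Ratio and a Maximum Drawdown are commonly computed from a series of per-period returns. It uses the standard textbook definitions; the exact formulas, annualization factor, and risk-free rate used in FinBen are not stated in this summary, so the function names, the default of 252 trading periods per year, and the zero risk-free rate are all assumptions made for illustration.

```python
import math


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns.

    risk_free_rate is an annual rate; periods_per_year=252 is an assumed
    convention for daily data, not a value taken from the benchmark.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    # Excess return over the per-period risk-free rate.
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    mean = sum(excess) / len(excess)
    # Sample variance (divides by n - 1).
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("returns have zero volatility")
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of cumulative wealth, as a positive fraction."""
    wealth = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst
```

Under these definitions, a higher Sharpe Ratio means more return per unit of volatility, and a smaller Maximum Drawdown means a shallower worst-case loss from a previous high, which is why the combination is read as a sign of stronger investment performance.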
