Approaching Human-Level Forecasting with Language Models

28 Feb 2024 | Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt
This paper presents a language model (LM) system that approaches human-level performance in forecasting future events. The system combines information retrieval with reasoning to generate forecasts and is evaluated against human forecasters on a large dataset of questions from competitive forecasting platforms. Performance is measured with the Brier score, a standard accuracy metric in forecasting. To rule out leakage from pre-training, the test set contains only questions published after the LMs' knowledge cut-off. The system automatically searches for relevant information, generates forecasts, and aggregates multiple predictions into a final probability; a self-supervised fine-tuning step further improves the LM's ability to reason about forecasting tasks. The results show that the system approaches the accuracy of aggregated human forecasts and surpasses the human crowd in certain settings, suggesting that LM-based forecasting could provide accurate predictions at scale and help inform institutional decision making. The paper also discusses the system's limitations, including its reliance on pre-trained knowledge and the need for further research to improve its performance.
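For readers unfamiliar with the evaluation metric: the Brier score for binary questions is the mean squared difference between the forecast probability and the realized outcome, so lower is better and a constant forecast of 0.5 scores 0.25. The sketch below is a minimal Python illustration of this metric and of one plausible way to aggregate several sampled LM forecasts per question; the `aggregate_forecasts` helper and its median rule are assumptions for illustration, not the authors' exact aggregation method.

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    forecasts: probabilities in [0, 1], one per question.
    outcomes:  realized outcomes (0 or 1) for the same questions.
    Lower is better; an uninformed forecast of 0.5 scores 0.25.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

def aggregate_forecasts(model_probs):
    """Hypothetical aggregation step: combine several LM forecasts for one
    question by taking their median. This is an illustrative assumption,
    not necessarily the paper's aggregation rule."""
    return float(np.median(model_probs))

# Example: three questions, each with a few sampled LM forecasts.
per_question_samples = [[0.70, 0.80, 0.75], [0.20, 0.10, 0.15], [0.60, 0.55, 0.50]]
aggregated = [aggregate_forecasts(s) for s in per_question_samples]
outcomes = [1, 0, 1]
print(brier_score(aggregated, outcomes))  # lower is better
```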