FNSPID: A Comprehensive Financial News Dataset in Time Series

July 2017 | Zihan Dong, Xinyu Fan, Zhiyuan Peng
FNSPID is a large-scale financial dataset that combines 29.7 million stock prices with 15.7 million time-aligned financial news records for 4,775 S&P500 companies, covering 1999 to 2023. By integrating quantitative price data with qualitative sentiment data, it serves as a resource for financial market analysis and supports machine learning techniques, including transformer-based models. The dataset is reproducible and can be updated over time.
FNSPID is constructed by collecting numerical stock data and news text from sources including NASDAQ, Bloomberg, Reuters, and others. News articles are condensed with several extractive summarization methods: LexRank, Luhn, Latent Semantic Analysis (LSA), and TextRank. Sentiment scores are then quantified using ChatGPT, and gaps in the sentiment series (days with no news) are handled with an exponential decay method.
The dataset is evaluated for both quantity and quality. The evaluation shows that larger datasets and high-quality sentiment information significantly improve prediction accuracy, while adding sentiment scores yields a modest improvement on transformer-based models.
FNSPID can be used for financial prediction, sentiment analysis, and risk management, and it is constructed with attention to ethics, privacy, and data security. Its limitations include potential changes in the policies of source websites and the need for ongoing model validation. Future work includes expanding the dataset and exploring its applications in multi-modal models and financial generative AI.
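The exponential decay method for sentiment gaps can be illustrated with a short sketch. The summary does not give the formula, so everything specific here is an assumption made for illustration: the 1 to 5 sentiment scale, the neutral value of 3, the `decay_rate` parameter, and the function name `fill_sentiment_gaps` are all hypothetical. The idea shown is that on a day without news, the last observed score is pulled toward neutral by a factor that shrinks exponentially with the number of days since the last news item.

```python
import math
from typing import List, Optional, Sequence

# Assumption: sentiment lives on a 1-5 scale with 3 as the neutral point.
NEUTRAL = 3.0


def fill_sentiment_gaps(
    scores: Sequence[Optional[float]],
    decay_rate: float = 0.5,  # hypothetical rate, not taken from the paper
    neutral: float = NEUTRAL,
) -> List[float]:
    """Fill missing daily scores (None) by decaying the last observed
    score toward `neutral` as the gap grows."""
    filled: List[float] = []
    last_score: Optional[float] = None
    days_since = 0
    for score in scores:
        if score is not None:
            # A day with news: keep the observed score and reset the gap.
            last_score = float(score)
            days_since = 0
            filled.append(last_score)
        elif last_score is None:
            # No news has been seen yet: fall back to neutral.
            filled.append(neutral)
        else:
            # A gap day: shrink the deviation from neutral exponentially.
            days_since += 1
            weight = math.exp(-decay_rate * days_since)
            filled.append(neutral + (last_score - neutral) * weight)
    return filled


daily_scores = [5.0, None, None, 1.0, None]
filled_scores = fill_sentiment_gaps(daily_scores)
print([round(s, 3) for s in filled_scores])
```

With these assumed settings, a strongly positive score of 5 fades toward 3 over the following news-free days, and a new observation resets the series immediately. A larger `decay_rate` makes old news lose influence faster.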
FNSPID is a significant advancement in financial forecasting, filling key gaps in existing resources and enabling more accurate financial analysis.