The Pushshift Reddit Dataset

23 Jan 2020 | Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn
The Pushshift Reddit dataset is a large-scale collection of Reddit data comprising over 651 million submissions and 5.6 billion comments posted from 2005 to 2019. It is hosted on Pushshift.io, a platform that collects, analyzes, and archives Reddit data in real time. The dataset is available both as monthly dumps and through an API. The API lets researchers run searches, aggregations, and exploratory analyses against the entire dataset without downloading the large monthly dumps, which reduces the storage capacity they need and makes the data accessible to a wider range of users. Pushshift also offers a Slackbot that lets researchers query the data in real time and produce visualizations for discussion. The dataset aligns with the FAIR principles: it is findable, accessible, interoperable, and reusable. It has been used in research on online community governance, online extremism, online disinformation, web science, and health informatics, and it has been cited in over 100 peer-reviewed publications. By providing a reliable and accessible source of social media data, the dataset addresses challenges in data collection and access, particularly in the post-API era.
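
To make the API-based workflow described above concrete, here is a minimal Python sketch of how a search request might be assembled. The base URL, the `submission` and `comment` endpoint names, the parameter names (`q`, `subreddit`, `size`), and the `data` key in the response are assumptions based on how the Pushshift API has commonly been documented, not details stated in this summary. The helper names `build_search_url` and `fetch` are hypothetical. API availability and access rules have changed over time, so check the current documentation before relying on any of this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the Pushshift search API (not given in the summary).
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, **params):
    """Build a search URL for 'submission' or 'comment' records.

    Parameter names such as q, subreddit, and size are assumptions.
    """
    if kind not in ("submission", "comment"):
        raise ValueError("kind must be 'submission' or 'comment'")
    # Drop unset parameters and sort keys so the URL is deterministic.
    pairs = sorted((k, v) for k, v in params.items() if v is not None)
    return f"{BASE_URL}/{kind}/?{urlencode(pairs)}"


def fetch(url, timeout=30):
    """Fetch a URL and return the list under the assumed 'data' key."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("data", [])


# Build (but do not send) a query for comments mentioning a keyword.
url = build_search_url("comment", q="vaccine", subreddit="science", size=25)
print(url)
```

Calling `fetch(url)` would send the request. Building the URL separately from fetching it keeps the query logic testable without network access, which is useful when an endpoint is rate-limited or unavailable.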