23 Jan 2020 | Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn
The Pushshift Reddit Dataset is a comprehensive collection of Reddit data, including submissions and comments from 2005 to 2019. The dataset is available in monthly dumps and through an API, making it accessible to researchers for various applications such as studying online community governance, extremism, disinformation, and health informatics. Pushshift, the platform that collects and provides this data, has a robust infrastructure for data collection, storage, and analysis, and it supports a wide range of research topics. The dataset has been used in over 100 peer-reviewed publications, highlighting its value and impact in the scientific community. The paper also discusses related work and the challenges in data collection and access, emphasizing the importance of datasets in advancing research in social media and related fields.