June 2024 | Sajjad Dadkhah, Xichen Zhang, Alexander Gerald Weismann, Amir Firouzi, and Ali A. Ghorbani
The TruthSeeker dataset is one of the largest ground-truth datasets for real/fake content detection on social media, containing over 180,000 labels from 2009 to 2022. It was created by crawling and crowd-sourcing data, with expert and crowd-sourced labeling using Amazon Mechanical Turk. The dataset includes binary and multiclass classifications, and was validated through multiple levels of verification to ensure accuracy. The dataset was used to train and test various machine learning and deep learning models, including BERT-based models, to detect fake content in tweets. Additionally, the dataset was analyzed using clustering algorithms to identify topics and relationships between tweets. The dataset also includes user scores such as bot score, credibility score, and influence score to better understand user behavior and the impact of their tweets. The results showed significant improvements in detecting fake content, especially for short-length texts. The TruthSeeker dataset is available for download and is a valuable resource for researchers in the field of fake news detection. The dataset was created by the Canadian Institute for Cybersecurity to help address the challenge of automatically detecting fake content on social media platforms.
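
As a rough illustration of how several crowd-sourced judgments for one tweet might be reduced to a single label, the sketch below applies a majority vote with a minimum-agreement threshold. The summary above does not say how the Mechanical Turk votes were actually combined, so the voting rule, the function name aggregate_labels, and the 0.6 threshold are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.6):
    """Combine worker votes for one tweet into a single label.

    Hypothetical helper: majority vote with an agreement threshold.
    Returns (label, agreement); label is None when there are no votes
    or when the top label's share falls below min_agreement.
    """
    if not votes:
        return None, 0.0
    counts = Counter(votes)
    # most_common(1) yields the label with the highest vote count.
    label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement < min_agreement:
        # Too little consensus: leave the tweet unlabeled for review.
        return None, agreement
    return label, agreement


if __name__ == "__main__":
    # Invented votes, for demonstration only.
    print(aggregate_labels(["fake", "fake", "real"]))
    print(aggregate_labels(["real", "fake"]))
```

With the threshold above 0.5, an evenly split vote never yields a label, so such tweets can be routed to an additional round of review instead of being assigned a label arbitrarily.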