2024 | Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
WILDChat is a large-scale dataset of 1 million user-ChatGPT conversations comprising over 2.5 million interaction turns. It was collected by offering free access to ChatGPT in exchange for users' consent to the anonymous collection of their chat transcripts and request headers. The dataset includes demographic information such as state, country, and hashed IP address, along with request headers, enabling detailed analysis of user behavior across regions and time periods. WILDChat is released under AI2 ImpACT Licenses and is available at https://wildchat.allen.ai.
The dataset provides a comprehensive, multi-turn, multilingual collection of user-chatbot interactions that approximates real-world conversations more closely than existing datasets. It spans a wide range of languages and diverse user prompts, including a rich variety of potentially toxic use cases: over 10% of interactions are toxic, underscoring the need for further research into toxic chatbot interactions.
WILDChat was collected using two chatbot services powered by the GPT-3.5-Turbo and GPT-4 APIs between April 9, 2023, and May 1, 2024. All users gave consent, and the dataset includes anonymized demographic information. The data was processed to protect privacy: personally identifiable information (PII) was removed and IP addresses were hashed. The dataset supports a variety of research purposes, including instruction-tuning chatbots and analyzing user behavior.
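The IP-hashing step described above can be sketched as a salted one-way hash. This is a hypothetical illustration, not the paper's actual anonymization pipeline: the salt value and the choice of SHA-256 are assumptions, and the real scheme is not specified in the text.

```python
import hashlib

# Assumption: a private random salt kept out of the released data, so that
# hashes cannot be trivially reversed by enumerating the IPv4 space.
SECRET_SALT = b"replace-with-a-private-random-salt"

def hash_ip(ip: str) -> str:
    """Return a salted SHA-256 digest of an IP address.

    Raw IPs are never stored, but identical IPs map to the same
    identifier, which preserves per-user analysis across sessions.
    """
    return hashlib.sha256(SECRET_SALT + ip.encode("utf-8")).hexdigest()

# Same IP always yields the same digest; different IPs yield different ones.
print(hash_ip("203.0.113.7"))
print(hash_ip("203.0.113.7") == hash_ip("203.0.113.7"))  # True
```

The salted construction keeps hashed IPs linkable within the dataset (useful for studying repeat users) while making recovery of the original address impractical without the salt.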
A toxicity analysis revealed that sexual content is the most prevalent type of toxicity. The dataset also contains a significant number of jailbreaking prompts, i.e., attempts by users to trick chatbots into generating restricted outputs. Finally, the dataset was used for instruction-tuning, with fine-tuning a language model on the raw conversations yielding a strong chatbot.
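Instruction-tuning on the raw conversations, as described above, typically requires flattening each multi-turn record into an alternating user/assistant message list. A minimal sketch follows; the field names (`conversation`, `role`, `content`) are assumptions about the record layout for illustration, not a documented schema.

```python
def to_chat_messages(record: dict) -> list[dict]:
    """Flatten a conversation record into an alternating user/assistant
    message list, the common input format for chat fine-tuning."""
    return [
        {"role": turn["role"], "content": turn["content"]}
        for turn in record["conversation"]
    ]

# Toy record shaped like a single two-turn conversation (hypothetical fields).
example = {
    "conversation": [
        {"role": "user", "content": "What is WildChat?"},
        {"role": "assistant", "content": "A dataset of user-ChatGPT conversations."},
    ]
}

messages = to_chat_messages(example)
print(len(messages))  # 2
```

In practice each flattened message list would be serialized with the target model's chat template before tokenization.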
The dataset has several limitations, including potential demographic bias and toxicity selection bias. Despite these limitations, it is valuable for research in conversational AI, user behavior analysis, and AI ethics, and is released under AI2 ImpACT Licenses for use across these research areas.