WILDCHAT: 1M ChatGPT Interaction Logs in the Wild

2 May 2024 | Wenting Zhao1*, Xiang Ren2,3, Jack Hessel2, Claire Cardie1, Yejin Choi2,4, Yuntian Deng2*
The paper introduces WILDCHAT, a dataset of 1 million user-ChatGPT conversations collected through a publicly accessible chatbot service. The dataset comprises over 2.5 million interaction turns along with demographic details such as state, country, and hashed IP addresses, providing a rich resource for studying real-world, multi-turn, multilingual user-chatbot interactions. Key findings include:

1. **Dataset Statistics**: WILDCHAT covers a wide range of languages and diverse user prompts, and contains a notable amount of toxic content (10.46% of user turns are flagged as toxic). It is the most diverse dataset in terms of user prompts and languages when compared to Alpaca, Open Assistant, Dolly, ShareGPT, and LMSYS-Chat-1M.
2. **Toxicity Analysis**: Sexual content is the most prevalent category of toxic material. The toxicity rate has declined since June 2023, possibly due to an OpenAI model update. The dataset also includes examples of "jailbreaking" prompts, in which users attempt to trick the chatbot into generating restricted outputs.
3. **Instruction Following**: WILDCHAT is used to fine-tune a Llama-2 7B model, producing a new model called WILDLLAMA. WILDLLAMA outperforms other open-source models of comparable size but still trails proprietary models such as GPT-3.5 and GPT-4, demonstrating the dataset's utility for instruction tuning and its potential for advancing conversational AI.
4. **Ethical Considerations**: The paper addresses ethical concerns by removing personally identifiable information (PII) and storing only hashed IP addresses to protect user privacy. Internal reviews ensure compliance with data protection laws and ethical standards.
5. **Conclusion**: WILDCHAT fills a gap in conversational AI research by providing a realistic dataset for studying user interactions and toxicity.
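The toxicity figure above is a simple proportion over user turns. A minimal sketch of that computation, assuming each turn carries a boolean `toxic` flag (the field names here are illustrative, not the released dataset's exact schema):

```python
def toxicity_rate(turns):
    """Fraction of user turns flagged as toxic.

    `turns` is a list of dicts with "role" and "toxic" keys
    (an illustrative schema, not the exact released field names).
    """
    user_turns = [t for t in turns if t.get("role") == "user"]
    if not user_turns:
        return 0.0
    flagged = sum(1 for t in user_turns if t.get("toxic"))
    return flagged / len(user_turns)


# Toy example: one of two user turns is flagged.
turns = [
    {"role": "user", "toxic": True},
    {"role": "assistant", "toxic": False},
    {"role": "user", "toxic": False},
]
print(toxicity_rate(turns))  # 0.5
```

Note that only user turns enter the denominator, matching how the paper reports the 10.46% figure (per user turn, not per conversation).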
It has potential applications in computational social science, user behavior analysis, and AI ethics.
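The hashed IP addresses mentioned under ethical considerations illustrate a standard anonymization pattern: a keyed hash yields a stable per-user identifier without exposing the raw address. A minimal sketch using only Python's standard library (the salt and hashing scheme here are assumptions for illustration, not the paper's exact procedure):

```python
import hashlib
import hmac

# Hypothetical secret salt; in practice this would be a long random
# value kept private, so hashes cannot be reversed by brute force.
SECRET_SALT = b"replace-with-a-long-random-secret"


def hash_ip(ip: str) -> str:
    """Keyed SHA-256 hash of an IP address: the same IP always maps
    to the same digest, but the digest cannot be inverted without
    the secret salt."""
    return hmac.new(SECRET_SALT, ip.encode("utf-8"), hashlib.sha256).hexdigest()


a = hash_ip("203.0.113.7")
b = hash_ip("203.0.113.7")
c = hash_ip("203.0.113.8")
print(a == b, a == c)  # True False
```

The stable digest is what lets researchers group conversations by (anonymized) user or geography while the raw IP is never stored.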