24 Jul 2024 | Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Mustafa Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, and Sandy Pentland
The paper "Consent in Crisis: The Rapid Decline of the AI Data Commons" by Shayne Longpre et al. examines how quickly consent to use web data for AI training is eroding. The authors conduct a large-scale, longitudinal audit of 14,000 web domains underlying three major open-source AI training corpora: C4, RefinedWeb, and Dolma. They find a significant increase in data restrictions from web sources: within a single year (2023-2024), over 5% of all tokens in C4, and 28% of its most actively maintained, critical sources, became fully restricted. The restrictions are more pronounced for OpenAI's crawlers than for those of other developers, and robots.txt files frequently disagree with the same sites' Terms of Service, which suggests that current protocols are ineffective at expressing consent for AI use. The study also shows that the head of the web domain distribution (news, encyclopedias, and social media) has higher rates of user content, multi-modal content, and monetization, while the long tail consists of organization websites, blogs, and e-commerce sites. Additionally, the real-world usage of conversational AI does not align with the content of the training datasets, particularly for creative composition and sexual role-play. The authors conclude that the growing restrictions will skew data representativity, freshness, and scaling laws, affecting commercial AI, non-commercial AI, and academic research alike. They advocate for better protocols through which website owners can express their intentions and consent, and control how their data is used.
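To make the robots.txt part of the audit concrete, the following is a minimal sketch of how a site's robots.txt can be checked against individual crawler user agents, using Python's standard `urllib.robotparser`. It is not the authors' pipeline: the sample file, the probe paths, the `classify` helper, and the three labels are assumptions made for illustration. `GPTBot` (OpenAI) and `CCBot` (Common Crawl) are real crawler user-agent tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""


def classify(robots_txt, agent, probe_paths=("/", "/private/page.html")):
    """Label an agent's access by probing a few paths (a simplification)."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    allowed = [parser.can_fetch(agent, path) for path in probe_paths]
    if all(allowed):
        return "unrestricted"
    if not any(allowed):
        return "fully restricted"
    return "partially restricted"


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(f"{agent}: {classify(SAMPLE_ROBOTS_TXT, agent)}")
```

Probing a handful of paths only approximates a full reading of the rules, and robots.txt says nothing about a site's Terms of Service, which is exactly the kind of gap between the two signals that the paper reports.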