24 Jul 2024 | Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Mustafa Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Maticunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, and Sandy Pentland
The article "Consent in Crisis: The Rapid Decline of the AI Data Commons" presents a large-scale audit of the consent protocols of the web domains underlying AI training corpora. It finds that web domains have rapidly increased restrictions on AI data use: over 5% of all tokens in C4, and 28% of its most critical sources, are now fully restricted through robots.txt. Terms of Service restrictions have also increased, with 45% of C4 now restricted. The audit reveals inconsistencies between the intentions websites express in their Terms of Service and those in their robots.txt, and shows that AI-specific clauses are proliferating to limit use. AI developers face varying levels of restriction, with OpenAI the most restricted. The most critical web domains are also the most likely to carry restrictions, and the composition of web data is shifting toward more monetized and multi-modal content.
These restrictions may bias AI systems by affecting the diversity and freshness of their training data and the scaling laws that describe them. The study also highlights a mismatch between real-world AI usage and web data: ChatGPT is often used for creative tasks that are poorly represented in training data. Together, the findings suggest that the open web is becoming increasingly restricted, which could affect both AI development and academic research. The study calls for better protocols through which websites can express intentions and consent and communicate their data use preferences more effectively, and it underscores the importance of data provenance, consent, and composition in AI systems.
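To make the robots.txt side of the audit concrete, here is a minimal sketch of how per-crawler restrictions can be read from a robots.txt file with Python's standard `urllib.robotparser`. It is not the study's methodology. The sample file, the probe paths, the helper name `classify_restrictions`, and the three-way labels are assumptions made up for illustration; the agent names are example crawler tokens chosen for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""

# Example crawler tokens to check (an assumption, not the study's list).
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def classify_restrictions(robots_txt, agents, probe_paths=("/", "/private/page.html")):
    """Label each agent by whether it may fetch all, some, or none of the probe paths."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())

    labels = {}
    for agent in agents:
        allowed = [parser.can_fetch(agent, path) for path in probe_paths]
        if all(allowed):
            labels[agent] = "unrestricted"
        elif not any(allowed):
            labels[agent] = "fully restricted"
        else:
            labels[agent] = "partially restricted"
    return labels


if __name__ == "__main__":
    for agent, label in classify_restrictions(SAMPLE_ROBOTS_TXT, AI_AGENTS).items():
        print(f"{agent}: {label}")
```

Because the labels depend only on the chosen probe paths, this is a coarse approximation: a site counts as "fully restricted" here only relative to the paths tested. Terms of Service restrictions are written in natural language and cannot be checked this way, which is one reason the two signals can disagree.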