[slides and audio] Consent in Crisis: The Rapid Decline of the AI Data Commons

The article "Consent in Crisis: The Rapid Decline of the AI Data Commons" presents a large-scale audit of the consent protocols of the web domains underlying AI training corpora. It finds that web domains have rapidly increased restrictions on AI data use: over 5% of all tokens in C4, and 28% of its most critical sources, are now fully restricted. Terms of Service restrictions have also increased and now cover 45% of C4. These restrictions may bias AI systems by reducing the diversity and freshness of their training data and by affecting their scaling laws. The audit reveals inconsistencies between the intentions websites express in their Terms of Service and those in their robots.txt, and shows that AI-specific clauses are proliferating to limit use. AI developers face varying levels of restriction, with OpenAI the most restricted. The most critical web domains are more likely to have restrictions, and the composition of web data is shifting toward more monetized and multi-modal content. The study also highlights a mismatch between real-world AI usage and web data: ChatGPT is used for creative tasks that are poorly represented in training data. Together, the findings suggest that the open web is becoming increasingly restricted, which could affect AI development and academic research. The study calls for better protocols for expressing intentions and consent and for communicating data use preferences more effectively, and it underscores the importance of data provenance, consent, and composition in AI systems.
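To make the robots.txt side of the audit concrete, here is a minimal sketch of how a single domain's restrictions on AI crawlers could be checked with Python's standard `urllib.robotparser`. It is not the study's own pipeline: the sample robots.txt, the probe paths, the three-way labels, and the function name `classify_restrictions` are all assumptions made for illustration. The user-agent strings are examples of crawler tokens that sites commonly list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example (not taken from the study).
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""

# Example crawler user-agent tokens to audit (an illustrative, incomplete list).
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def classify_restrictions(robots_txt, agents, probe_paths=("/", "/private/page.html")):
    """Label each agent by how many of the probe paths it may fetch.

    The labels are a simplification for illustration; a real audit would
    inspect the rules themselves rather than probe a handful of paths.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access

    labels = {}
    for agent in agents:
        allowed = [parser.can_fetch(agent, path) for path in probe_paths]
        if all(allowed):
            labels[agent] = "unrestricted"
        elif not any(allowed):
            labels[agent] = "fully restricted"
        else:
            labels[agent] = "partially restricted"
    return labels


if __name__ == "__main__":
    for agent, label in classify_restrictions(SAMPLE_ROBOTS_TXT, AI_AGENTS).items():
        print(f"{agent}: {label}")
```

Running the same check against snapshots of a domain's robots.txt from different dates would show whether its restrictions changed over time. Terms of Service restrictions are written in natural language, so they cannot be checked this way, which is one reason the two signals can disagree.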