[slides and audio] The Dark Side of Dataset Scaling%3A Evaluating Racial Classification in Multimodal Models

The paper "The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models" by Abeba Birhane, Sepehr Dehdashtian, Vinay Uday Prabhu, and Vishnu Boddehi examines the impact of dataset scaling on the performance and biases of 14 vision-linguistic models (VLMs) trained on the LAION400-M and LAION-2B datasets. The study uses the Chicago Face Dataset (CFD) as a probe to measure racial and gender bias. Key findings include: 1. **Model Performance**: As the training data increased, the probability of pre-trained CLIP models misclassifying human images as non-human offensive classes (e.g., chimpanzee, gorilla, orangutan) decreased. However, the probability of misclassifying these images as human offensive classes (e.g., criminal) increased. 2. **Racial Bias**: For larger ViT-L models, the probability of predicting Black and Latino men as "criminal" increased significantly when the dataset was scaled from 400M to 2B samples. In contrast, smaller ViT-B models showed a decrease in this probability. 3. **Qualitative Analysis**: The study highlights the dehumanization and criminalization of Black bodies, aligning with existing literature on systemic racism and historical injustices. 4. **Recommendations**: The authors recommend avoiding ad-hoc decision-making in dataset curation, being cautious about physiognomic biases, ensuring thorough audits and evaluations, and advocating for open access to datasets and models to facilitate independent audits and regulatory development. 5. **Future Work**: The paper suggests extending the analysis to other CLIP models, exploring different prompt templates and class designs, and investigating biases in other face datasets. The study underscores the importance of rigorous dataset curation and model auditing to address and mitigate racial and gender biases in AI systems.The paper "The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models" by Abeba Birhane, Sepehr Dehdashtian, Vinay Uday Prabhu, and Vishnu Boddehi examines the impact of dataset scaling on the performance and biases of 14 vision-linguistic models (VLMs) trained on the LAION400-M and LAION-2B datasets. The study uses the Chicago Face Dataset (CFD) as a probe to measure racial and gender bias. Key findings include: 1. **Model Performance**: As the training data increased, the probability of pre-trained CLIP models misclassifying human images as non-human offensive classes (e.g., chimpanzee, gorilla, orangutan) decreased. However, the probability of misclassifying these images as human offensive classes (e.g., criminal) increased. 2. **Racial Bias**: For larger ViT-L models, the probability of predicting Black and Latino men as "criminal" increased significantly when the dataset was scaled from 400M to 2B samples. In contrast, smaller ViT-B models showed a decrease in this probability. 3. **Qualitative Analysis**: The study highlights the dehumanization and criminalization of Black bodies, aligning with existing literature on systemic racism and historical injustices. 4. **Recommendations**: The authors recommend avoiding ad-hoc decision-making in dataset curation, being cautious about physiognomic biases, ensuring thorough audits and evaluations, and advocating for open access to datasets and models to facilitate independent audits and regulatory development. 5. **Future Work**: The paper suggests extending the analysis to other CLIP models, exploring different prompt templates and class designs, and investigating biases in other face datasets. The study underscores the importance of rigorous dataset curation and model auditing to address and mitigate racial and gender biases in AI systems.

The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models

June 3–6, 2024, Rio de Janeiro, Brazil | ABEBA BIRHANE, SEPEHR DEHDASHTIAN, VINAY UDAY PRABHU, VISHNU BODDETI