2024 | Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, Alexander D'Amour
This paper investigates the correspondence between decontextualized "trick tests" and Realistic Use and Tangible Effects (RUTEd) evaluations of gender-occupation bias in language models. The authors compare three decontextualized evaluations adapted from the current literature with three analogous RUTEd evaluations applied to long-form content generation tasks. They conduct these evaluations on seven instruction-tuned LLMs, including Llama-2 and Flan-PaLM models. The RUTEd evaluations involve repeated trials of three text generation tasks: children's bedtime stories, user personas, and English language learning exercises. The results show no significant correlation between the trick tests and the RUTEd evaluations: selecting the least biased model based on decontextualized results coincides with selecting the best-performing model on the RUTEd evaluations only about as often as random chance. The authors conclude that evaluations not grounded in realistic use are likely insufficient to assess and mitigate bias and real-world harms, and they advocate for more sociotechnical evaluations of AI, tailored to specific contexts of use, to better understand and address potential biases in LLMs.
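The core comparison can be illustrated with a minimal sketch: given a bias score per model under each evaluation style, check the rank correlation across models and whether the "least biased" model agrees between the two. The scores below are randomly generated placeholders, not the paper's data, and the metric names are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical bias scores (lower = less biased) for 7 models under the two
# evaluation styles; values are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
models = [f"model_{i}" for i in range(7)]
trick_test_scores = rng.uniform(0, 1, size=7)  # decontextualized "trick test" metric
ruted_scores = rng.uniform(0, 1, size=7)       # RUTEd metric from long-form generations

# Rank correlation between the two evaluation styles across models.
rho, p_value = spearmanr(trick_test_scores, ruted_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")

# Does the model judged least biased by trick tests also win under RUTEd?
best_by_trick = models[int(np.argmin(trick_test_scores))]
best_by_ruted = models[int(np.argmin(ruted_scores))]
print(f"Least biased (trick tests): {best_by_trick}")
print(f"Least biased (RUTEd):       {best_by_ruted}")
print("Agreement:", best_by_trick == best_by_ruted)
```

Under the paper's finding, such agreement occurs only about as often as it would for independently shuffled rankings.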