6 Jun 2024 | Michael J. Ryan, William Held, Diyi Yang
The paper explores the unintended impacts of Large Language Model (LLM) alignment on global representation, focusing on English dialects, multilingualism, and global opinions. Current alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), aim to align models with user preferences but may inadvertently create disparities. The study evaluates how alignment affects performance across three global representation axes: English dialects, multilingualism, and opinions from and about countries worldwide.
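For readers unfamiliar with preference tuning, the sketch below shows the standard DPO objective (as introduced by Rafailov et al.), which trains a policy directly on preference pairs without a separate reward model. This is a minimal illustration of the general technique, not code from this paper; tensor names and the beta value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs.

    Each argument holds the summed log-probability of a response under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```

Whatever preference data feeds this objective (or an RLHF reward model) determines whose preferences the model is pulled toward, which is the mechanism behind the disparities discussed next.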
Results show that alignment improves multilingual performance in several languages but widens disparities between English dialects: the performance gap grows from roughly 1% before alignment to as much as 17.1% afterward. Alignment also increases the similarity between model responses and opinions from the US relative to opinions from major nations in other regions, such as China, Jordan, and Nigeria.
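One simple way to quantify such a dialect gap is the spread between the best- and worst-served dialect on the same task. The snippet below is a hypothetical illustration of that calculation; the dialect names and accuracy values are placeholders, not the paper's measurements or its exact metric.

```python
# Hypothetical per-dialect accuracies for one model (illustrative values only).
accuracies = {
    "Standard American English": 0.83,
    "Indian English": 0.71,
    "Nigerian English": 0.69,
}

def dialect_disparity(scores: dict[str, float]) -> float:
    """Gap between the best- and worst-served dialect, in percentage points."""
    return (max(scores.values()) - min(scores.values())) * 100

print(f"Disparity: {dialect_disparity(accuracies):.1f} percentage points")
```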
The study also finds that alignment procedures can lead to biased representations of global opinions, with models showing higher agreement with US opinions than with those of other countries. The Starling 7B Reward Model, for example, rates opinions from 99.4% of other countries more negatively than those from the USA, although this bias does not propagate to the language model preference-tuned with that reward model. The paper discusses the design decisions that led to these unintended impacts and recommends more equitable preference tuning. The authors make their code and data publicly available on GitHub.
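To make a figure like the 99.4% concrete, a comparison of this kind can be computed by scoring opinion-matched responses per country with the reward model and counting how many countries fall below the USA. The sketch below is a hedged illustration of that bookkeeping; the country scores are invented placeholders, not Starling 7B Reward Model outputs.

```python
# Hypothetical mean reward scores for responses matching each country's
# survey opinions (illustrative values, not the paper's data).
mean_rewards = {
    "USA": 1.42,
    "China": 0.88,
    "Jordan": 0.91,
    "Nigeria": 0.85,
}

def share_rated_below_usa(rewards: dict[str, float]) -> float:
    """Fraction of non-US countries whose responses score below the USA."""
    us_score = rewards["USA"]
    others = [score for country, score in rewards.items() if country != "USA"]
    return sum(score < us_score for score in others) / len(others)

print(f"{share_rated_below_usa(mean_rewards):.1%} of countries scored below the USA")
```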
The study highlights the importance of transparency in the alignment process and the need for more diverse and representative preference data. It also emphasizes the role of the training data in shaping the model's behavior, particularly in out-of-distribution settings. The findings suggest that while alignment can improve performance in certain languages, it may also exacerbate biases and disparities in global representation. The paper concludes with recommendations for more equitable and transparent alignment practices.