PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences


12 Jun 2024 | Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak
Large pre-trained foundation models are not directly deployable without alignment to human preferences. Alignment is typically done by collecting pairwise comparisons and learning a reward model or policy, with the Bradley-Terry-Luce (BTL) model serving as a proxy for human preferences. However, this approach assumes a single, universal preference and lacks the flexibility to adapt to diverse opinions.

PAL proposes a framework that models human preferences from the ground up, incorporating plurality. It adopts the ideal point model to view alignment through preference comparisons, capturing diverse preferences while learning a common latent space. PAL uses the penultimate-layer representation of large foundation models together with simple MLP layers to learn reward functions that are on par with existing state-of-the-art models. Experiments show that PAL achieves competitive reward-model accuracy on multiple datasets, including a new semi-synthetic heterogeneous dataset.

The framework is agnostic to modality, can be applied across domains, and adapts to heterogeneous preferences in synthetic, semi-synthetic, and real data. PAL's mixture modeling approach allows for personalized learning and generalization to unseen users, although it may not generalize to users whose preferences fall outside the convex hull of the learned prototypes (see the sketches below).

PAL also highlights the shortcomings of current preference datasets created with rigid rubrics, which wash out heterogeneity, and calls for more nuanced data collection. The work contributes to building foundations for pluralistic alignment in ML/AI models.
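To make the ideal-point and mixture ideas above concrete, here is a minimal, illustrative PyTorch sketch. It is not the authors' implementation, and the names (PALRewardModel, num_prototypes, latent_dim, etc.) are hypothetical: frozen penultimate-layer embeddings are mapped by a small MLP into a shared latent space, each user is a convex combination of K learned prototype "ideal points", and an item scores higher the closer it lies to that user's ideal point. Pairwise comparisons are fit with a logistic loss on the reward margin.

```python
# Minimal sketch of an ideal-point reward model with prototype mixtures.
# Assumes frozen foundation-model (penultimate-layer) embeddings as inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PALRewardModel(nn.Module):
    def __init__(self, embed_dim: int, latent_dim: int = 64,
                 num_prototypes: int = 5, num_users: int = 1000):
        super().__init__()
        # Small MLP on top of frozen foundation-model features.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # K prototype ideal points in the shared latent space.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, latent_dim))
        # Per-user logits over prototypes; softmax gives convex mixture weights.
        self.user_logits = nn.Parameter(torch.zeros(num_users, num_prototypes))

    def ideal_point(self, user_ids: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.user_logits[user_ids], dim=-1)   # (B, K)
        return weights @ self.prototypes                          # (B, latent_dim)

    def reward(self, user_ids: torch.Tensor, item_embed: torch.Tensor) -> torch.Tensor:
        # Higher reward = closer to the user's ideal point (negative squared distance).
        z = self.mlp(item_embed)                                  # (B, latent_dim)
        a = self.ideal_point(user_ids)                            # (B, latent_dim)
        return -((z - a) ** 2).sum(dim=-1)

def pairwise_loss(model, user_ids, winner_embed, loser_embed):
    # Pairwise-comparison likelihood: the chosen item should score higher.
    margin = model.reward(user_ids, winner_embed) - model.reward(user_ids, loser_embed)
    return F.binary_cross_entropy_with_logits(margin, torch.ones_like(margin))
```

The convex mixture over a small number of prototypes is what distinguishes this from a single BTL-style reward head: different users can disagree on the same pair while still sharing one latent space.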
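The generalization claim, and its stated limitation, can be sketched the same way. Continuing the example above (reusing its imports and PALRewardModel), a hypothetical adaptation routine freezes the shared MLP and prototypes and fits only a new user's convex mixture weights from a few of their pairwise comparisons; because the inferred ideal point is a convex combination of the prototypes, a user whose preferences lie outside that convex hull cannot be represented.

```python
# Hypothetical adaptation to an unseen user: fit only the mixture weights.
def adapt_new_user(model: PALRewardModel, winner_embed, loser_embed, steps: int = 200):
    # One new set of prototype logits for the unseen user.
    new_logits = torch.zeros(1, model.prototypes.shape[0], requires_grad=True)
    opt = torch.optim.Adam([new_logits], lr=0.05)
    for _ in range(steps):
        weights = F.softmax(new_logits, dim=-1)            # (1, K) convex weights
        ideal = weights @ model.prototypes.detach()        # stays inside the hull
        zw = model.mlp(winner_embed).detach()              # shared map is frozen
        zl = model.mlp(loser_embed).detach()
        margin = -((zw - ideal) ** 2).sum(-1) + ((zl - ideal) ** 2).sum(-1)
        loss = F.binary_cross_entropy_with_logits(margin, torch.ones_like(margin))
        opt.zero_grad(); loss.backward(); opt.step()
    return F.softmax(new_logits, dim=-1).detach()          # learned mixture weights
```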