Aligning Crowd Feedback via Distributional Preference Reward Modeling

30 May 2024 | Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu
This paper proposes a Distributional Preference Reward Model (DPRM) to align large language models (LLMs) with diverse human preferences. Traditional reward models rely on labels from a small group of annotators, which can bias the learned reward toward that group. DPRM instead represents crowd preferences as a categorical distribution and applies Bayesian updates to adapt when new or shifted preferences are observed. It also employs an optimal transport (OT) distance to align the reward model's predicted distribution with the crowd preference distribution. The LLM is then fine-tuned with PPO to maximize the expected reward, producing responses that better match population-level preferences. Experiments show that DPRM significantly improves alignment with population preferences, yielding more accurate, unbiased, and contextually appropriate responses. By addressing limited annotator representation and evolving preferences, the method offers a more robust approach to aligning LLMs with diverse human preferences.
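The summary mentions three ingredients: a Bayesian update of a categorical preference distribution, an OT distance that aligns the reward model's prediction with that distribution, and an expected reward used for PPO. The sketch below illustrates how these pieces could fit together; it is not the authors' implementation. The category count, the Dirichlet prior, the unit ground cost between adjacent preference levels, and the per-category reward values are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the components described above:
# (1) Dirichlet-categorical Bayesian update of the crowd preference distribution,
# (2) 1-D optimal-transport (Wasserstein-1) distance between the reward model's
#     predicted categorical distribution and that target,
# (3) expected reward under the predicted distribution, as a scalar for PPO.
import numpy as np

K = 5  # assumed number of ordered preference categories

def bayesian_update(prior_alpha: np.ndarray, vote_counts: np.ndarray) -> np.ndarray:
    """Conjugate Dirichlet-categorical update: add observed annotator vote
    counts to the prior pseudo-counts and return the posterior mean."""
    posterior_alpha = prior_alpha + vote_counts
    return posterior_alpha / posterior_alpha.sum()

def ot_distance_1d(p: np.ndarray, q: np.ndarray) -> float:
    """Wasserstein-1 distance between two categorical distributions on an
    ordered 1-D support with unit spacing: the L1 distance of their CDFs."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def expected_reward(pred_dist: np.ndarray, category_rewards: np.ndarray) -> float:
    """Expectation of per-category reward values under the predicted
    preference distribution; usable as the scalar reward in PPO."""
    return float(pred_dist @ category_rewards)

if __name__ == "__main__":
    prior_alpha = np.ones(K)                      # uninformative prior (assumption)
    vote_counts = np.array([1, 3, 10, 25, 11])    # hypothetical crowd votes per category
    target = bayesian_update(prior_alpha, vote_counts)

    pred = np.array([0.05, 0.10, 0.30, 0.35, 0.20])   # reward model's predicted distribution
    loss = ot_distance_1d(pred, target)                # alignment loss to minimize
    reward = expected_reward(pred, np.linspace(-2.0, 2.0, K))  # illustrative reward values
    print(f"target={np.round(target, 3)}  OT loss={loss:.4f}  expected reward={reward:.3f}")
```

For ordered categories on a 1-D support with unit ground cost, the OT distance reduces to the cumulative-distribution difference used here, which keeps the sketch dependency-free; a general cost matrix would require a full OT solver.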
Understanding Aligning Crowd Feedback via Distributional Preference Reward Modeling