30 May 2024 | Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu
The paper introduces the Distributional Preference Reward Model (DPRM) to align Large Language Models (LLMs) with diverse human preferences. Traditional reward modeling relies heavily on human annotations, which can lead to skewed models that do not represent the broader population's expectations. DPRM addresses this issue by characterizing multiple preferences using a categorical distribution and incorporating a Bayesian updater to handle shifted or new preferences. The model uses an optimal transport (OT) distance to calibrate the reward, ensuring more accurate alignment with the population's preference distribution. The expected reward is then used to fine-tune the LLM policy, generating responses favored by the population. Experiments show that DPRM significantly enhances the alignment of LLMs with population preferences, resulting in more accurate, unbiased, and contextually appropriate responses. The paper also includes theoretical analysis and empirical results to validate the effectiveness of DPRM.
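To make the described pipeline concrete, below is a minimal, hypothetical Python sketch (not the paper's implementation) of the three ingredients the summary mentions: a categorical preference distribution updated by a Bayesian step, an optimal transport distance used as a calibration term, and the expected reward that would feed policy fine-tuning. The label set, support values, and function names are illustrative assumptions.

```python
# Illustrative sketch only -- not the paper's code. Assumes a categorical
# preference distribution over k discrete preference labels and uses a 1-D
# Wasserstein (optimal transport) distance as the calibration term.
import numpy as np
from scipy.stats import wasserstein_distance

def bayesian_update(prior: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    """Posterior over preference labels after observing new annotations."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

def ot_calibration_loss(predicted: np.ndarray, target: np.ndarray,
                        support: np.ndarray) -> float:
    """OT distance between the reward model's predicted distribution and the
    target population distribution (hypothetical calibration penalty)."""
    return wasserstein_distance(support, support,
                                u_weights=predicted, v_weights=target)

def expected_reward(pref_probs: np.ndarray, label_rewards: np.ndarray) -> float:
    """Expected reward of a response under a categorical preference distribution."""
    return float(np.dot(pref_probs, label_rewards))

# Example with 3 preference labels, e.g. {prefer A, tie, prefer B}.
support = np.array([0.0, 1.0, 2.0])     # reward values assigned to each label
prior = np.array([0.5, 0.3, 0.2])       # current population preference distribution
likelihood = np.array([0.2, 0.3, 0.5])  # evidence from newly observed annotations
posterior = bayesian_update(prior, likelihood)

predicted = np.array([0.4, 0.4, 0.2])   # reward model's predicted distribution
print("calibration loss:", ot_calibration_loss(predicted, posterior, support))
print("expected reward :", expected_reward(posterior, support))
```

In this toy setup, the Bayesian update shifts the preference distribution toward newly observed annotations, the OT loss measures how far the reward model's predicted distribution is from that target, and the expected reward collapses the calibrated distribution into the scalar signal used for fine-tuning.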