Group Robust Preference Optimization in Reward-free RLHF


May 31, 2024 | Shyam Sundhar Ramesh, Yifan Hu, Jason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
This paper introduces Group Robust Preference Optimization (GRPO), a method for aligning large language models (LLMs) with diverse group preferences in reward-free reinforcement learning from human feedback (RLHF). Traditional RLHF approaches assume a single preference model, which may not be robust to the distinct characteristics of different groups. GRPO addresses this by optimizing for worst-case group performance, ensuring equitable alignment across all groups.

The method builds on reward-free direct preference optimization and adaptively weights the importance of different groups, prioritizing those with the worst cumulative loss. Theoretical analysis shows that GRPO is feasible and converges for the log-linear policy class. Empirical results demonstrate that GRPO significantly improves performance for the worst-performing groups, reduces loss imbalances, and increases preference accuracy compared to non-robust baselines.

The paper also explores extensions of GRPO, including Group Robust Identity Preference Optimization (GR-IPO), and evaluates the approach on synthetic and real-world datasets. GRPO outperforms existing methods in handling group disparities and achieving robust alignment, and it is particularly effective when group preferences differ significantly, ensuring that minority groups are not disadvantaged. The method applies to a range of settings, including alignment with diverse user preferences and task-specific domains. Overall, GRPO provides a more robust and equitable way to align LLMs with human preferences across groups.
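To make the adaptive-weighting idea concrete, below is a minimal sketch (not the authors' implementation) of a group-robust, DPO-style training step in PyTorch: per-example preference losses are aggregated per group, an exponentiated-gradient update shifts weight toward the currently worst-performing groups, and the policy is then optimized against the weighted group loss. Function and parameter names such as `dpo_loss`, `group_robust_step`, `eta_w`, and `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO-style logistic loss on implicit reward margins (one value per pair)."""
    margins = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margins)

def group_robust_step(per_pair_losses, group_ids, group_weights, eta_w=0.01):
    """One group-robust update: re-weight groups by loss, return the weighted policy loss.

    per_pair_losses: tensor of per-example losses (e.g., from dpo_loss)
    group_ids:       tensor of integer group labels, one per example
    group_weights:   current simplex weights over groups (detached from the graph)
    """
    num_groups = group_weights.shape[0]
    # Average loss within each group present in the batch (zero if a group is absent).
    group_losses = torch.stack([
        per_pair_losses[group_ids == g].mean() if (group_ids == g).any()
        else torch.zeros((), device=per_pair_losses.device)
        for g in range(num_groups)
    ])
    # Exponentiated-gradient (mirror-ascent) step: worse-performing groups gain weight.
    new_weights = group_weights * torch.exp(eta_w * group_losses.detach())
    new_weights = new_weights / new_weights.sum()
    # The policy minimizes the adversarially weighted group loss.
    robust_loss = (new_weights * group_losses).sum()
    return robust_loss, new_weights
```

In a training loop, `group_weights` would be initialized uniformly over groups and carried across batches, so the weights track running (cumulative) group losses, matching the paper's emphasis on prioritizing groups with worse cumulative loss.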