This paper addresses the challenge of learning from noisy preference data when aligning language models with human preferences. It introduces a robust Direct Preference Optimization (rDPO) algorithm designed to handle random flips in preference labels. The authors propose a novel loss function that de-biases the effect of this noise, so that the policy trained by minimizing it is robust to noisy preferences. The theoretical analysis shows that the sub-optimality gap of the rDPO policy relative to the optimal policy is of order \(O\left(\frac{1}{1-2\epsilon} \sqrt{\frac{d}{n}}\right)\), where \(\epsilon\) is the flip rate, \(d\) is the policy parameter dimension, and \(n\) is the dataset size. Empirical experiments on IMDb sentiment generation and Anthropic's Helpful-Harmless dataset demonstrate that rDPO is more robust to noise in preference labels than vanilla DPO and other heuristics. The paper also discusses how the approach generalizes to other preference optimization methods (e.g., SLiC and IPO) and other preference models (e.g., probit and Plackett-Luce).
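
To make the de-biasing idea concrete, below is a minimal sketch of what such a noise-robust DPO loss could look like, assuming the standard unbiased-estimator correction for random label flips with a known flip rate \(\epsilon < 0.5\). The function names, the \(\beta\) temperature, and the exact formulation are illustrative assumptions for this sketch, not the paper's own code.

```python
import torch
import torch.nn.functional as F


def dpo_logits(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO preference logit: beta times the difference of the
    policy-vs-reference log-ratios for the chosen and rejected responses."""
    return beta * ((policy_chosen_logps - ref_chosen_logps)
                   - (policy_rejected_logps - ref_rejected_logps))


def robust_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    eps=0.1, beta=0.1):
    """De-biased DPO loss under random preference flips with rate eps < 0.5.

    Combines the loss under the observed label and the loss under the flipped
    label so that, in expectation over random flips, the result matches the
    clean (noise-free) DPO loss.
    """
    logits = dpo_logits(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta)
    loss_observed = -F.logsigmoid(logits)    # loss if the observed label is correct
    loss_flipped = -F.logsigmoid(-logits)    # loss if the label was flipped
    return ((1 - eps) * loss_observed - eps * loss_flipped) / (1 - 2 * eps)
```

The division by \(1 - 2\epsilon\) is what removes the bias introduced by the flips, which is consistent with the \(\frac{1}{1-2\epsilon}\) factor appearing in the stated sub-optimality bound; at \(\epsilon = 0\) the expression reduces to the vanilla DPO loss.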