Noise Contrastive Alignment of Language Models with Explicit Rewards

3 Jul 2024 | Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, Jun Zhu
The paper introduces a general framework for aligning language models (LMs) using Noise Contrastive Estimation (NCE) to handle reward datasets explicitly annotated with scalar evaluations. The framework comprises two parallel algorithms, NCA and InfoNCA, which allow an LM policy to be extracted directly from both reward and preference data. InfoNCA subsumes Direct Preference Optimization (DPO) as a special case in the pairwise-preference setting, thereby integrating and extending existing alignment theories. The paper further shows that DPO and InfoNCA exhibit a decreasing data-likelihood trend because they only adjust the relative likelihood across different responses, whereas NCA optimizes the absolute likelihood of each response, preventing the likelihood of the chosen response from decreasing. Experiments on Mistral-7B and 8×7B models show that InfoNCA/NCA outperform various preference baselines when reward datasets are available and significantly outperform DPO on complex reasoning tasks such as math and coding. The main contributions are bridging the theoretical gap between DPO and classic contrastive learning, demonstrating the value of suboptimal responses, and addressing the data-likelihood decline issue in DPO.
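To make the relative-versus-absolute distinction concrete, below is a minimal PyTorch sketch of reward-conditioned contrastive losses in the spirit of InfoNCA and NCA. It assumes each prompt comes with K sampled responses, their scalar rewards, and summed token log-probabilities under both the policy and a frozen reference model; the function names, the hyperparameters `beta` and `alpha`, and the exact loss forms are illustrative reconstructions, not quoted from the paper.

```python
import torch
import torch.nn.functional as F

def infonca_loss(policy_logps, ref_logps, rewards, beta=0.1, alpha=1.0):
    """InfoNCA-style loss (illustrative sketch).

    policy_logps, ref_logps, rewards: tensors of shape [batch, K],
    where each prompt has K responses annotated with scalar rewards.
    """
    # Implicit reward: scaled log-ratio between policy and reference model.
    logits = beta * (policy_logps - ref_logps)            # [B, K]
    # Soft target distribution over the K responses, induced by the rewards.
    targets = F.softmax(rewards / alpha, dim=-1)          # [B, K]
    # Cross-entropy between the reward-induced targets and the model-induced
    # distribution: only *relative* likelihoods across the K responses matter.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

def nca_loss(policy_logps, ref_logps, rewards, beta=0.1, alpha=1.0):
    """NCA-style loss (illustrative sketch).

    Each response's *absolute* log-ratio enters through its own sigmoid
    term, so raising the chosen response cannot be achieved merely by
    lowering all responses together.
    """
    logits = beta * (policy_logps - ref_logps)            # [B, K]
    targets = F.softmax(rewards / alpha, dim=-1)          # [B, K]
    pos = (targets * F.logsigmoid(logits)).sum(-1)        # pull up, weighted by reward
    neg = F.logsigmoid(-logits).mean(-1)                  # push down each response uniformly
    return -(pos + neg).mean()
```

The structural difference is visible in the code: InfoNCA normalizes the K implicit rewards against each other with a softmax over responses, so the loss is indifferent to a uniform drop in all likelihoods, while NCA's per-response sigmoid terms anchor each likelihood ratio individually.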