3 Jul 2024 | Huayu Chen¹,², Guande He¹,², Lifan Yuan¹, Ganqu Cui¹, Hang Su¹,²,³, Jun Zhu¹,²,³*
This paper introduces two novel frameworks for aligning language models (LMs) with explicit rewards and preference data: InfoNCA and NCA. The proposed methods leverage Noise Contrastive Estimation (NCE) to directly extract LM policies from reward and preference data, addressing limitations of existing alignment methods such as Direct Preference Optimization (DPO). InfoNCA is shown to be a natural extension of DPO, subsuming it as a special case under pairwise preference settings. It is derived from Information Noise Contrastive Estimation (InfoNCE), a well-established contrastive learning method, and yields a multi-category classification objective over the rewarded responses to each instruction. NCA, in contrast, addresses the decreasing response likelihood observed with InfoNCA by optimizing absolute rather than relative likelihoods, which leads to better performance on complex reasoning tasks such as math and coding. The methods are evaluated on reward and preference datasets using Mistral-7B and Mixtral-8×7B models, demonstrating that InfoNCA and NCA outperform preference-based baselines when reward data is available. The results also show that NCA is more robust to hyperparameter changes and better at preventing the decrease in chosen-response likelihood, making it particularly effective for reasoning tasks. Finally, the paper lays out the theoretical foundations of the proposed methods, showing that they align with established contrastive learning frameworks and provide guarantees of convergence to the optimal LM policy.
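To make the "multi-category classification objective" concrete, below is a minimal sketch of what such a loss could look like over K rewarded responses per instruction. It assumes the policy's implicit reward is parameterized as β·log(π_θ/π_ref), as in DPO; the function name, the temperature α on the rewards, and the exact normalization are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def infonca_style_loss(policy_logps: torch.Tensor,
                       ref_logps: torch.Tensor,
                       rewards: torch.Tensor,
                       alpha: float = 1.0,
                       beta: float = 0.1) -> torch.Tensor:
    """Illustrative multi-category classification loss over K rewarded responses.

    Args:
        policy_logps: [B, K] summed token log-probs of each response under the policy.
        ref_logps:    [B, K] the same quantities under the frozen reference model.
        rewards:      [B, K] explicit scalar rewards for each response.
    """
    # Soft target distribution over the K responses, induced by the rewards.
    targets = F.softmax(rewards / alpha, dim=-1)
    # Implicit reward of the policy, DPO-style: beta * log(pi_theta / pi_ref).
    implicit_rewards = beta * (policy_logps - ref_logps)
    # Cross-entropy between the reward-induced and policy-induced distributions.
    log_model_probs = F.log_softmax(implicit_rewards, dim=-1)
    return -(targets * log_model_probs).sum(dim=-1).mean()
```

Because the softmax normalizes across the K responses, only relative likelihoods are constrained; an NCA-style variant would instead apply per-response terms (e.g. sigmoids on each implicit reward) so that the absolute likelihood of good responses is pushed up and that of bad ones pushed down, which is the property the summary credits for avoiding the drop in chosen-response likelihood.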