Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

15 Aug 2024 | Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
The paper introduces **Direct Large Language Model Alignment (DLMA)**, a method for aligning large language models (LLMs) with human expectations without relying on human-annotated preference data. The key contributions are:

1. **Evaluation of response preferences**: The authors evaluate the preference between a pair of responses using their output probabilities under a pair of contrastive prompts. This probability-based evaluation is more accurate than text-generation-based evaluation and works well on LLaMA2-7B and LLaMA2-13B.
2. **DLMA method**: DLMA is a three-step process (see the sketch after this list):
   - **Preference data generation**: Use contrastive prompts to generate response pairs.
   - **Rescoring with self-rewarding**: Compute a self-rewarding score by comparing the output probabilities of the response pair under the contrastive prompts.
   - **Self-rewarding DPO**: Use the self-rewarding score to align the LLM through a revised direct preference optimization (DPO) objective.
3. **Experimental results**: DLMA outperforms existing baselines, including RLHF, on the PKU-SafeRLHF, HH-Harmless, and HH-Helpful datasets without requiring human-annotated preference data. It maintains text-generation quality and compares favorably with models trained on human-annotated data.
4. **Limitations and ethical considerations**: The method is validated only on models of a certain scale and does not evaluate preference data from other sources. The ethical discussion emphasizes reducing harmful outputs while preserving text quality.

Overall, the paper provides a comprehensive evaluation of DLMA, demonstrating its effectiveness and potential for aligning LLMs with human expectations across these settings.
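To make the rescoring and training steps concrete, here is a minimal Python sketch of the core idea: score a response pair by contrasting its likelihood under a "positive" and a "negative" system prompt, then use that score as a margin in a DPO-style loss. The model name, prompt wording, and clipping bounds below are illustrative placeholders, not taken from the paper, and the exact scaling of the margin in the authors' objective may differ.

```python
# Sketch of DLMA-style self-rewarding, assuming a HuggingFace causal LM.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Sum of log-probabilities of the response tokens given the prompt.
    (Splitting at the prompt token count is an approximation when the
    tokenizer merges tokens across the prompt/response boundary.)"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.shape[1] - 1                      # first response token
    return token_logps[:, start:].sum().item()


def self_reward(question: str, resp_a: str, resp_b: str) -> float:
    """Positive score means resp_a is preferred over resp_b under the
    contrastive (positive vs. negative) prompt pair. Prompt texts are
    illustrative, not the paper's exact prompts."""
    pos = f"You are a helpful and harmless assistant.\nQuestion: {question}\nAnswer: "
    neg = f"You are an unhelpful and harmful assistant.\nQuestion: {question}\nAnswer: "
    contrast_a = response_logprob(pos, resp_a) - response_logprob(neg, resp_a)
    contrast_b = response_logprob(pos, resp_b) - response_logprob(neg, resp_b)
    return contrast_a - contrast_b


def self_rewarding_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                            ref_chosen_logp, ref_rejected_logp,
                            self_reward_score, beta=0.1, clip=4.0):
    """DPO-style objective where the clipped self-rewarding score acts as a
    margin; the authors' exact clipping/scaling may differ."""
    logits = (policy_chosen_logp - ref_chosen_logp) - \
             (policy_rejected_logp - ref_rejected_logp)
    margin = torch.clamp(self_reward_score, min=0.0, max=clip)
    return -F.logsigmoid(beta * logits - margin).mean()
```

In the full pipeline the same model also generates the response pairs under the contrastive prompts (step 1) before rescoring them as above, so no external reward model or human labels are needed.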