Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

15 Aug 2024 | Aiwei Liu, Haoping Bai, Zhiyuan Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
This paper proposes Direct Large Model Alignment (DLMA), a method for aligning large language models (LLMs) with human expectations without relying on human-annotated preference data. DLMA uses contrastive prompts to generate preference data and a self-rewarding score to evaluate the quality of the generated responses. The self-rewarding score is computed by comparing the probabilities of a response under the two contrasting prompts, and it is then used in a modified direct preference optimization (DPO) objective to align the LLM.

DLMA consists of three main steps: (1) generating response pairs with contrastive prompts, (2) evaluating the generated pairs with the self-rewarding score, and (3) applying the DPO algorithm, augmented with the self-rewarding score, to align the LLM.

The method is evaluated on three benchmark datasets: PKU-SafeRLHF, HH-Harmless, and HH-Helpful. DLMA outperforms existing baselines, including RLHF, in win rate on all three. The alignment process also does not degrade generation quality: the DLMA model and the baseline models achieve similar perplexity scores.

DLMA is particularly effective when responses are judged on attributes such as harmlessness and helpfulness. The self-rewarding score accurately reflects preference relationships, enabling the LLM to learn and internalize the desired behavior through the DPO objective, and the approach works across different LLMs, including Mistral-7B and Falcon-7B. Overall, DLMA is a cost-effective and efficient way to align LLMs with human expectations without human-annotated preference data.
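To make the self-rewarding score concrete, the sketch below compares the log-probability of a response under a positive (aligned) contrastive prompt against its log-probability under a negative one, using a Hugging Face causal LM. The prompt templates, model name, and helper names are illustrative assumptions, not the paper's exact templates or code.

```python
# Minimal sketch of a self-rewarding score: log P(response | positive prompt, query)
# minus log P(response | negative prompt, query). Assumes a Hugging Face causal LM;
# prompts and model choice are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16,
                                             device_map="auto")
model.eval()

# Illustrative contrastive system prompts (the paper's templates may differ).
POSITIVE_PREFIX = "You are a helpful and harmless assistant.\n"
NEGATIVE_PREFIX = "You are an unhelpful and harmful assistant.\n"

@torch.no_grad()
def response_logprob(prefix: str, query: str, response: str) -> float:
    """Sum of log-probabilities of the response tokens given prefix + query."""
    context_ids = tokenizer(prefix + query, return_tensors="pt").input_ids
    full_ids = tokenizer(prefix + query + response,
                         return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits                       # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = full_ids[:, 1:]                             # token t is predicted at t-1
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Approximate response boundary: assumes the context tokenization is a
    # prefix of the full-sequence tokenization.
    response_start = context_ids.shape[1] - 1
    return token_logps[:, response_start:].sum().item()

def self_rewarding_score(query: str, response: str) -> float:
    """Positive score: the response is more probable under the aligned prompt."""
    return (response_logprob(POSITIVE_PREFIX, query, response)
            - response_logprob(NEGATIVE_PREFIX, query, response))
```

In DLMA, response pairs generated under the contrastive prompts are ranked by a score of this kind to build preference data automatically, with no human annotation.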
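Step (3) then folds the self-rewarding score into a DPO-style objective. The sketch below shows one plausible way to do this, using the clipped score as a margin on the usual DPO log-ratio term; the hyperparameters (`beta`, the clipping bound) and the exact way the score enters the loss are assumptions for illustration and may differ from the paper's formulation.

```python
# A DPO-style loss with the (clipped) self-rewarding score as a per-pair margin.
# Hyperparameters and the margin placement are illustrative assumptions.
import torch
import torch.nn.functional as F

def dlma_style_dpo_loss(policy_chosen_logps: torch.Tensor,
                        policy_rejected_logps: torch.Tensor,
                        ref_chosen_logps: torch.Tensor,
                        ref_rejected_logps: torch.Tensor,
                        self_reward_scores: torch.Tensor,
                        beta: float = 0.1,
                        clip: float = 2.0) -> torch.Tensor:
    """DPO logits shifted by the clipped self-rewarding score, so pairs the model
    itself rates as clearly better are pushed toward a larger preference gap."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    margin = self_reward_scores.clamp(min=-clip, max=clip)
    logits = beta * (policy_logratio - ref_logratio) - margin
    return -F.logsigmoid(logits).mean()

# Toy usage with random per-example log-probabilities.
batch = 4
loss = dlma_style_dpo_loss(torch.randn(batch), torch.randn(batch),
                           torch.randn(batch), torch.randn(batch),
                           self_reward_scores=torch.randn(batch))
```

Because the margin is bounded by the clipping term, noisy self-rewarding scores cannot dominate the objective, while confidently ranked pairs still receive a stronger training signal.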