6 Jun 2024 | Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim
The paper "Aligning Large Language Models with Self-generated Preference Data" by Dongyoung Kim, Kimin Lee, Jinwoo Shin, and Jaehyung Kim addresses the challenge of aligning large language models (LLMs) with human preferences, which is crucial for achieving state-of-the-art performance. However, constructing a large human-annotated preference dataset is costly. To tackle this issue, the authors propose a new framework called Self-generated Preference data (SELFEE), which uses minimal human-annotated preference data to improve LLM alignment.
SELFEE leverages human prior knowledge within a small seed dataset and progressively enhances LLM alignment through iterative generation and learning from self-annotated preference data. The key contributions of SELFEE, each sketched in code after the list, include:
1. Deriving preference labels from LLM logits to explicitly extract the model's inherent preference.
2. Introducing a confidence-based refinement of preference labels to reduce noise in preference learning.
3. Using a linearly extrapolated prediction between the current and reference models to approximate the predictions of a more strongly aligned model, enhancing noise identification.
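To make contributions 1 and 2 concrete, here is a minimal sketch of how a judge LLM's final-position logits over two option tokens (e.g., "(A)" vs. "(B)") could be turned into a soft preference label, and how low-confidence labels could then be filtered out. The function names, the option-token setup, and the 0.75 threshold are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def preference_from_logits(judge_logits: torch.Tensor,
                           tok_a: int, tok_b: int) -> float:
    """Contribution 1 (sketch): read P(response A preferred) directly from
    the judge model's final-position logits over the two option tokens."""
    pair = torch.stack([judge_logits[tok_a], judge_logits[tok_b]])
    return F.softmax(pair, dim=0)[0].item()

def refine_by_confidence(pairs, probs, threshold: float = 0.75):
    """Contribution 2 (sketch): keep only pairs whose label confidence
    max(p, 1 - p) clears a threshold, reducing noise in preference learning."""
    kept = []
    for (resp_a, resp_b), p in zip(pairs, probs):
        if max(p, 1.0 - p) < threshold:
            continue  # label too ambiguous to trust as a training signal
        # order each kept pair as (chosen, rejected)
        kept.append((resp_a, resp_b) if p >= 0.5 else (resp_b, resp_a))
    return kept
```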
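Contribution 3 can be sketched in the same spirit: extrapolating linearly from the reference model's predictions toward the current model's approximates a more strongly aligned judge, and disagreement between the current and extrapolated labels flags likely noise. The coefficient `alpha` and the use of logits here are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def extrapolated_logits(current: torch.Tensor,
                        reference: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Contribution 3 (sketch): move past the current model along the
    reference -> current direction to mimic a more strongly aligned model:
        extrapolated = current + alpha * (current - reference)."""
    return current + alpha * (current - reference)

def label_flipped(p_current: float, p_extrapolated: float) -> bool:
    """Flag a pair as likely label noise when the preferred response
    changes under the extrapolated prediction."""
    return (p_current >= 0.5) != (p_extrapolated >= 0.5)
```

Pairs whose labels flip under extrapolation could then be dropped or down-weighted before the next round of preference learning.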
The authors demonstrate SELFEE's effectiveness on the AlpacaEval 2.0 benchmark: using only 3.3% of the ground-truth preference labels in the UltraFeedback dataset, it outperforms both training on the entire dataset and state-of-the-art baselines. They also show that SELFEE improves alignment across various LLMs, and even works without any initial human preference data.
The paper includes a detailed experimental setup, evaluation results, and ablation studies validating SELFEE. The results consistently show that SELFEE outperforms other preference-judgment methods and baseline approaches in both win rate and length-controlled win rate. Additionally, SELFEE is effective across different models and improves overall LLM capabilities.