6 Jun 2024 | Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim
This paper proposes SELFEE, a framework that improves the alignment of large language models (LLMs) using only a small amount of human-labeled preference data. The key idea is to iteratively generate preference data and refine it through self-annotation: the human prior captured in a small seed set is spread across successive rounds in which the model generates responses and learns from its own preference labels. Rather than relying on an external reward model or implicit in-context judgments, the method derives preference labels directly from the LLM's logits, explicitly extracting the model's inherent preferences. A noise-aware preference learning algorithm further mitigates the risk posed by low-quality self-generated labels. Experiments show that SELFEE substantially improves alignment, achieving a 16.4% increase in AlpacaEval 2.0 win rate while using only 3.3% of the ground-truth preference labels in the UltraFeedback data, and it outperforms existing preference labeling methods such as LLM-as-judge and PairRM across multiple benchmarks. The approach generalizes across different LLMs, remains effective even without initial human preference data, and reduces the cost of preference data construction while maintaining strong alignment performance, making it a practical and scalable recipe for real-world applications.
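To make the logit-based labeling idea concrete, here is a minimal sketch (not the authors' code) of how a preference label can be read off an LLM's own logits: the model is prompted to judge two responses, and the logits of the candidate answer tokens ("A" vs. "B") are compared to yield a soft preference probability. The model name, prompt template, and answer-token choice are illustrative assumptions, not details from the paper.

```python
# Sketch: deriving a preference label directly from an LLM's logits.
# Assumptions (not from the paper): model choice, judge prompt wording, "A"/"B" answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def preference_from_logits(instruction: str, response_a: str, response_b: str) -> float:
    """Return a soft probability that response A is preferred over response B,
    taken from the model's next-token logits rather than from sampled text."""
    judge_prompt = (
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is better? Answer with a single letter (A or B). Answer:"
    )
    inputs = tokenizer(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token position
    # Token ids of the candidate answers (leading-space variants may be needed
    # depending on the tokenizer; this is a simplification).
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    probs = torch.softmax(torch.stack([logits[id_a], logits[id_b]]).float(), dim=-1)
    return probs[0].item()

# Usage: pick the chosen/rejected pair for preference learning; the soft
# probability itself could also serve as a confidence signal when filtering
# or down-weighting noisy self-annotated pairs.
p_a = preference_from_logits("Explain overfitting.", "Response text 1", "Response text 2")
chosen, rejected = ("A", "B") if p_a >= 0.5 else ("B", "A")
```

Keeping the label as a probability rather than a hard choice is one natural way such a pipeline could feed a noise-aware objective (e.g., down-weighting low-confidence pairs); the paper's specific noise-aware algorithm is not reproduced here.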