January 17, 2024 | Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu
This paper presents ISARA, an algorithm for self-aligning large language models (LLMs) from a limited number of samples. ISARA uses retrieval-augmented in-context learning (ICL) to let an LLM align itself iteratively, without human involvement: it first retrieves high-quality samples related to the target domain, uses them as in-context examples to generate further samples, and then fine-tunes the LLM on the self-generated data, repeating this loop to continuously improve alignment. Because ISARA relies on neither human-crafted instructions nor labeled rewards, human involvement is significantly reduced. The algorithm is evaluated on three benchmarks covering safety, truthfulness, and instruction-following. ISARA remains effective across LLM sizes and transfers to new domains without redesigning principles or retraining reward models; it also scales efficiently with data, achieving high scaling ratios, and balances utility with safety, producing informative content while minimizing harmful output. Across these evaluations, ISARA outperforms existing self-alignment methods in alignment performance, domain adaptability, and scalability.
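The retrieve-generate-fine-tune cycle described above can be sketched in a few lines. The following is a minimal, hypothetical Python sketch, not the paper's implementation: the function names (retrieve_similar, generate_with_icl, fine_tune), the toy string-similarity retriever, and the growing sample pool are all illustrative stand-ins under assumed interfaces.

```python
"""Hypothetical sketch of an ISARA-style self-alignment loop.
All components below are toy stand-ins, not the paper's code."""

from difflib import SequenceMatcher


def retrieve_similar(pool, query, top_k):
    # Toy retriever: rank pooled samples by string similarity to a
    # target-domain description. A real system would use embeddings.
    scored = sorted(
        pool,
        key=lambda s: SequenceMatcher(None, s, query).ratio(),
        reverse=True,
    )
    return scored[:top_k]


def generate_with_icl(model, exemplars, n=8):
    # Stand-in generator: prompt the model with the retrieved samples
    # as in-context examples and sample n new candidates.
    prompt = "\n\n".join(exemplars)
    return [model(prompt) for _ in range(n)]


def fine_tune(model, samples):
    # Stand-in fine-tuning step: a real system would run supervised
    # fine-tuning of the LLM on the self-generated samples.
    return model


def isara_self_align(model, seed_pool, target_domain, iterations=3, top_k=4):
    pool = list(seed_pool)
    for _ in range(iterations):
        exemplars = retrieve_similar(pool, target_domain, top_k)  # step 1: retrieve
        new_samples = generate_with_icl(model, exemplars)         # step 2: ICL generation
        model = fine_tune(model, new_samples)                     # step 3: fine-tune
        pool.extend(new_samples)                                  # richer pool next round
    return model
```

In a real instantiation, the retriever and fine-tuning step would be replaced with an embedding-based search and an actual LLM training loop; the sketch only illustrates how retrieval, ICL generation, and fine-tuning chain together across iterations without human-crafted instructions or labeled rewards.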