Towards Scalable Automated Alignment of LLMs: A Survey

17 Jul 2024 | Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, Bowen Yu
This paper presents a survey on scalable automated alignment of large language models (LLMs). Alignment is critical for ensuring that LLMs behave in line with human values, but traditional alignment methods that rely on human annotation are increasingly inadequate as rapidly developing LLMs surpass human capabilities in many areas.

The survey explores four major categories of automated alignment methods: inductive bias, behavior imitation, model feedback, and environment feedback, discussing each in terms of its mechanisms, current status, and potential for future development. Inductive bias uses assumptions and constraints to guide models toward desired behaviors without additional training signals. Behavior imitation aligns a target model by mimicking the behavior of a well-aligned model, for example by using the well-aligned model to generate instruction-response pairs and then training the target model on them. Model feedback uses feedback from other models to guide the alignment of the target model. Environment feedback automatically obtains alignment signals from the environment, such as social interactions or public opinion.

The survey also examines the underlying mechanisms that enable automated alignment and the essential factors that make it feasible and effective. It concludes that automated alignment has the potential to address the core challenge posed by the rapid development of LLMs: settings where human annotation is either infeasible or extremely expensive. The most crucial part of automated alignment is finding a scalable alignment signal that can replace manually created human preference signals and remain effective amid the rapid development of LLMs.
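The behavior-imitation pipeline described above can be sketched in a few lines. This is a minimal illustrative toy, not the survey's method or any real library's API: `teacher_respond` stands in for a well-aligned teacher model, and the "student" simply memorizes the generated pairs where a real pipeline would run supervised fine-tuning.

```python
# Hedged sketch of behavior imitation: a well-aligned teacher labels
# instructions with responses, and the target model is trained on the
# resulting pairs. All names here are hypothetical stand-ins.

def teacher_respond(instruction: str) -> str:
    """Stand-in for a well-aligned teacher model (e.g., an API-served LLM)."""
    canned = {
        "Summarize: LLM alignment": "Alignment steers LLM behavior toward human values.",
        "Define: inductive bias": "Built-in assumptions that guide a model without extra training signals.",
    }
    return canned.get(instruction, "I need more context to answer helpfully.")

def build_imitation_dataset(instructions):
    """Step 1: use the teacher to generate instruction-response pairs."""
    return [(inst, teacher_respond(inst)) for inst in instructions]

def train_on_pairs(pairs):
    """Step 2: imitation learning on the pairs. A toy student that
    memorizes the mapping stands in for gradient-based fine-tuning."""
    return dict(pairs)

instructions = ["Summarize: LLM alignment", "Define: inductive bias"]
dataset = build_imitation_dataset(instructions)
student = train_on_pairs(dataset)
```

The key property is that the alignment signal (the responses) comes from a model rather than from human annotators, which is what makes the approach scalable.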
The survey categorizes the rapidly developing automated alignment methods according to the mechanisms used to construct different alignment signals, summarizes the current developments in each direction, and discusses the developmental trajectory and potential future directions.
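One common way model feedback is turned into an alignment signal is best-of-n selection: a reward model scores candidate responses and the highest-scoring one is kept for training. The sketch below is purely illustrative, assuming a toy heuristic reward in place of a learned reward model.

```python
# Hedged sketch of model feedback via best-of-n selection. The reward
# heuristic is a hypothetical stand-in for a trained reward model.

def toy_reward_model(instruction: str, response: str) -> float:
    """Stand-in reward model: prefers non-empty responses that share
    vocabulary with the instruction and are reasonably concise."""
    if not response:
        return 0.0
    overlap = len(set(instruction.lower().split()) & set(response.lower().split()))
    brevity = 1.0 / (1 + abs(len(response.split()) - 15))
    return overlap + brevity

def best_of_n(instruction: str, candidates):
    """Keep the candidate the reward model scores highest; the winning
    pair can then serve as a training signal for the target model."""
    return max(candidates, key=lambda r: toy_reward_model(instruction, r))

candidates = [
    "",
    "Alignment ensures model behavior matches human values and intent.",
    "Bananas are yellow.",
]
choice = best_of_n("Explain what alignment means for models", candidates)
```

In practice the reward model is itself trained (on human or model preferences), so the quality of the selected data is bounded by the quality of that feedback model, which is one of the scalability questions the survey discusses.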