InverseRLignment: Large Language Model Alignment from Demonstrations through Inverse Reinforcement Learning

2024 | Hao Sun, Mihaela van der Schaar
The paper introduces Alignment from Demonstrations (AFD), an approach that aligns Large Language Models (LLMs) using high-quality demonstration data rather than the preference datasets required by existing methods. AFD sidesteps challenges such as noisy preference labels, high annotation costs, privacy concerns, and the inductive biases needed for reward modeling. The problem is formalized within a sequential decision-making framework that makes explicit the missing reward signals in LLM alignment. The paper derives divergence minimization objectives for AFD, studies them both analytically and empirically, and introduces an efficient algorithm that extrapolates over a tailored reward model. Experiments on the Harmless and Helpful tasks from the Anthropic HH-RLHF dataset show that AFD outperforms preference-based baselines in alignment quality, establishing it as a viable and efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for improving the safety and reliability of LLMs.
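To make the divergence-minimization view concrete, the sketch below (not the authors' code) shows the commonly used identity that minimizing the forward KL divergence between the demonstration distribution and the policy, estimated on demonstration samples, reduces to the standard maximum-likelihood / supervised fine-tuning loss on the demonstrated responses. The model name, prompt, and demonstration text are illustrative placeholders, and any causal LM from the Hugging Face Transformers library could stand in.

```python
# Minimal sketch: forward-KL alignment from demonstrations.
# KL(p_demo || pi_theta), estimated on sampled demonstrations, equals the
# negative log-likelihood of the demonstration under the policy up to a
# constant (the entropy of p_demo), i.e. the usual SFT objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def forward_kl_loss(prompt: str, demonstration: str) -> torch.Tensor:
    """Mean token-level negative log-likelihood of the demonstrated
    response under the policy, scoring only the response tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask out the prompt tokens
    out = model(full_ids, labels=labels)
    return out.loss

loss = forward_kl_loss(
    "Human: How do I stay safe online?\nAssistant: ",
    "Use strong, unique passwords and enable two-factor authentication.",
)
loss.backward()  # one gradient step toward the demonstration distribution
```

This is only one side of the picture sketched in the summary: the reverse-KL (mode-seeking) direction and the reward-model extrapolation step described in the paper require sampling from the policy and a learned reward signal, and are not shown here.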