DeAL: Decoding-time Alignment for Large Language Models

21 Feb 2024 | James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth
**Abstract:** Large Language Models (LLMs) are expected to generate content aligned with human preferences. Current methods focus on alignment during model training, using techniques such as Reinforcement Learning with Human Feedback (RLHF). However, these methods have limitations, including the inability to incorporate multiple custom rewards and their reliance on a model developer's view of universal, static principles. Moreover, residual gaps remain after model training, and the reliability of such approaches at generation time is questionable. To address these issues, we propose DeAL, a framework that lets users customize reward functions and enables Decoding-time ALignment of LLMs. At its core, DeAL views decoding as a heuristic-guided search process, which accommodates a wide range of alignment objectives. Experiments with programmatic constraints and abstract objectives show that DeAL improves adherence to alignment objectives, addresses residual gaps, and enables fine-grained trade-offs. While DeAL can be effectively paired with RLHF and prompting techniques, its generality makes decoding slower, an optimization left for future work.

**Introduction:** Auto-regressive LLMs, such as GPT, PaLM, and Llama, can perform a wide range of natural language processing tasks without extensive task-specific fine-tuning. However, aligning these models to specific objectives or principles remains a challenge. Current approaches rely on human-labeled preference data during fine-tuning, but they have limitations: they cannot handle non-universal, custom alignment objectives; they require fine-tuning and maintenance of custom models; and they cannot guarantee alignment at generation time.

**Method:** DeAL frames text generation as a search problem in which the LLM acts as the search agent. The search problem is defined by a state space, an action set, a transition function, and a reward function. DeAL uses an A*-style search algorithm that incorporates prompting techniques and alignment heuristics. The framework supports start-state adaptation and heuristic-guided action selection, enabling the use of both programmatically verifiable constraints and abstract alignment objectives. A minimal sketch of such heuristic-guided decoding is given below.
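To make the search framing concrete, here is a minimal Python sketch of heuristic-guided (A*-style) beam decoding in the spirit of what the summary describes. It is not the authors' implementation: `next_token_logprobs` and `alignment_score` are hypothetical placeholders standing in for the base LLM's next-token distribution and a user-supplied alignment heuristic, and all parameter names are illustrative.

```python
from typing import Callable, List, Tuple


def heuristic_guided_beam_search(
    prompt: List[str],
    next_token_logprobs: Callable[[List[str]], List[Tuple[str, float]]],  # placeholder for the base LLM
    alignment_score: Callable[[List[str]], float],  # user-supplied alignment heuristic in [0, 1]
    beam_width: int = 4,
    expand_k: int = 8,
    alignment_weight: float = 1.0,
    max_steps: int = 64,
    eos_token: str = "</s>",
) -> List[str]:
    """Rank partial generations by accumulated LM log-probability plus a
    weighted alignment heuristic, in the spirit of A*-guided decoding."""
    # Each beam entry stores (accumulated LM log-prob, token sequence).
    beams: List[Tuple[float, List[str]]] = [(0.0, list(prompt))]
    for _ in range(max_steps):
        candidates: List[Tuple[float, float, List[str]]] = []
        for lm_score, seq in beams:
            if seq and seq[-1] == eos_token:
                # Finished hypotheses are kept and re-ranked alongside the rest.
                candidates.append(
                    (lm_score, lm_score + alignment_weight * alignment_score(seq), seq)
                )
                continue
            # Expand the k most likely continuations proposed by the language model.
            top_k = sorted(next_token_logprobs(seq), key=lambda x: -x[1])[:expand_k]
            for token, logprob in top_k:
                new_seq = seq + [token]
                g = lm_score + logprob                           # path cost so far (LM log-prob)
                h = alignment_weight * alignment_score(new_seq)  # heuristic alignment estimate
                candidates.append((g, g + h, new_seq))
        # Keep the beam_width hypotheses with the best combined score g + h.
        candidates.sort(key=lambda c: -c[1])
        beams = [(g, seq) for g, _, seq in candidates[:beam_width]]
        if all(seq[-1] == eos_token for _, seq in beams):
            break
    return beams[0][1]
```

In this reading, start-state adaptation amounts to editing `prompt` before search begins (e.g., inserting instructions or in-context examples), while action selection is the re-ranking of candidate continuations by the combined score `g + h`.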
**Experiments:** Experiments demonstrate that DeAL improves adherence to alignment objectives without degrading task performance, showing better keyword coverage, length satisfaction, and harmlessness and helpfulness across tasks. DeAL also complements existing alignment techniques, such as system prompts and fine-tuning on preference data, and offers a more effective and flexible solution in security scenarios where existing approaches can easily be bypassed. An illustrative constraint heuristic of the kind used for keyword and length objectives is sketched at the end of this summary.

**Conclusion:** DeAL provides a framework for aligning LLMs to diverse objectives at decoding time, improving adherence to alignment objectives and addressing limitations of current training-time methods. It can be used in conjunction with existing alignment techniques and offers significant benefits in security contexts.
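As a closing illustration of the programmatically verifiable constraints mentioned above, the sketch below shows one way a keyword-coverage and length heuristic could be written so that it plugs into the `alignment_score` slot of the earlier search sketch. The function name and length window are illustrative, not taken from the paper.

```python
from typing import List


def keyword_length_score(tokens: List[str], keywords: List[str],
                         min_len: int = 10, max_len: int = 60) -> float:
    """Illustrative alignment heuristic: fraction of required keywords already
    present in the draft, discounted when the draft falls outside a target
    length window. Returns a value in [0, 1]."""
    text = " ".join(tokens).lower()
    coverage = sum(kw.lower() in text for kw in keywords) / max(len(keywords), 1)
    length_factor = 1.0 if min_len <= len(tokens) <= max_len else 0.5
    return coverage * length_factor
```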