**InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance**
**Authors:** Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu
**Institution:** School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
**Abstract:**
The rapid development of large language models (LLMs) has led to their widespread use in various applications. However, ensuring these models align with human values and intentions is crucial. Current alignment methods, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), focus on training-time alignment and are often complex and resource-intensive. To address this, InferAligner is introduced, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. InferAligner leverages safety steering vectors (SSVs) extracted from safety-aligned models to modify the activations of the target model when responding to harmful inputs, guiding it to provide harmless responses. Experimental results demonstrate that InferAligner effectively enhances the safety of domain-specific models in finance, medicine, and mathematics, as well as multimodal large language models (MLLMs) like LLaVA, while maintaining almost unchanged performance in downstream tasks. The method significantly reduces the Attack Success Rate (ASR) of harmful instructions and jailbreak attacks.
**Contributions:**
- Proposes InferAligner, a novel inference-time alignment method that enhances model safety without affecting downstream performance.
- Simple to use and effective even without aligned models.
- First to explore harmlessness alignment for MLLMs and introduces MM-Harmful Bench, a dedicated dataset for safety research.
**Related Work:**
- Categorizes LLM alignment into training-time and inference-time alignment methods.
- Discusses safety concerns and activation engineering techniques.
**Method:**
- Extracts safety-related vectors (SRVs) from aligned models to guide the target model's activations during inference.
- Uses a guidance gate to control the activation shift based on the intent of the input.
- Shifts activations using SSVs and the guidance gate to guide the target model to respond safely.
**Experimental Setup:**
- Uses datasets for SRVs, domain-specific fine-tuning, and security evaluation.
- Evaluates harmfulness and utility metrics.
**Results:**
- Shows InferAligner outperforms baselines in reducing ASR and maintaining utility.
- Demonstrates effectiveness on multimodal models like LLaVA.
**Analysis:**
- Ablation study and scalability experiments validate InferAligner's robustness.
- Discusses the importance of using SSVs from aligned models.
- Explains InferAligner's adaptability to different models and series.
**Conclusion:**
InferAligner is a highly effective inference-time alignment method for harmlessness, significantly reducing ASR while maintaining downstream performance**InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance**
**Authors:** Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu
**Institution:** School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
**Abstract:**
The rapid development of large language models (LLMs) has led to their widespread use in various applications. However, ensuring these models align with human values and intentions is crucial. Current alignment methods, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), focus on training-time alignment and are often complex and resource-intensive. To address this, InferAligner is introduced, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. InferAligner leverages safety steering vectors (SSVs) extracted from safety-aligned models to modify the activations of the target model when responding to harmful inputs, guiding it to provide harmless responses. Experimental results demonstrate that InferAligner effectively enhances the safety of domain-specific models in finance, medicine, and mathematics, as well as multimodal large language models (MLLMs) like LLaVA, while maintaining almost unchanged performance in downstream tasks. The method significantly reduces the Attack Success Rate (ASR) of harmful instructions and jailbreak attacks.
**Contributions:**
- Proposes InferAligner, a novel inference-time alignment method that enhances model safety without affecting downstream performance.
- Simple to use and effective even without aligned models.
- First to explore harmlessness alignment for MLLMs and introduces MM-Harmful Bench, a dedicated dataset for safety research.
**Related Work:**
- Categorizes LLM alignment into training-time and inference-time alignment methods.
- Discusses safety concerns and activation engineering techniques.
**Method:**
- Extracts safety-related vectors (SRVs) from aligned models to guide the target model's activations during inference.
- Uses a guidance gate to control the activation shift based on the intent of the input.
- Shifts activations using SSVs and the guidance gate to guide the target model to respond safely.
**Experimental Setup:**
- Uses datasets for SRVs, domain-specific fine-tuning, and security evaluation.
- Evaluates harmfulness and utility metrics.
**Results:**
- Shows InferAligner outperforms baselines in reducing ASR and maintaining utility.
- Demonstrates effectiveness on multimodal models like LLaVA.
**Analysis:**
- Ablation study and scalability experiments validate InferAligner's robustness.
- Discusses the importance of using SSVs from aligned models.
- Explains InferAligner's adaptability to different models and series.
**Conclusion:**
InferAligner is a highly effective inference-time alignment method for harmlessness, significantly reducing ASR while maintaining downstream performance