Spectral Editing of Activations for Large Language Model Alignment

25 May 2024 | Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen
This paper introduces Spectral Editing of Activations (SEA), a training-free, inference-time editing method for improving the alignment of large language models (LLMs) with human preferences. SEA projects the model's internal representations into directions that have maximal covariance with positive demonstrations (e.g., truthful answers) while minimizing covariance with negative demonstrations (e.g., hallucinated answers). The editing projections are found via spectral decomposition of the covariance between the model's activations on neutral inputs and its activations on the positive and negative demonstrations; at inference time, these projections are applied to the activations to steer the model's predictions. The method also extends to non-linear editing: activations are mapped into a non-linear feature space with a feature function, edited there, and mapped back to the original space.

Experiments on six open-source LLMs show that SEA outperforms existing methods on truthfulness and bias-reduction benchmarks while remaining efficient at inference time. Editing with SEA has limited negative impact on other model capabilities such as commonsense and mathematical reasoning. The method is also data-efficient, achieving significant improvements from as few as 25 demonstrations, generalizes across different LLMs, and is not limited by context length. Together, these results suggest that SEA is a promising approach for aligning LLMs with human preferences.
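To make the linear editing step concrete, below is a minimal NumPy sketch based on the description above. It is not the authors' implementation: the function names, the activation centering, the cumulative-spectral-mass `keep_ratio` threshold, and the averaging of the two projected activations are all illustrative assumptions.

```python
import numpy as np

def spectral_editing_projections(H_neutral, H_pos, H_neg, keep_ratio=0.99):
    """Sketch of linear SEA: derive editing projections from the spectral
    decomposition of the cross-covariance between neutral activations and
    positive / negative demonstration activations.

    H_neutral, H_pos, H_neg: (n, d) activation matrices collected from
    paired demonstrations. keep_ratio is an assumed hyperparameter
    controlling how much spectral mass to keep (positive) or remove
    (negative).
    """
    # Centre the activations before computing covariances.
    Hn = H_neutral - H_neutral.mean(axis=0)
    Hp = H_pos - H_pos.mean(axis=0)
    Hg = H_neg - H_neg.mean(axis=0)

    # Cross-covariance of neutral activations with positive / negative ones.
    cov_pos = Hn.T @ Hp / len(Hn)
    cov_neg = Hn.T @ Hg / len(Hn)

    # Spectral (singular value) decomposition of each covariance matrix.
    Up, sp, _ = np.linalg.svd(cov_pos)
    Un, sn, _ = np.linalg.svd(cov_neg)

    # Keep the directions carrying most covariance with positive behaviour...
    k_pos = int(np.searchsorted(np.cumsum(sp) / sp.sum(), keep_ratio)) + 1
    P_pos = Up[:, :k_pos] @ Up[:, :k_pos].T

    # ...and project out the directions most aligned with negative behaviour.
    k_neg = int(np.searchsorted(np.cumsum(sn) / sn.sum(), keep_ratio)) + 1
    P_neg = np.eye(Hn.shape[1]) - Un[:, :k_neg] @ Un[:, :k_neg].T

    return P_pos, P_neg

def edit_activation(h, P_pos, P_neg):
    # At inference time, apply both projections to the activation vector h
    # and combine the results (simple averaging assumed here).
    return 0.5 * (P_pos @ h + P_neg @ h)
```

The non-linear variant described in the paper would wrap this same procedure in a feature function: activations are first mapped into a non-linear feature space, the spectral projections are computed and applied there, and the edited features are mapped back to the original activation space.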