Spectral Editing of Activations for Large Language Model Alignment

25 May 2024 | Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen
The paper introduces Spectral Editing of Activations (SEA), a novel training-free method to edit the internal representations of large language models (LLMs) during inference. SEA aims to improve the truthfulness and fairness of LLMs by projecting input representations towards directions that maximize covariance with positive demonstrations (e.g., truthful content) while minimizing covariance with negative demonstrations (e.g., hallucinated content). The method uses spectral decomposition to find the optimal editing projections, which can be applied linearly or non-linearly using invertible feature functions. Extensive experiments on various benchmarks and LLMs of different sizes and architectures demonstrate the effectiveness of SEA in improving truthfulness and fairness while maintaining high inference efficiency and data efficiency. The method also shows limited negative impacts on other model capabilities, such as commonsense reasoning and mathematical ability.
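The core idea (projections built from a spectral decomposition of cross-covariances between neutral activations and positive/negative demonstrations) can be sketched in a few lines of NumPy. This is a simplified, linear-only illustration under assumed toy data, not the authors' released implementation; all variable names (`H`, `H_pos`, `H_neg`, `P_plus`, `P_minus`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: rows are examples, columns are hidden dimensions.
# H holds activations on neutral prompts; H_pos / H_neg hold activations on
# positive (e.g., truthful) and negative (e.g., hallucinated) demonstrations.
d = 8
H = rng.normal(size=(100, d))
H_pos = H + rng.normal(scale=0.1, size=(100, d))  # correlated with H
H_neg = rng.normal(size=(100, d))                 # mostly uncorrelated

# Cross-covariance between neutral and positive/negative activations.
omega_pos = H.T @ H_pos / len(H)
omega_neg = H.T @ H_neg / len(H)

# Spectral (singular value) decomposition of each cross-covariance.
U_pos, s_pos, _ = np.linalg.svd(omega_pos)
U_neg, s_neg, _ = np.linalg.svd(omega_neg)

# Keep the left singular directions carrying the most positive covariance,
# and remove the directions carrying the most negative covariance.
k = 4
keep = U_pos[:, :k]                  # high positive-covariance subspace
drop = U_neg[:, :k]                  # high negative-covariance subspace

P_plus = keep @ keep.T               # project onto the positive subspace
P_minus = np.eye(d) - drop @ drop.T  # project away from the negative one

# Editing a new activation at inference time: apply both projections.
h = rng.normal(size=d)
h_edited = P_minus @ (P_plus @ h)
```

The resulting `h_edited` retains the components aligned with the positive demonstrations while being exactly orthogonal to the removed negative directions; the paper's non-linear variant applies the same construction after mapping activations through an invertible feature function.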