26 May 2024 | Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, and Amnon Shashua
The paper "Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering" by Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, and Amnon Shashua explores the trade-offs between alignment and helpfulness in language models (LMs) when using representation engineering. Representation engineering is a method that alters the model's behavior by changing its representations post-training, which has been shown to improve alignment in tasks such as resistance to adversarial attacks and reduction of social biases. However, it also reduces the model's ability to perform basic tasks.
The authors propose a theoretical framework that bounds the increase in alignment and the decrease in helpfulness caused by representation engineering. Under this framework, alignment can be guaranteed with representation engineering, but at the cost of helpfulness: helpfulness decreases quadratically with the norm of the representation engineering vector, while alignment increases linearly with it. This asymmetry implies a regime in which representation engineering is cost-effective.
Empirical results support these predictions: alignment improves roughly linearly with the norm of the representation engineering vector, while helpfulness degrades quadratically. In particular, for small norms the gain in alignment initially outpaces the loss in helpfulness, identifying a regime in which representation engineering is worthwhile.
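As a toy numerical illustration of that small-norm regime (with made-up coefficients rather than the paper's measured values), comparing a linear alignment gain against a quadratic helpfulness loss shows that the net benefit grows at first and peaks at a finite norm:

```python
# Toy illustration: linear alignment gain vs. quadratic helpfulness loss.
# The coefficients a and b are assumed for illustration only.
import numpy as np

a = 1.0   # assumed coefficient of the linear alignment gain
b = 0.2   # assumed coefficient of the quadratic helpfulness loss

norms = np.linspace(0.0, 10.0, 1001)      # candidate steering-vector norms
net_benefit = a * norms - b * norms**2    # gain minus loss at each norm

best_norm = norms[np.argmax(net_benefit)]
print(f"net benefit peaks at norm ≈ {best_norm:.2f} "
      f"(analytically a / (2b) = {a / (2 * b):.2f})")
```

Beyond that peak the quadratic helpfulness loss dominates, which is the intuition behind the bounded cost-effective regime described above.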
The paper concludes by discussing the implications of these findings for designing safe and useful LLM systems, emphasizing the importance of understanding the trade-offs between alignment and helpfulness. The code for the experiments is available at <https://github.com/dorin133/REPE_alignment_helpfulness_tradeoff>.