2 Feb 2024 | Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, Tobias Hecking
This research explores methods to steer the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text generation. The study demonstrates that style vectors can be computed from recorded layer activations for input texts written in a target style, in contrast to more complex training-based approaches. Experiments show that activation engineering with style vectors influences the style of generated text in a nuanced and parameterizable way, setting it apart from prompt engineering. The research aims to bridge the gap between LLM capabilities and the nuanced requirements of human-AI interaction by extending control over LLM outputs.
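To make the mechanism concrete, here is a minimal sketch of activation steering with a PyTorch forward hook. It is not code from the paper: the model, the steered layer, the value of λ, and the randomly initialized `style_vector` are illustrative placeholders (a real style vector would be precomputed as described below).

```python
# Minimal sketch of activation steering: a style vector is added to the
# hidden states of one transformer layer at every generation step.
# Model, layer index, lambda, and the style vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx = 8                 # which hidden layer to steer (a hyperparameter)
lam = 4.0                     # weighting parameter lambda
style_vector = torch.randn(model.config.hidden_size)  # placeholder; normally precomputed

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size); broadcasting adds the vector
    # to every token position.
    return (output[0] + lam * style_vector,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
prompt = tokenizer("The weather today is", return_tensors="pt").input_ids
generated = model.generate(prompt, max_new_tokens=30)
handle.remove()  # detach the hook so later generations are unsteered
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```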
Large language models (LLMs) have marked significant advances in natural language processing, with models like GPT-2, GPT-3, and GPT-4 becoming influential in text generation. These models encode extensive public knowledge and respond to a wide range of text prompts, often in a way that resembles human communication. However, control over their output is typically limited to the lexical level, and more sophisticated control over affective and emotional aspects is needed for effective human-AI interaction. Prompt engineering is a promising approach but is highly task-specific and requires manual crafting of prompts. This paper builds on previous work by steering LLMs through modification of their internal states using style vectors.
The research investigates two main approaches to computing style vectors: training-based and activation-based. Training-based style vectors are steering vectors learned in a dedicated training process, while activation-based style vectors are derived directly from the hidden-layer activations produced by style-specific input texts. The study compares these methods in terms of their ability to encode style information and steer the model's output, with experiments conducted on datasets labeled with sentiment, emotion, and writing-style categories. The results show that activation-based style vectors are the more effective of the two at steering LLM output, providing smoother transitions and more nuanced control than prompt engineering.
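A rough sketch of the activation-based computation, under the assumption that the style vector is the difference of mean activations between texts of the target style and texts of a contrasting style; the tiny corpora, the layer choice, and the use of the last-token activation here are illustrative choices, not the paper's exact recipe:

```python
# Sketch of activation-based style vectors: record hidden-layer activations
# for style-labelled input texts, then take the difference of the means.
# The corpora, layer index, and last-token pooling are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

layer_idx = 8  # same layer that will later be steered

def mean_activation(texts):
    """Average last-token activation at layer_idx over a list of texts."""
    vecs = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            hidden = model(ids, output_hidden_states=True).hidden_states
            vecs.append(hidden[layer_idx][0, -1, :])  # last-token activation
    return torch.stack(vecs).mean(dim=0)

positive_texts = ["What a wonderful day!", "I love this movie."]   # toy corpus
negative_texts = ["This is terrible.", "I hated every minute."]

# Target style minus contrast style yields a direction in activation space.
style_vector = mean_activation(positive_texts) - mean_activation(negative_texts)
```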
The research concludes that activation-based style vectors are preferable for steering LLMs because of their performance and resource efficiency. The method enables continuous, adjustable modulation of LLM outputs, offering smooth transitions and the potential to generate new styles. The study also highlights the importance of determining the exact influence of the weighting parameter λ: it allows for nuanced style steering but leads to nonsensical outputs if chosen too large. The findings suggest that combining activation-based steering with prompt engineering can enhance the overall capability and flexibility of LLMs.
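The role of λ can be illustrated with a simple sweep over increasing weights; again a hedged sketch, with an illustrative layer and grid of λ values. Small weights should leave the text essentially unchanged, moderate ones should shift its style, and very large ones are expected to degrade fluency:

```python
# Sketch of a lambda sweep: moderate weights shift the style gradually,
# overly large ones tend to produce nonsensical text. Values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
style_vector = torch.randn(model.config.hidden_size)  # placeholder style vector
layer = model.transformer.h[8]
prompt = tokenizer("The food at this restaurant was", return_tensors="pt").input_ids

for lam in (0.0, 2.0, 4.0, 8.0, 16.0):
    def hook(module, inputs, output, lam=lam):
        return (output[0] + lam * style_vector,) + output[1:]
    handle = layer.register_forward_hook(hook)
    out = model.generate(prompt, max_new_tokens=20, do_sample=False)
    handle.remove()
    print(f"lambda={lam:>4}: {tokenizer.decode(out[0], skip_special_tokens=True)}")
```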