SelfIE: Self-Interpretation of Large Language Model Embeddings

26 Mar 2024 | Haozhe Chen, Carl Vondrick, Chengzhi Mao
SelfIE is a framework that enables large language models (LLMs) to interpret their own embeddings in natural language. By leveraging an LLM's ability to answer questions about a given passage, SelfIE interprets open-world concepts in hidden embeddings, revealing internal reasoning processes in scenarios such as ethical decisions, prompt injection, and the recall of harmful knowledge.

SelfIE interprets a hidden embedding by inserting it into a separate forward pass of the same LLM, which is then asked to describe the injected "passage". Because interpretation reuses the model's own language ability, it requires no additional training. The framework's key advantage is its ability to interpret high-level, open-world concepts in embeddings, which makes it compatible with current and future language models.
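The interpretation mechanism can be sketched concretely. The code below is a minimal illustration assuming a LLaMA-style checkpoint from Hugging Face transformers; the placeholder token, injection layer, and interpretation prompt are illustrative assumptions rather than the paper's exact configuration.

```python
# A sketch of SelfIE-style interpretation: extract a hidden embedding in one
# forward pass, then overwrite a placeholder token's hidden state with it in a
# second "interpretation" forward pass. Model choice, placeholder scheme, and
# prompt wording are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
for p in model.parameters():
    p.requires_grad_(False)  # the model itself is never trained

INTERP_PROMPT = "[INST] _\nWhat concept does the above passage describe? [/INST]"
PLACEHOLDER = tok("_", add_special_tokens=False).input_ids[-1]

def make_injector(hidden, slot):
    """Forward hook that overwrites the placeholder position's hidden state."""
    def inject(module, args, output):
        hs = output[0]
        if hs.shape[1] > slot:  # prefill pass only, not cached decode steps
            hs[0, slot] = hidden.to(hs.dtype)
        return output
    return inject

@torch.no_grad()
def interpret(prompt: str, layer: int, token_idx: int) -> str:
    # Pass 1: run the original prompt and extract one hidden embedding.
    ids = tok(prompt, return_tensors="pt").to(model.device)
    hidden = model(**ids, output_hidden_states=True).hidden_states[layer][0, token_idx]

    # Pass 2: ask the model to describe a "passage" whose embedding we inject.
    interp_ids = tok(INTERP_PROMPT, return_tensors="pt").to(model.device)
    slot = interp_ids.input_ids[0].tolist().index(PLACEHOLDER)
    handle = model.model.layers[0].register_forward_hook(make_injector(hidden, slot))
    try:
        out = model.generate(**interp_ids, max_new_tokens=40, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0, interp_ids.input_ids.shape[1]:], skip_special_tokens=True)
```

Injecting at an early layer leaves the remaining layers free to "read" the embedding and verbalize what it encodes.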
SelfIE's text descriptions of hidden embeddings also open avenues for lightweight control of model behavior. The framework proposes two control methods, both sketched in code at the end of this summary: Supervised Control, which edits open-ended concepts with minimal gradient computation, and Reinforcement Control, which erases harmful knowledge without requiring supervision targets. Supervised Control supports open-ended editing targets, while Reinforcement Control extends RLHF to the embedding level, specifying the erasure objective with an evaluator LLM for granular control of model reasoning. SelfIE-based control alters the model's open-ended perception of a concept, and that altered perception generalizes to complex reasoning.

Experiments show that SelfIE's interpretations faithfully convey the information in hidden embeddings and reveal internal reasoning procedures in LLMs. SelfIE matches prior supervised approaches at eliciting the LLM's internal representation of world state in TextWorld, and it exposes the reasoning behind complex behaviors, including identifying harmful knowledge, understanding prompt injections, and explaining ethical decisions. These interpretations make it possible to locate and modify individual layers in order to control reasoning behaviors such as erasing harmful knowledge and overriding ethical steering: removing harmful knowledge inside the LLM reduced the success rate of prompt injections eliciting harmful responses by 84.66% while preserving the model's other capabilities, and overriding user ethical steering was effective 95% of the time, yielding fairer responses. Together, these results show that describing hidden embeddings in text enables new, precise modes of control over model reasoning in the latent space and new ways of understanding LLM behavior.
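A minimal sketch of Supervised Control, reusing the setup above (tok, model, INTERP_PROMPT, PLACEHOLDER, make_injector). For simplicity it optimizes the extracted embedding directly so that its interpretation matches a target description; the paper edits model weights instead, so treat this as an illustrative variant. The prompt, step count, and learning rate are assumptions.

```python
# Supervised Control sketch: gradient descent on an injected embedding so that
# its SelfIE interpretation matches a target description (teacher forcing).
import torch.nn.functional as F

def supervised_control(hidden, target: str, steps: int = 50, lr: float = 0.1):
    # Append the target description after the interpretation prompt.
    # Token alignment between prompt and target is approximate in this sketch.
    full = tok(INTERP_PROMPT + " " + target, return_tensors="pt").to(model.device)
    n_prompt = tok(INTERP_PROMPT, return_tensors="pt").input_ids.shape[1]
    slot = full.input_ids[0].tolist().index(PLACEHOLDER)

    h = hidden.detach().float().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        handle = model.model.layers[0].register_forward_hook(make_injector(h, slot))
        try:
            logits = model(**full).logits
        finally:
            handle.remove()
        # Cross-entropy of the target tokens given the injected embedding.
        pred = logits[0, n_prompt - 1 : -1].float()
        gold = full.input_ids[0, n_prompt:]
        loss = F.cross_entropy(pred, gold)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h.detach()  # edited embedding; inject it back to steer the model
```

Because only the single embedding receives gradients and the model weights stay frozen, each step costs little more than one forward and backward pass, in the spirit of the "minimal gradient computation" the paper claims.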
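Reinforcement Control replaces the supervised target with a reward signal. The sketch below, again reusing the setup above, uses the same model as an evaluator that judges whether a sampled interpretation contains harmful knowledge, and applies a REINFORCE-style update to the embedding. The judge prompt, reward values, update rule, and hyperparameters are all assumptions rather than the paper's exact configuration.

```python
# Reinforcement Control sketch: sample interpretations of an embedding, score
# them with an evaluator LLM, and reinforce high-reward interpretations.
@torch.no_grad()
def evaluator_reward(text: str) -> float:
    # Assumed judge prompt: +1 if judged harmless, -1 if judged harmful.
    judge = (
        "[INST] Does the following text contain harmful knowledge? "
        f"Answer Yes or No.\n{text} [/INST]"
    )
    ids = tok(judge, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=3, do_sample=False)
    ans = tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
    return -1.0 if "yes" in ans.lower() else 1.0

def reinforcement_control(hidden, steps: int = 20, lr: float = 0.05):
    ids = tok(INTERP_PROMPT, return_tensors="pt").to(model.device)
    slot = ids.input_ids[0].tolist().index(PLACEHOLDER)
    n = ids.input_ids.shape[1]

    h = hidden.detach().float().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        handle = model.model.layers[0].register_forward_hook(make_injector(h, slot))
        try:
            with torch.no_grad():  # sample an interpretation of the embedding
                sample = model.generate(**ids, max_new_tokens=30, do_sample=True)
            # Re-run with grad to get log-probs of the sampled tokens under h.
            logits = model(sample).logits
        finally:
            handle.remove()
        text = tok.decode(sample[0, n:], skip_special_tokens=True)
        reward = evaluator_reward(text)  # judged with the injection hook removed
        logp = torch.log_softmax(logits[0, n - 1 : -1].float(), dim=-1)
        logp = logp.gather(1, sample[0, n:, None]).sum()
        loss = -reward * logp  # REINFORCE: push toward harmless interpretations
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h.detach()
```

No supervised target is ever specified: the evaluator LLM alone defines the objective, which is what lets this style of control erase harmful knowledge without knowing in advance what the "corrected" embedding should say.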