SelfIE: Self-Interpretation of Large Language Model Embeddings

26 Mar 2024 | Haozhe Chen, Carl Vondrick, Chengzhi Mao
SelfIE is a framework that enables large language models (LLMs) to interpret their own embeddings in natural language. By leveraging an LLM's ability to answer questions about a given passage, SelfIE interprets open-world concepts in hidden embeddings, revealing internal reasoning processes in scenarios such as ethical decisions, prompt injection, and the recall of harmful knowledge.

SelfIE interprets a hidden embedding by inserting it into a separate forward pass of the same LLM, which is then asked to describe the injected "passage". Because interpretation reuses the model's own language ability, it requires no additional training. The framework's key advantage is its ability to interpret high-level, open-world concepts in embeddings, which makes it compatible with current and future language models.
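The interpretation mechanism can be sketched concretely. The code below is a minimal illustration assuming a LLaMA-style checkpoint from Hugging Face transformers; the placeholder token, injection layer, and interpretation prompt are illustrative assumptions rather than the paper's exact configuration.

```python
# A sketch of SelfIE-style interpretation: extract a hidden embedding in one
# forward pass, then overwrite a placeholder token's hidden state with it in a
# second "interpretation" forward pass. Model choice, placeholder scheme, and
# prompt wording are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
for p in model.parameters():
    p.requires_grad_(False)  # the model itself is never trained

INTERP_PROMPT = "[INST] _\nWhat concept does the above passage describe? [/INST]"
PLACEHOLDER = tok("_", add_special_tokens=False).input_ids[-1]

def make_injector(hidden, slot):
    """Forward hook that overwrites the placeholder position's hidden state."""
    def inject(module, args, output):
        hs = output[0]
        if hs.shape[1] > slot:  # prefill pass only, not cached decode steps
            hs[0, slot] = hidden.to(hs.dtype)
        return output
    return inject

@torch.no_grad()
def interpret(prompt: str, layer: int, token_idx: int) -> str:
    # Pass 1: run the original prompt and extract one hidden embedding.
    ids = tok(prompt, return_tensors="pt").to(model.device)
    hidden = model(**ids, output_hidden_states=True).hidden_states[layer][0, token_idx]

    # Pass 2: ask the model to describe a "passage" whose embedding we inject.
    interp_ids = tok(INTERP_PROMPT, return_tensors="pt").to(model.device)
    slot = interp_ids.input_ids[0].tolist().index(PLACEHOLDER)
    handle = model.model.layers[0].register_forward_hook(make_injector(hidden, slot))
    try:
        out = model.generate(**interp_ids, max_new_tokens=40, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0, interp_ids.input_ids.shape[1]:], skip_special_tokens=True)
```

Injecting at an early layer leaves the remaining layers free to "read" the embedding and verbalize what it encodes.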
SelfIE's text descriptions of hidden embeddings also open avenues for lightweight control of model behavior. The framework proposes two control methods, both sketched in code at the end of this summary: Supervised Control, which edits open-ended concepts with minimal gradient computation, and Reinforcement Control, which erases harmful knowledge without requiring supervision targets. Supervised Control supports open-ended editing targets, while Reinforcement Control extends RLHF to the embedding level, specifying the erasure objective with an evaluator LLM for granular control of model reasoning. SelfIE-based control alters the model's open-ended perception of a concept, and that altered perception generalizes to complex reasoning.

Experiments show that SelfIE's interpretations faithfully convey the information in hidden embeddings and reveal internal reasoning procedures in LLMs. SelfIE matches prior supervised approaches at eliciting the LLM's internal representation of world state in TextWorld, and it exposes the reasoning behind complex behaviors, including identifying harmful knowledge, understanding prompt injections, and explaining ethical decisions. These interpretations make it possible to locate and modify individual layers in order to control reasoning behaviors such as erasing harmful knowledge and overriding ethical steering: removing harmful knowledge inside the LLM reduced the success rate of prompt injections eliciting harmful responses by 84.66% while preserving the model's other capabilities, and overriding user ethical steering was effective 95% of the time, yielding fairer responses. Together, these results show that describing hidden embeddings in text enables new, precise modes of control over model reasoning in the latent space and new ways of understanding LLM behavior.
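A minimal sketch of Supervised Control, reusing the setup above (tok, model, INTERP_PROMPT, PLACEHOLDER, make_injector). For simplicity it optimizes the extracted embedding directly so that its interpretation matches a target description; the paper edits model weights instead, so treat this as an illustrative variant. The prompt, step count, and learning rate are assumptions.

```python
# Supervised Control sketch: gradient descent on an injected embedding so that
# its SelfIE interpretation matches a target description (teacher forcing).
import torch.nn.functional as F

def supervised_control(hidden, target: str, steps: int = 50, lr: float = 0.1):
    # Append the target description after the interpretation prompt.
    # Token alignment between prompt and target is approximate in this sketch.
    full = tok(INTERP_PROMPT + " " + target, return_tensors="pt").to(model.device)
    n_prompt = tok(INTERP_PROMPT, return_tensors="pt").input_ids.shape[1]
    slot = full.input_ids[0].tolist().index(PLACEHOLDER)

    h = hidden.detach().float().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        handle = model.model.layers[0].register_forward_hook(make_injector(h, slot))
        try:
            logits = model(**full).logits
        finally:
            handle.remove()
        # Cross-entropy of the target tokens given the injected embedding.
        pred = logits[0, n_prompt - 1 : -1].float()
        gold = full.input_ids[0, n_prompt:]
        loss = F.cross_entropy(pred, gold)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h.detach()  # edited embedding; inject it back to steer the model
```

Because only the single embedding receives gradients and the model weights stay frozen, each step costs little more than one forward and backward pass, in the spirit of the "minimal gradient computation" the paper claims.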
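Reinforcement Control replaces the supervised target with a reward signal. The sketch below, again reusing the setup above, uses the same model as an evaluator that judges whether a sampled interpretation contains harmful knowledge, and applies a REINFORCE-style update to the embedding. The judge prompt, reward values, update rule, and hyperparameters are all assumptions rather than the paper's exact configuration.

```python
# Reinforcement Control sketch: sample interpretations of an embedding, score
# them with an evaluator LLM, and reinforce high-reward interpretations.
@torch.no_grad()
def evaluator_reward(text: str) -> float:
    # Assumed judge prompt: +1 if judged harmless, -1 if judged harmful.
    judge = (
        "[INST] Does the following text contain harmful knowledge? "
        f"Answer Yes or No.\n{text} [/INST]"
    )
    ids = tok(judge, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=3, do_sample=False)
    ans = tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
    return -1.0 if "yes" in ans.lower() else 1.0

def reinforcement_control(hidden, steps: int = 20, lr: float = 0.05):
    ids = tok(INTERP_PROMPT, return_tensors="pt").to(model.device)
    slot = ids.input_ids[0].tolist().index(PLACEHOLDER)
    n = ids.input_ids.shape[1]

    h = hidden.detach().float().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        handle = model.model.layers[0].register_forward_hook(make_injector(h, slot))
        try:
            with torch.no_grad():  # sample an interpretation of the embedding
                sample = model.generate(**ids, max_new_tokens=30, do_sample=True)
            # Re-run with grad to get log-probs of the sampled tokens under h.
            logits = model(sample).logits
        finally:
            handle.remove()
        text = tok.decode(sample[0, n:], skip_special_tokens=True)
        reward = evaluator_reward(text)  # judged with the injection hook removed
        logp = torch.log_softmax(logits[0, n - 1 : -1].float(), dim=-1)
        logp = logp.gather(1, sample[0, n:, None]).sum()
        loss = -reward * logp  # REINFORCE: push toward harmless interpretations
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h.detach()
```

No supervised target is ever specified: the evaluator LLM alone defines the objective, which is what lets this style of control erase harmful knowledge without knowing in advance what the "corrected" embedding should say.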