2024 | Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
The paper introduces Patchscopes, a unifying, modular framework for inspecting the hidden representations of large language models (LLMs). It decodes information from an LLM representation by "patching" it into a separate inference pass whose prompt encourages the extraction of that information. Patchscopes can be configured to query many kinds of information from LLM representations and can be used to answer a wide range of questions about an LLM's computation. The framework is shown to encompass many prior interpretability methods, including those based on projecting representations into the vocabulary space and those that intervene on the LLM computation. It also enables new possibilities, such as using a more capable model to explain the representations of a smaller one and correcting multi-hop reasoning errors.
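The core operation can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under assumptions of my own, not the paper's implementation: it caches a hidden state from a source prompt and overwrites one hidden state in a target prompt's forward pass via a hook. The model choice (gpt2), layer indices, token positions, and the few-shot "describe x" target prompt are all illustrative assumptions; the paper's experiments use much larger LLMs.

```python
# Minimal sketch of the core Patchscopes operation: cache a hidden state from a
# source prompt, then overwrite one hidden state in a target prompt's forward
# pass via a hook. The model (gpt2), layers, positions, and the few-shot
# "describe x" target prompt are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model for illustration; the paper uses much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

source_prompt = "Diana, Princess of Wales"
target_prompt = ("Syria: country in the Middle East, "
                 "Leonardo DiCaprio: American actor, x")  # "x" is the placeholder to patch
source_layer, target_layer = 6, 2   # layer to read from / layer to patch into
source_pos, target_pos = -1, -1     # last source token -> placeholder token

# 1) Source pass: cache the hidden representation of interest.
with torch.no_grad():
    src_out = model(**tok(source_prompt, return_tensors="pt"),
                    output_hidden_states=True)
# hidden_states[0] is the embedding output, hidden_states[k] the output of layer k-1
patched_vector = src_out.hidden_states[source_layer + 1][0, source_pos].clone()

# 2) Target pass: a forward hook overwrites the placeholder's hidden state.
def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:               # patch only the full-prompt pass, not cached steps
        hidden[0, target_pos] = patched_vector
    # in-place edit; returning nothing keeps the (modified) output

handle = model.transformer.h[target_layer].register_forward_hook(patch_hook)
try:
    with torch.no_grad():
        tgt_inputs = tok(target_prompt, return_tensors="pt")
        gen = model.generate(**tgt_inputs, max_new_tokens=10,
                             pad_token_id=tok.eos_token_id)
finally:
    handle.remove()

print(tok.decode(gen[0][tgt_inputs["input_ids"].shape[1]:]))
```

The generated continuation then serves as a natural-language readout of whatever the cached vector encodes about the source entity; different target prompts turn the same mechanism into different queries.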
The paper shows that many existing methods, including those that rely on vocabulary projections and computation interventions, can be cast as Patchscopes. Moreover, new configurations of the framework provide more effective tools for addressing the same questions while mitigating several limitations of prior approaches. Patchscopes also makes it possible to address underexplored questions, such as fine-grained analysis of how the input is contextualized and the extent to which a more expressive model can be used to inspect the hidden representations of a smaller one.
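As an illustration of the vocabulary-projection point, the hedged sketch below shows how a logit-lens-style readout can be viewed as a degenerate Patchscope, in which the target pass is simply the model's final layer norm and unembedding applied to an intermediate hidden state. It reuses `model`, `tok`, and `src_out` from the sketch above; the layer index is again an arbitrary choice.

```python
# Logit-lens-style readout as a degenerate Patchscope (GPT-2 module names assumed).
with torch.no_grad():
    layer = 6
    hidden = src_out.hidden_states[layer + 1][0, -1]        # last-token state after `layer`
    logits = model.lm_head(model.transformer.ln_f(hidden))  # project into vocabulary space
    print(tok.decode([logits.argmax(-1).item()]))           # this layer's "next token" guess
```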
The paper presents experiments evaluating the benefits and opportunities introduced by Patchscopes, focusing on auto-regressive LLMs. The results show that Patchscopes outperforms existing methods on several tasks, including decoding next-token predictions and extracting attributes from LLM representations. Patchscopes is also shown to be effective for analyzing how entity names are contextualized in early layers and for leveraging stronger models for inspection via cross-model patching. The paper further demonstrates how Patchscopes can be used to correct multi-hop reasoning errors, particularly when the model can perform each reasoning step correctly in isolation but fails when the steps must be composed in context.
The paper concludes that Patchscopes is a general, modular framework for decoding information from the hidden representations of LLMs. It shows that prominent interpretability methods can be viewed as instances of Patchscopes, and that new configurations yield alternatives that are more expressive, more robust across layers, and free of training-data requirements, mitigating the shortcomings of those methods. In addition, novel configurations open up previously unexplored possibilities, such as stronger inspection techniques, and offer practical benefits, such as correcting multi-hop reasoning errors.