13 Jun 2024 | Jack Merullo, Carsten Eickhoff, Ellie Pavlick
The paper "Talking Heads: Understanding Inter-layer Communication in Transformer Language Models" by Jack Merullo explores the mechanisms by which information is passed from early layers to later layers in transformer language models (LMs). The authors find that this information is represented and routed through low-rank subspaces of the residual stream, forming "communication channels" between layers. By decomposing attention head weight matrices using Singular Value Decomposition (SVD), they identify specific interactions between heads separated by one or more layers, such as inhibition and duplicate detection. These interactions are found to be crucial for the model's sensitivity to the order of items in prompts, particularly in tasks like the Laundry List task, where the model struggles to recall items as the list length increases. The paper demonstrates that manipulating these internal model representations and editing model weights based on the discovered mechanisms can significantly improve performance on synthetic tasks, improving accuracy by over 20%. The findings reveal a complex, content-independent structure learned during pretraining and provide insights into why sophisticated LMs sometimes fail in simple domains, facilitating future research on model interpretability and responsible deployment.The paper "Talking Heads: Understanding Inter-layer Communication in Transformer Language Models" by Jack Merullo explores the mechanisms by which information is passed from early layers to later layers in transformer language models (LMs). The authors find that this information is represented and routed through low-rank subspaces of the residual stream, forming "communication channels" between layers. By decomposing attention head weight matrices using Singular Value Decomposition (SVD), they identify specific interactions between heads separated by one or more layers, such as inhibition and duplicate detection. These interactions are found to be crucial for the model's sensitivity to the order of items in prompts, particularly in tasks like the Laundry List task, where the model struggles to recall items as the list length increases. The paper demonstrates that manipulating these internal model representations and editing model weights based on the discovered mechanisms can significantly improve performance on synthetic tasks, improving accuracy by over 20%. The findings reveal a complex, content-independent structure learned during pretraining and provide insights into why sophisticated LMs sometimes fail in simple domains, facilitating future research on model interpretability and responsible deployment.