13 Oct 2024 | Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà
This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models (LMs), focusing on the generative decoder-only architecture. It presents a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area. The paper introduces the components of a Transformer LM, including the Transformer layer, layer normalization, the attention block, and the feedforward network block. It then discusses the prediction head and Transformer decompositions, highlighting the role of the unembedding matrix in projecting the last residual stream state into logits over the next token. The paper also covers behavior localization, including input attribution, model component attribution, and causal interventions, as well as information decoding techniques such as probing and the linear representation hypothesis. The work emphasizes the importance of understanding the inner workings of LMs for ensuring safety and fairness and for guiding model improvements, and it closes by discussing the limitations of current interpretability methods and proposing new approaches for more accurate and efficient analysis of model behavior.
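The unembedding step mentioned above is just a linear projection followed by a softmax. A minimal sketch with toy dimensions and made-up weights (the names `unembed` and `W_U` and the specific numbers are illustrative, not code from the paper; the formulation is the standard logits-then-softmax one):

```python
import math

def unembed(x, W_U):
    """Project a residual stream state x (size d) through the
    unembedding matrix W_U (d x vocab): one logit per vocabulary token."""
    d, vocab = len(W_U), len(W_U[0])
    return [sum(x[i] * W_U[i][j] for i in range(d)) for j in range(vocab)]

def softmax(logits):
    """Turn logits into a next-token probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: d_model = 2, vocabulary of 3 tokens.
x = [1.0, -0.5]                  # final residual stream state
W_U = [[2.0, 0.0, -1.0],         # unembedding matrix (2 x 3)
       [0.0, 1.0, 1.0]]
probs = softmax(unembed(x, W_U))
```

In a real decoder-only LM the same projection is applied to the residual stream state at the final position (and, in "logit lens"-style analyses, to intermediate states as well).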
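Probing, one of the information decoding techniques listed above, fits a simple classifier on hidden states to test whether a property is linearly decodable from them. A self-contained sketch on synthetic "activations" (the data, the training loop, and the encoding direction are illustrative assumptions, not material from the paper):

```python
import math
import random

def train_linear_probe(states, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe (w, b) on hidden states:
    can a linear readout recover the binary property?"""
    d = len(states[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Synthetic "hidden states": the property is encoded along the
# direction [1, -1], as the linear representation hypothesis
# would predict for a linearly represented feature.
random.seed(0)
states = [[random.gauss(1, 0.3), random.gauss(-1, 0.3)] for _ in range(20)]
states += [[random.gauss(-1, 0.3), random.gauss(1, 0.3)] for _ in range(20)]
labels = [1] * 20 + [0] * 20

w, b = train_linear_probe(states, labels)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
         for x in states]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

High probe accuracy alone does not establish that the model *uses* the property, which is one reason the paper pairs decoding methods like this with causal interventions.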