This survey by Haoyan Luo and Lucia Specia from Imperial College London focuses on the critical yet challenging aspect of explainability in Large Language Models (LLMs). As LLMs become increasingly integral to various applications, their "black-box" nature raises concerns about transparency and ethical use. The authors make the case for greater explainability in LLMs, reviewing both research that explains model behaviour and methodologies that put that understanding to use. The survey concentrates on pre-trained Transformer-based LLMs, such as LLaMA, which pose unique interpretability challenges due to their scale and complexity.
The survey categorizes existing methods into local and global analyses according to their explanatory objectives. Local analyses cover feature attribution and the analysis of individual Transformer blocks, while global analyses encompass probing-based methods and mechanistic interpretability. The authors then explore ways of leveraging explainability, including model editing, controllable generation, and model enhancement, and examine the associated evaluation metrics and datasets, highlighting their advantages and limitations.
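To make the local-analysis category more concrete, the sketch below shows gradient-times-input feature attribution for a causal language model. It is not code from the survey; the small gpt2 model (standing in for a LLaMA-scale LLM), the Hugging Face Transformers and PyTorch usage, and the choice to attribute the model's top next-token logit back to the input tokens are all illustrative assumptions.

```python
# Hypothetical sketch: gradient-x-input feature attribution for a causal LM.
# Not from the survey; model choice and attribution target are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for a LLaMA-scale model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")

# Embed the tokens manually so gradients can be taken w.r.t. the input embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
next_token_logits = outputs.logits[0, -1]

# Attribute the logit of the most likely next token back to each input token.
target_id = next_token_logits.argmax()
next_token_logits[target_id].backward()

# Gradient x input, summed over the embedding dimension: one relevance score per token.
scores = (embeds.grad[0] * embeds[0]).sum(dim=-1)
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores.tolist()):
    print(f"{tok:>12s}  {s:+.4f}")
```

Gradient-times-input is only one of the simpler attribution recipes; other feature-attribution methods differ in how the per-token relevance scores are computed, but share the same goal of scoring each input token's contribution to an output.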
The goal of the survey is to bridge the gap between theoretical and empirical understanding on the one hand and practical implementation on the other, proposing promising avenues for explanatory techniques and their applications in the LLM era. The introduction highlights the urgency of improved explainability to foster trust, improve model performance, and address issues such as hallucinations and inherent biases. The overview section categorizes current explainability approaches and poses research questions for future exploration. The sections on local and global analysis provide a structured review of the respective methods and their applications. The section on leveraging explainability discusses how explainability can be used to debug and improve models, focusing on methods designed with a strong grounding in model explainability. Finally, the evaluation section addresses the need for purpose-built evaluation methods and calibrated datasets to assess explainability methods and their applications in downstream tasks.
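For the global-analysis side, a probing-based method can be sketched in a similarly compact way: a linear classifier is trained on frozen hidden states to test whether a given layer encodes a property of interest. The toy sentences, binary labels, probed layer, and the use of gpt2 with scikit-learn below are hypothetical choices for illustration, not details taken from the survey.

```python
# Hypothetical sketch: a linear probe over frozen hidden states (global analysis).
# Texts, labels, model, and probed layer are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # small stand-in for a larger LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

texts = ["I loved this movie.", "This film was terrible.",
         "An absolute delight.", "A complete waste of time."]
labels = [1, 0, 1, 0]  # toy binary property the probe tries to recover

layer = 6  # probe one intermediate layer
features = []
with torch.no_grad():
    for t in texts:
        enc = tokenizer(t, return_tensors="pt")
        hidden = model(**enc).hidden_states[layer]               # [1, seq_len, dim]
        features.append(hidden.mean(dim=1).squeeze(0).numpy())   # mean-pool over tokens

# High probe accuracy is evidence (not proof) that the layer encodes the property.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```

In practice the probe would be scored on held-out data rather than its training set; the training-set accuracy here is only to keep the sketch short.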