24 Jun 2024 | Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen
Edge Pruning is a method for discovering circuits in language models by pruning the edges between components, rather than the components or neurons themselves. This approach allows for more precise and scalable circuit discovery than previous methods. Edge Pruning uses gradient-based pruning to find circuits that are more faithful to the full model's predictions while using significantly fewer edges. It remains efficient even on large datasets and has been successfully applied to models such as GPT-2 and CodeLlama-13B. Edge Pruning outperforms previous methods in both speed and circuit fidelity, and it can recover ground-truth circuits in models compiled with Tracr.
In a case study, Edge Pruning was used to compare the mechanisms behind instruction prompting and in-context learning in CodeLlama-13B, revealing substantial overlap between the circuits used for the two tasks. The results show that Edge Pruning is a practical and scalable tool for interpretability, shedding light on behaviors that emerge in large models. By finding sparse circuits that match the performance of the full model, the method demonstrates its utility for understanding and analyzing large language models.
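The core idea of gradient-based edge pruning can be sketched as follows: each edge between an upstream and a downstream component gets a learnable mask that interpolates between the clean upstream activation and an ablated value, and the masks are optimized with a sparsity penalty so that unneeded edges are driven to zero. This is a minimal illustration, not the authors' implementation; all names are hypothetical, and the paper's L0 relaxation is replaced here with a plain L1 penalty on soft masks.

```python
import torch

class EdgeMask(torch.nn.Module):
    """Learnable per-edge masks (hypothetical sketch of edge pruning)."""

    def __init__(self, n_edges):
        super().__init__()
        # One logit per edge; sigmoid(logit) is the soft keep-probability.
        self.logits = torch.nn.Parameter(torch.zeros(n_edges))

    def forward(self, clean, ablated):
        # clean, ablated: (n_edges, d_model) upstream activations per edge.
        # Interpolate: z = 1 keeps the edge, z = 0 replaces it with the
        # ablated value (e.g., activations from a corrupted prompt).
        z = torch.sigmoid(self.logits).unsqueeze(-1)
        return z * clean + (1 - z) * ablated

    def sparsity_loss(self):
        # Pushes masks toward 0 (edge removed); an L1 stand-in for the
        # L0 relaxation used in the actual method.
        return torch.sigmoid(self.logits).sum()

# Usage: masks start at 0.5, so outputs begin halfway between clean
# and ablated activations; training trades faithfulness against sparsity.
mask = EdgeMask(n_edges=4)
clean = torch.randn(4, 8)
ablated = torch.zeros(4, 8)
out = mask(clean, ablated)
```

In training, the faithfulness term (matching the full model's predictions) and `sparsity_loss()` would be summed and minimized jointly; after convergence, edges with masks near zero are discarded, leaving the circuit.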