Opening the AI black box: program synthesis via mechanistic interpretability

7 Feb 2024 | Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Keirikhah, Mateja Vukelić, Max Tegmark
MIPS is a novel method for program synthesis based on automated mechanistic interpretability of neural networks. The approach first trains a neural network to perform a task, using automated neural architecture search to find the simplest network that solves it. Auto-simplification techniques then reduce the trained network, and Boolean and integer autoencoders convert it into a finite state machine. Finally, symbolic regression captures the learned algorithm and distills it into Python code. MIPS is highly complementary to GPT-4: it solves 32 of 62 algorithmic tasks, including 13 that GPT-4 cannot solve. The results show that MIPS can effectively distill learned algorithms into interpretable Python code, making machine-learned models more interpretable and trustworthy, and highlight the potential of mechanistic interpretability for making AI systems more transparent and reliable.
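To make the pipeline concrete, here is a minimal, hedged sketch (not the authors' code) of the stages on a toy task: continuous hidden states stand in for a trained network's activations, rounding stands in for the learned integer autoencoder, a transition table plays the role of the extracted finite state machine, and a formula check stands in for symbolic regression. All function names and the fabricated hidden states are hypothetical illustrations.

import numpy as np

# Toy algorithmic task: running parity of a bit stream (a classic FSM task).
def running_parity(bits):
    out, state = [], 0
    for b in bits:
        state ^= b
        out.append(state)
    return out

# Stage 1 stand-in: pretend these 2-D vectors are hidden states from a trained
# RNN. We fabricate states that noisily encode the parity bit, to show what
# the integer-autoencoder step would consume.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=200)
states = np.array(running_parity(bits))
hidden = np.stack([states + 0.05 * rng.normal(size=200),
                   1 - states + 0.05 * rng.normal(size=200)], axis=1)

# Stage 2, "integer autoencoder" stand-in: map continuous hidden vectors to
# small integers. Simple rounding substitutes for the learned encoder.
codes = np.rint(hidden[:, 0]).astype(int)

# Stage 3: read off a finite state machine as a transition table
# (previous code, input bit) -> next code.
table = {}
prev = 0
for b, c in zip(bits, codes):
    table[(prev, int(b))] = int(c)
    prev = int(c)
print(table)  # e.g. {(0, 1): 1, (1, 0): 1, (1, 1): 0, (0, 0): 0}

# Stage 4, symbolic-regression stand-in: verify a candidate formula against
# the table; here the FSM is exactly next_state = state ^ bit.
assert all(nxt == (p ^ b) for (p, b), nxt in table.items())
print("recovered program: state = state ^ bit")

In the actual method, stage 4 would emit the recovered rule as a standalone Python program rather than merely checking a hand-proposed candidate; this sketch only illustrates the shape of each stage.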