Opening the AI black box: program synthesis via mechanistic interpretability

7 Feb 2024 | Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Keirikhah, Mateja Vukelić, Max Tegmark
MIPS is a novel method for program synthesis based on automated mechanistic interpretability of neural networks. The approach first trains a neural network to perform a task, using automated neural architecture search to find the simplest network that solves it. Auto-simplification techniques then reduce the trained network, and Boolean and integer autoencoders convert it into a finite state machine. Finally, symbolic regression captures the learned algorithm and distills it into Python code. MIPS is highly complementary to GPT-4: it solves 32 of 62 algorithmic tasks, including 13 that GPT-4 cannot solve. The results show that MIPS can effectively distill learned algorithms into interpretable Python code, making machine-learned models more interpretable and trustworthy, and highlight the potential of mechanistic interpretability for making AI systems more transparent and reliable.
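To make the pipeline concrete, here is a minimal, hedged sketch (not the authors' code) of the stages on a toy task: continuous hidden states stand in for a trained network's activations, rounding stands in for the learned integer autoencoder, a transition table plays the role of the extracted finite state machine, and a formula check stands in for symbolic regression. All function names and the fabricated hidden states are hypothetical illustrations.

import numpy as np

# Toy algorithmic task: running parity of a bit stream (a classic FSM task).
def running_parity(bits):
    out, state = [], 0
    for b in bits:
        state ^= b
        out.append(state)
    return out

# Stage 1 stand-in: pretend these 2-D vectors are hidden states from a trained
# RNN. We fabricate states that noisily encode the parity bit, to show what
# the integer-autoencoder step would consume.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=200)
states = np.array(running_parity(bits))
hidden = np.stack([states + 0.05 * rng.normal(size=200),
                   1 - states + 0.05 * rng.normal(size=200)], axis=1)

# Stage 2, "integer autoencoder" stand-in: map continuous hidden vectors to
# small integers. Simple rounding substitutes for the learned encoder.
codes = np.rint(hidden[:, 0]).astype(int)

# Stage 3: read off a finite state machine as a transition table
# (previous code, input bit) -> next code.
table = {}
prev = 0
for b, c in zip(bits, codes):
    table[(prev, int(b))] = int(c)
    prev = int(c)
print(table)  # e.g. {(0, 1): 1, (1, 0): 1, (1, 1): 0, (0, 0): 0}

# Stage 4, symbolic-regression stand-in: verify a candidate formula against
# the table; here the FSM is exactly next_state = state ^ bit.
assert all(nxt == (p ^ b) for (p, b), nxt in table.items())
print("recovered program: state = state ^ bit")

In the actual method, stage 4 would emit the recovered rule as a standalone Python program rather than merely checking a hand-proposed candidate; this sketch only illustrates the shape of each stage.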