3 May 2024 | Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F. Henriques, Anthony Hu
LangProp is a framework designed to optimize code generated by large language models (LLMs) in both supervised and reinforcement learning settings. It addresses the issue of sub-optimal initial code generated by LLMs by automatically evaluating the code's performance on a dataset of input-output pairs, catching exceptions, and feeding the results back to the LLM for iterative improvement. The framework adopts a metric- and data-driven training paradigm, allowing for the adaptation of traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning.

LangProp has been tested in various domains, including Sudoku, CartPole, and autonomous driving in CARLA, demonstrating its ability to generate interpretable and transparent policies that can be verified and improved. The framework's internal implementation is independent of specific deep learning frameworks, making it adaptable to a wide range of applications. The paper also discusses the challenges and solutions in training LangProp, including the handling of exceptions, policy reranking, and the integration of different training paradigms. The results show that LangProp can significantly improve code performance, outperforming existing methods in complex tasks such as autonomous driving.
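The core loop described above — score candidate policies on input-output pairs, treat exceptions as failures whose tracebacks become feedback, rerank, and ask the LLM to rewrite the best candidate — can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`evaluate`, `train`) and the `improve` callback standing in for the LLM query are hypothetical.

```python
import traceback

def evaluate(code_str, dataset):
    """Score a candidate policy on (input, output) pairs.

    Executes the candidate source, then measures accuracy on the dataset.
    Any exception yields a score of 0.0 plus the traceback as feedback,
    mirroring LangProp's exception handling.
    """
    namespace = {}
    try:
        exec(code_str, namespace)
        policy = namespace["policy"]
        correct = sum(policy(x) == y for x, y in dataset)
        return correct / len(dataset), None
    except Exception:
        return 0.0, traceback.format_exc()

def train(candidates, dataset, improve, iterations=3):
    """Iteratively rerank candidates and request improved rewrites.

    `improve(code, feedback)` stands in for the LLM call that rewrites a
    candidate given its score or error traceback.
    """
    for _ in range(iterations):
        # Rerank all candidates by their metric on the dataset.
        scored = sorted(
            ((evaluate(c, dataset)[0], c) for c in candidates),
            key=lambda t: t[0],
            reverse=True,
        )
        best_score, best_code = scored[0]
        if best_score == 1.0:
            break
        # Feed the failure information back for the next rewrite.
        _, feedback = evaluate(best_code, dataset)
        candidates.append(improve(best_code, feedback))
    return max(candidates, key=lambda c: evaluate(c, dataset)[0])
```

For example, starting from a wrong policy for a "double the input" dataset, a stub `improve` that returns a corrected policy lets `train` converge to a candidate scoring 1.0 — in the real framework that rewrite would come from the LLM prompted with the score and traceback.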