2024 | Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michael Defferrard, Taco Cohen
The paper introduces Code Iteration (CodeIt), a novel and scalable method for self-improving language models, specifically designed to solve the Abstraction and Reasoning Corpus (ARC) benchmark. CodeIt addresses the challenge of sparse rewards in program synthesis by iteratively sampling programs, relabeling them based on their performance, and learning from prioritized experience replay. The method combines a pre-trained large language model (LLM) with a domain-specific language (DSL) to leverage prior knowledge and data. By incorporating expert iteration (ExIt), CodeIt effectively generalizes between tasks and achieves state-of-the-art performance on the ARC evaluation set, solving 59 out of 400 tasks. The approach demonstrates the ability to refine solutions over time and outperforms both neural and symbolic baselines. Ablations show that hindsight relabeling and prioritized sampling are crucial for improving sample efficiency and preventing catastrophic forgetting. The paper also discusses the limitations and potential improvements, emphasizing the importance of combining prior knowledge and experience for effective learning in sparse-reward settings.
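The loop the summary describes — sample programs, relabel failed attempts by the outputs they actually produced, then train on a prioritized replay buffer — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `sample_program`, `run_program`, and `train` are hypothetical stand-ins for the LLM policy, the DSL interpreter, and the fine-tuning step, and the priority scheme is a simplification.

```python
import heapq


def codeit_style_loop(tasks, sample_program, run_program, train, n_iters=3):
    """Illustrative expert-iteration loop with hindsight relabeling and
    prioritized replay. All callables are hypothetical stand-ins."""
    replay = []   # min-heap of (priority, insertion_order, experience)
    counter = 0
    for _ in range(n_iters):
        # 1. Sampling: propose a program for each task with the current policy.
        for inputs, target_outputs in tasks:
            program = sample_program(inputs, target_outputs)
            realized = [run_program(program, x) for x in inputs]
            # 2. Hindsight relabeling: even when `realized` differs from
            #    `target_outputs`, store the program as a correct solution
            #    for the outputs it actually produced, turning a sparse
            #    reward into usable training data.
            solved = realized == target_outputs
            priority = 0.0 if solved else 1.0  # prefer genuine solutions
            heapq.heappush(replay, (priority, counter, (inputs, realized, program)))
            counter += 1
        # 3. Learning: train on a prioritized sample from the replay buffer,
        #    then return the experiences so they can be reused later
        #    (mitigating catastrophic forgetting).
        batch = [heapq.heappop(replay) for _ in range(min(4, len(replay)))]
        train([experience for _, _, experience in batch])
        for entry in batch:
            heapq.heappush(replay, entry)
    return replay
```

The key design point mirrored here is that relabeling makes every sampled program a positive example for *some* input-output mapping, while the priority ordering keeps genuine task solutions over-represented in the training batches.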