SELF-PLAY WITH EXECUTION FEEDBACK: IMPROVING INSTRUCTION-FOLLOWING CAPABILITIES OF LARGE LANGUAGE MODELS

18 Jul 2024 | Guanting Dong*, Keming Lu, Chengpeng Li*, Tingyu Xia*, Bowen Yu†, Chang Zhou, Jingren Zhou
This paper introduces AUTOIF, a scalable and reliable method for automatically generating instruction-following training data for large language models (LLMs). AUTOIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, corresponding verification code, and unit test samples. Execution feedback-based rejection sampling is then used to generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. The method achieves significant improvements across three training algorithms—SFT, Offline DPO, and Online DPO—when applied to top open-source LLMs, Qwen2 and LLaMA3, in both self-alignment and strong-to-weak distillation settings. The code for AUTOIF is publicly available at <https://github.com/QwenLM/AutoIF>.
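To make the core idea concrete, the sketch below shows how execution feedback-based rejection sampling could work in principle: an LLM-written verification function is first checked against unit-test samples, and only responses that the surviving verifier accepts are kept as training pairs. This is a minimal illustration, not the AutoIF implementation; the names `evaluate`, `run_verification`, and `rejection_sample`, and the toy instruction, are hypothetical.

```python
# Illustrative sketch of execution feedback-based rejection sampling.
# All names and the toy example are hypothetical, not from the AutoIF codebase.
from typing import List, Tuple


def run_verification(verify_src: str, response: str) -> bool:
    """Execute an LLM-generated verification function on a response.

    Returns False if the code raises or the check fails.
    """
    namespace: dict = {}
    try:
        exec(verify_src, namespace)  # defines evaluate(response) -> bool
        return bool(namespace["evaluate"](response))
    except Exception:
        return False


def passes_unit_tests(verify_src: str, cases: List[Tuple[str, bool]]) -> bool:
    """Keep a verifier only if it labels every unit-test sample correctly."""
    return all(run_verification(verify_src, text) == label for text, label in cases)


def rejection_sample(instruction: str,
                     verify_src: str,
                     cases: List[Tuple[str, bool]],
                     candidates: List[str]) -> List[Tuple[str, str]]:
    """Return (instruction, response) pairs whose responses pass the verifier,
    provided the verifier itself survives its unit tests."""
    if not passes_unit_tests(verify_src, cases):
        return []
    return [(instruction, r) for r in candidates if run_verification(verify_src, r)]


if __name__ == "__main__":
    # Toy instruction: "Answer in exactly three words."
    verifier = (
        "def evaluate(response):\n"
        "    return len(response.split()) == 3\n"
    )
    tests = [("one two three", True), ("far too many words here", False)]
    kept = rejection_sample("Answer in exactly three words.",
                            verifier, tests,
                            ["I like cats", "This response is clearly much too long"])
    print(kept)  # accepted pairs can serve as SFT data; rejected ones as DPO negatives
```

In this reading, the verification code acts as an executable quality filter, so data validation reduces to running code rather than relying on human annotation.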