SELF-PLAY WITH EXECUTION FEEDBACK: IMPROVING INSTRUCTION-FOLLOWING CAPABILITIES OF LARGE LANGUAGE MODELS

18 Jul 2024 | Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
AUTOIF is a scalable, reliable method for automatically generating instruction-following training data for large language models (LLMs). It turns response validation into code verification: the LLM generates instructions whose fulfillment can be checked programmatically, writes the corresponding verification functions and unit tests, and then applies execution feedback-based rejection sampling to retain only the responses that pass verification. Because the LLM itself produces the verification functions and unit tests, the pipeline needs no manual annotation and can keep improving instruction following through on-policy learning.

Applied to Qwen2 and LLaMA3 with training algorithms including SFT, Offline DPO, and Online DPO, AUTOIF significantly improves instruction-following performance on benchmarks such as IFEval and FollowBench, achieving high accuracy with gains in both accuracy and generalization, while preserving general abilities such as mathematical reasoning and coding. AUTOIF also enables the creation of a large-scale open-source dataset for complex instruction-following tasks, making it a promising approach for strengthening the instruction-following capabilities of LLMs without relying on manual annotation.
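To make the verification step concrete, below is a minimal sketch of execution feedback-based rejection sampling. The instruction, the verification function, and the unit tests are illustrative assumptions, not taken from the paper's released data; in AUTOIF, both the verification function and its test cases would themselves be generated by the LLM.

```python
# Minimal sketch of AUTOIF-style execution-feedback rejection sampling.
# The instruction, verifier, and test cases below are hypothetical examples.
from typing import Callable, List

# Example instruction: "Answer the question using exactly three sentences."
def check_following(response: str) -> bool:
    """LLM-generated verifier: True if the response has exactly three sentences."""
    normalized = response.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return len(sentences) == 3

# LLM-generated unit tests: (candidate response, expected verdict).
TEST_CASES = [
    ("One. Two. Three.", True),
    ("Only one sentence here.", False),
]

def verifier_passes_tests(check: Callable[[str], bool]) -> bool:
    """Keep a verification function only if it agrees with its own unit tests."""
    return all(check(resp) == expected for resp, expected in TEST_CASES)

def rejection_sample(candidates: List[str], check: Callable[[str], bool]) -> List[str]:
    """Keep only candidate responses whose execution feedback is positive."""
    return [resp for resp in candidates if check(resp)]

if __name__ == "__main__":
    if verifier_passes_tests(check_following):
        kept = rejection_sample(
            ["Yes. It works. Done.", "This response rambles on in one long sentence"],
            check_following,
        )
        print(kept)  # -> ['Yes. It works. Done.']
```

Responses that survive this filter become SFT data, while pass/fail pairs can serve as preference data for DPO-style training.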