9 Feb 2024 | Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon
This paper introduces LEAP, a novel approach to in-context learning that improves the performance of large language models (LLMs) by learning from mistakes. Unlike traditional few-shot prompting, which uses only correct examples, LEAP intentionally induces the model to make mistakes on the given examples, then has the model reflect on those mistakes to derive explicit, task-specific principles. These principles help the model avoid similar mistakes in the future and are included in the prompt when answering unseen test questions. Notably, LEAP requires no additional inputs or examples beyond the standard few-shot prompting setting.

LEAP is evaluated on a wide range of benchmarks, including multi-hop question answering (HotpotQA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH). Across these benchmarks, LEAP improves strong LLMs such as GPT-3.5-turbo, GPT-4, GPT-4-turbo, and Claude-2.1; for example, it improves over standard few-shot prompting with GPT-4 by 7.5% on DROP and by 3.3% on HotpotQA. The paper also compares LEAP with other approaches and shows that it outperforms them on various reasoning tasks. The results suggest that LEAP revolutionizes the traditional concept of few-shot in-context learning by leveraging the recently emerged abilities of LLMs to follow instructions and explain their mistakes when given the correct answer or feedback.
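To make the pipeline concrete, here is a minimal sketch of the three LEAP stages described above: induce mistakes on the few-shot examples, reflect on those mistakes to extract principles, and answer test questions with the principles prepended to the standard few-shot prompt. The `llm(prompt, temperature)` helper, the sample count, and the simple string-match correctness check are illustrative assumptions, not the paper's actual implementation; any chat-completion API (e.g., GPT-4 or Claude) could back the helper.

```python
def llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a chat-completion call (assumption; plug in your own client)."""
    raise NotImplementedError("wire this up to your LLM provider")


def generate_mistakes(examples, n_samples=3):
    """Stage 1: intentionally induce mistakes by sampling answers at high temperature."""
    mistakes = []
    for question, gold in examples:
        for _ in range(n_samples):
            answer = llm(f"Answer the question step by step:\n{question}", temperature=1.0)
            if gold not in answer:  # crude correctness check; real tasks need task-specific scoring
                mistakes.append((question, answer, gold))
    return mistakes


def learn_principles(mistakes):
    """Stage 2: reflect on each mistake, then distill the reflections into general principles."""
    reflections = []
    for question, wrong, gold in mistakes:
        reflections.append(llm(
            f"Question:\n{question}\n\nIncorrect answer:\n{wrong}\n\n"
            f"Correct answer:\n{gold}\n\n"
            "Explain why the answer is wrong and state a general principle "
            "that would help avoid this kind of mistake."
        ))
    # Summarize the per-example (low-level) insights into a short list of high-level principles.
    return llm("Summarize these insights into a few general principles:\n" + "\n".join(reflections))


def answer_with_principles(test_question, few_shot_examples, principles):
    """Stage 3: standard few-shot prompt, augmented with the learned principles."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in few_shot_examples)
    prompt = f"Principles to keep in mind:\n{principles}\n\n{demos}\n\nQ: {test_question}\nA:"
    return llm(prompt)
```

In this reading, the only extra cost over standard few-shot prompting is a handful of additional LLM calls at setup time; at test time the prompt simply carries the learned principles alongside the usual demonstrations.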