This paper introduces In-Context Reflection (ICR), a novel method for selecting effective demonstrations for in-context learning (ICL) with large language models (LLMs). ICL adapts an LLM to diverse tasks by prepending question-answer pairs (demonstrations) to the prompt, but its effectiveness hinges on the quality of the selected demonstrations, which remains difficult to guarantee in practice. Existing selection methods either rely on external supervision or require frequent interactions with the LLM, both of which are costly. ICR addresses these challenges by strategically selecting demonstrations that reduce the discrepancy between the LLM's outputs and the task's actual input-output mappings.
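To make the ICL setup concrete, the sketch below shows one common way demonstrations are combined with a test query into a single prompt. The Q/A template, separator, and sentiment task are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of ICL prompt construction (illustrative template,
# not the paper's exact format).
def build_icl_prompt(demonstrations, query):
    """Concatenate question-answer demonstrations ahead of the test query."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations]
    parts.append(f"Q: {query}\nA:")  # the LLM completes the final answer
    return "\n\n".join(parts)

demos = [
    ("Is 'Great plot and superb acting' positive or negative?", "positive"),
    ("Is 'A tedious, forgettable film' positive or negative?", "negative"),
]
print(build_icl_prompt(demos, "Is 'Loved every minute of it' positive or negative?"))
```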
ICR works by iteratively refining an initial set of demonstrations. In each step, it scores a pool of candidate examples with a new metric called misconfidence, which measures how strongly each example challenges the LLM's current understanding of the task. The most confusing candidates then replace the least informative demonstrations in the current set. The key idea is to exploit the discrepancy between the LLM's output distribution and the task-specific input-output mappings: by constructing ICL prompts that bridge this discrepancy, ICR calibrates the LLM's output distribution toward the desired task labels.
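The sketch below illustrates this selection loop under stated assumptions: misconfidence is taken here as the ratio of the strongest incorrect-label probability to the correct-label probability, the replacement rule simply swaps out the tail of the current set, and llm_label_probs is a hypothetical interface returning the LLM's label distribution for a query given the current demonstrations. The paper's exact formulation may differ.

```python
# Illustrative sketch of ICR-style iterative demonstration refinement.
# ASSUMPTIONS: the misconfidence formula, the replacement rule, and the
# llm_label_probs interface are stand-ins, not the paper's exact method.

def misconfidence(label_probs, gold_label):
    """Ratio of the strongest wrong-label probability to the gold-label
    probability; higher means the model is confidently wrong."""
    wrong = max(p for label, p in label_probs.items() if label != gold_label)
    return wrong / max(label_probs[gold_label], 1e-9)

def icr_select(candidates, demos, llm_label_probs, n_rounds=3, n_swap=2):
    """Iteratively replace the least informative demonstrations with the
    candidates the current prompt finds most confusing."""
    demos = list(demos)
    for _ in range(n_rounds):
        # Score every candidate under the current demonstration set.
        scored = sorted(
            ((misconfidence(llm_label_probs(demos, x), y), (x, y))
             for x, y in candidates),
            key=lambda item: item[0],
            reverse=True,
        )
        hardest = [ex for _, ex in scored[:n_swap] if ex not in demos]
        # Swap the tail of the current set for the most confusing examples.
        demos = demos[: len(demos) - len(hardest)] + hardest
    return demos
```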
The paper evaluates ICR across five diverse datasets encompassing 13 subtasks, where it delivers an average performance gain of 4% over existing methods and consistently improves the LLM's performance. ICR also generalizes across tasks: selected demonstration sets remain effective when evaluated on different tasks from the same task family, indicating that the method is robust.
The main contributions of this work include: (1) proposing to leverage the difference between the output distribution of LLMs and the input-output mappings of a given task to address the drawbacks of existing demonstration selection strategies; (2) introducing misconfidence as a metric to quantify this discrepancy and presenting ICR, a method that effectively selects demonstrations that provide "lacking knowledge" to help LLMs adapt to specific tasks; and (3) demonstrating through experiments on 13 tasks from 5 task sets that prompts constructed using ICR are both effective and robust.