30 Apr 2024 | Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, Graham Neubig
This paper explores the behavior of in-context learning (ICL) with long-context models, focusing on the performance and properties of ICL as the number of demonstrations approaches the size of entire training datasets. The study is conducted on multiple datasets and models, including Llama-2 and Mistral-7b-v0.2. Key findings include:
1. **Performance Scaling**: ICL continues to improve with hundreds or thousands of demonstrations, often approaching or exceeding the performance of finetuning on the same data.
2. **Sensitivity to Example Order**: Long-context ICL is less sensitive to example order than short-context ICL; with many demonstrations in context, shuffling their order changes performance relatively little.
3. **Label Sorting Impact**: Grouping examples by label can negatively impact performance, especially in the long-context regime, indicating that contextualization of different labels is crucial.
4. **Relevance of Examples**: The effectiveness of long-context ICL comes primarily from the model drawing on relevant examples at inference time, rather than from learning a sharper task boundary while encoding the full demonstration set (see the sketch after this list).
5. **Efficiency and Trade-offs**: While long-context ICL can be more effective, it may come at the cost of increased inference time compared to finetuning, which can be more efficient for large datasets.
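To make the role of example relevance concrete, here is a minimal sketch, not the paper's exact setup, contrasting random selection of many demonstrations with retrieval of the most similar ones when building an ICL prompt. The toy dataset, prompt format, and TF-IDF retriever are illustrative assumptions.

```python
# Sketch: building many-shot ICL prompts by random vs. retrieval-based selection.
# Dataset, prompt format, and the TF-IDF retriever are placeholders, not the paper's setup.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train = [  # (text, label) pairs; stand-in for a full training set
    ("the movie was wonderful", "positive"),
    ("a dull, lifeless script", "negative"),
    ("truly charming from start to finish", "positive"),
    ("i wanted my two hours back", "negative"),
]

def format_demo(text, label):
    return f"Input: {text}\nLabel: {label}\n"

def random_prompt(query, k):
    # Random selection: in the long-context regime, k can be hundreds or thousands.
    demos = random.sample(train, k=min(k, len(train)))
    return "".join(format_demo(t, l) for t, l in demos) + f"Input: {query}\nLabel:"

def retrieval_prompt(query, k):
    # Retrieval-based selection: keep only the k demonstrations most similar to the query.
    texts = [t for t, _ in train]
    vec = TfidfVectorizer().fit(texts + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(texts))[0]
    top = sims.argsort()[::-1][:k]
    return "".join(format_demo(*train[i]) for i in top) + f"Input: {query}\nLabel:"

# Either prompt would then be sent to a long-context model; the reported trend is that
# with very many demonstrations, random selection closes much of the gap to retrieval.
print(retrieval_prompt("an uneven but charming film", k=2))
```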
The paper also discusses the limits of our current understanding of ICL and suggests that further research is needed to validate hypotheses about ICL at larger scales. Overall, the findings highlight the potential of long-context ICL as a powerful tool for a range of tasks, especially when the encoded demonstration set can be cached and reused across queries.
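As a rough illustration of that caching idea, below is a minimal sketch assuming the prefix-cache reuse pattern supported by recent Hugging Face `transformers` versions: the long demonstration prefix is encoded once, and its KV cache is reused for every test query. The model id, demonstration contents, and queries are placeholders.

```python
# Sketch: encode a long many-shot prefix once, then reuse its KV cache per query.
# Model id, demonstrations, and queries are illustrative; behavior depends on the
# installed transformers version supporting past_key_values reuse in generate().
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder long-context model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Encode the shared many-shot prefix a single time and keep its KV cache.
demo_prefix = "Input: great acting\nLabel: positive\n" * 1000  # stand-in for real demos
prefix_inputs = tokenizer(demo_prefix, return_tensors="pt").to(model.device)
with torch.no_grad():
    prefix_cache = model(**prefix_inputs, use_cache=True).past_key_values

# Each test example then pays only for its own tokens, not for re-encoding the prefix.
for query in ["Input: a dull script\nLabel:", "Input: truly charming\nLabel:"]:
    inputs = tokenizer(demo_prefix + query, return_tensors="pt").to(model.device)
    cache = copy.deepcopy(prefix_cache)  # copy so the shared prefix cache stays unmodified
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=3)
    print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```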