18 Apr 2025 | Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
This paper explores the effectiveness of in-context learning (ICL) for instruction following in large language models (LLMs). The authors build on the work of Lin et al. (2024), who proposed URIAL, a method that aligns base LLMs with only three in-context examples and achieves non-trivial instruction-following performance. However, the paper finds that, while effective, URIAL still underperforms instruction fine-tuning (IFT) on the established MT-Bench benchmark, especially with more capable base LLMs.
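To make the setup concrete, here is a minimal sketch of this style of ICL alignment: a short preamble and a few curated (instruction, response) pairs are prepended to the user's query, and the concatenated string is fed to an untuned base model. The preamble, template, and demonstrations below are hypothetical stand-ins, not the actual URIAL prompt from Lin et al. (2024).

```python
# Minimal sketch of ICL alignment via in-context demonstrations.
# The preamble, "# Instruction:/# Response:" template, and demos are
# illustrative placeholders, not the exact URIAL prompt.

PREAMBLE = "Below are examples of instructions and helpful, honest responses.\n\n"

# Hypothetical demonstrations; URIAL uses three carefully curated,
# stylistically consistent examples.
DEMOS = [
    ("Summarize photosynthesis in one sentence.",
     "Plants use sunlight to convert water and CO2 into glucose and oxygen."),
    ("Suggest a name for a coffee shop.",
     "A few options: Daily Grind, Bean There, Morning Ritual."),
    ("What causes seasons on Earth?",
     "Seasons arise from the tilt of Earth's axis relative to its orbital plane."),
]

def build_icl_prompt(query: str) -> str:
    """Concatenate the preamble, demonstrations, and the new query."""
    parts = [PREAMBLE]
    for instruction, response in DEMOS:
        parts.append(f"# Instruction:\n{instruction}\n\n# Response:\n{response}\n\n")
    parts.append(f"# Instruction:\n{query}\n\n# Response:\n")
    return "".join(parts)

print(build_icl_prompt("Explain overfitting to a beginner."))
```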
The authors then analyze the key ingredients of successful ICL alignment, identifying the crucial role of decoding parameters. They show that tuning these parameters alone can significantly improve the performance of base models. Additionally, they demonstrate that adding high-quality, carefully selected in-context demonstrations further enhances performance, bringing ICL closer to that of instruction-tuned models.
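As an illustration of the decoding-parameter knob, the sketch below contrasts greedy decoding with temperature/top-p sampling via the Hugging Face transformers API. The model name is a placeholder for any base (non-instruction-tuned) checkpoint, and the specific values are illustrative, not the paper's reported optimum.

```python
# Sketch: varying decoding parameters for a base model under ICL alignment.
# Model name and sampling settings are placeholders/illustrative values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# In practice this would be the full ICL prompt (preamble + demonstrations),
# e.g. as built in the earlier sketch.
prompt = "# Instruction:\nExplain overfitting to a beginner.\n\n# Response:\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: deterministic, no sampling randomness.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Nucleus sampling: temperature and top-p are exactly the kind of decoding
# parameters whose choice the paper identifies as crucial for ICL alignment.
sampled = model.generate(
    **inputs, do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=256
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```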
The paper also provides a systematic comparison of ICL and IFT for instruction following in the low-data regime, where ICL can be a viable alternative to IFT. The results show that with high-quality data, ICL and IFT achieve almost identical first-turn MT-Bench scores, but IFT outperforms ICL on second-turn scores, suggesting that fine-tuning generalizes better to multi-turn conversations.
Overall, the work advances the understanding of ICL as an alignment technique and its relationship to IFT, providing insights into the limitations and potential of ICL in the context of instruction following. The authors conclude by suggesting that ICL can be a useful baseline for customization without fine-tuning, but it may not fully compete with more sophisticated alignment techniques in terms of generalization and performance.