2024 | Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
The paper "Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning" explores the effectiveness of selecting the 1,000 longest responses from standard datasets for instruction fine-tuning (IFT) of large language models (LLMs). The authors argue that these longer responses contain more learnable information and are harder to overfit, leading to better performance compared to more sophisticated methods that use manual curation or GPT-3.5-Turbo as a quality scorer. They demonstrate that their simple baseline consistently outperforms state-of-the-art methods according to GPT-4 and PaLM-2, while remaining competitive on Open LLM benchmarks that test factual knowledge. The study also shows that refining these long instructions can further improve the performance of fine-tuned LLMs, achieving competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0. The authors provide extensive analysis to ensure that the enhanced performance is not due to GPT-4's preference for longer responses and conclude that fine-tuning on the longest responses should be the default baseline for IFT.
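The selection baseline the paper describes is simple enough to sketch in a few lines. This is a minimal illustration, not the authors' code: the field names (`"instruction"`, `"response"`) and the use of character count as the length measure are assumptions made here for clarity.

```python
# Hypothetical sketch of the length-based selection baseline: keep the k
# examples with the longest responses from an IFT dataset. Field names and
# the character-count length measure are assumptions, not the paper's schema.
def select_longest(dataset, k=1000):
    """Return the k examples whose responses are longest."""
    return sorted(dataset, key=lambda ex: len(ex["response"]), reverse=True)[:k]

# Toy usage with a tiny mock dataset
data = [
    {"instruction": "Q1", "response": "short"},
    {"instruction": "Q2", "response": "a much longer, more detailed answer"},
    {"instruction": "Q3", "response": "medium length reply"},
]
subset = select_longest(data, k=2)  # keeps Q2 and Q3
```

The resulting subset would then be used as the fine-tuning set in place of the full dataset or a curated selection.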