31 Jan 2024 | Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li
LongAlign is a recipe for aligning large language models (LLMs) to handle long contexts. The approach builds a diverse dataset of long instruction-following examples using Self-Instruct, then fine-tunes with efficient training strategies: packing, paired with a loss weighting scheme that balances the contribution of sequences of different lengths, and sorted batching. A new benchmark, LongBench-Chat, is introduced to evaluate instruction following on long queries. Experiments show that LongAlign outperforms existing methods on long-context tasks by up to 30% while maintaining performance on short, generic tasks. The datasets, code, and models are open-sourced for research and development. Overall, the work addresses data diversity, training efficiency, and benchmarking for long-context alignment, and the results highlight the importance of the quantity and diversity of long instruction data, as well as the effectiveness of packing and sorted batching in improving training efficiency and model performance.
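
The loss weighting idea can be made concrete with a small sketch. The snippet below is an illustrative PyTorch sketch, not the authors' released implementation; the function name `packed_loss` and its arguments are invented for illustration. It contrasts the naive token-averaged loss over a packed batch, which lets long sequences dominate the gradient, with a sequence-balanced variant in which each packed example contributes equally regardless of its length.

```python
# Illustrative sketch (assumed API, not LongAlign's released code) of
# sequence-balanced loss weighting for packed training batches.
import torch
import torch.nn.functional as F

def packed_loss(logits, targets, seq_lengths, balance_sequences=True):
    """Cross-entropy over a packed batch.

    logits:      (total_tokens, vocab_size) - packed sequences concatenated
    targets:     (total_tokens,)            - next-token labels, same order
    seq_lengths: list[int]                  - token count of each packed sequence
    """
    # Per-token loss, no reduction yet.
    token_loss = F.cross_entropy(logits, targets, reduction="none")

    if not balance_sequences:
        # Naive packing objective: every token weighs the same, so a
        # 60k-token document outweighs a 100-token chat turn by 600x.
        return token_loss.mean()

    # Balanced objective: average within each sequence first, then across
    # sequences, so each example contributes equally to the gradient.
    per_seq = [chunk.mean() for chunk in torch.split(token_loss, seq_lengths)]
    return torch.stack(per_seq).mean()

# Example: two sequences of 4 and 2 tokens packed into one batch.
logits = torch.randn(6, 32000)
targets = torch.randint(0, 32000, (6,))
print(packed_loss(logits, targets, seq_lengths=[4, 2]))
```

Whether the per-sequence weights are exactly uniform or depend on the number of sequences per batch is a design choice; the sketch only shows the general principle that motivates the weighting strategy described above.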