2 May 2024 | Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto
The paper "Fewer Truncations Improve Language Modeling" addresses the issue of excessive truncation in large language model (LLM) training, which can lead to data integrity issues and hinder the model's ability to learn coherent and factually consistent content. The authors propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. This method eliminates unnecessary truncations while maintaining the same training efficiency as the common concatenation approach. Empirical results show that Best-fit Packing achieves superior performance on various tasks, including reading comprehension, context following, and program synthesis, and effectively reduces closed-domain hallucination by up to 58.3%. The paper highlights the importance of data integrity in LLM training and provides a practical solution to improve model performance and reduce hallucination.
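At its core, the packing step is a classic bin-packing problem: documents are first split into chunks no longer than the training sequence length, and the chunks are then assigned to sequences so that none of them is cut. Below is a minimal Python sketch of the Best-Fit-Decreasing heuristic this idea builds on. The function name `best_fit_pack` and the naive linear scan over bins are illustrative choices, not the authors' implementation; the paper describes an optimized variant that scales this search to pretraining-sized corpora.

```python
def best_fit_pack(doc_lengths, max_len):
    """Pack document chunks into fixed-length training sequences with the
    Best-Fit-Decreasing heuristic: place each chunk (longest first) into
    the sequence with the least remaining room that still fits it."""
    # Split any document longer than max_len into max_len-sized chunks,
    # so every item fits in a single sequence and nothing is discarded.
    chunks = []
    for n in doc_lengths:
        while n > max_len:
            chunks.append(max_len)
            n -= max_len
        if n > 0:
            chunks.append(n)

    bins = []  # each bin: [remaining_capacity, list_of_chunk_lengths]
    for size in sorted(chunks, reverse=True):
        # Best fit: the bin with the smallest remaining capacity >= size.
        best = min((b for b in bins if b[0] >= size),
                   key=lambda b: b[0], default=None)
        if best is None:  # no existing sequence fits: open a new one
            best = [max_len, []]
            bins.append(best)
        best[0] -= size
        best[1].append(size)
    return [b[1] for b in bins]

# Toy example with a 2048-token context: only the 5000-token document is
# split (unavoidably, since it exceeds the context), and no other chunk
# is truncated, unlike naive concatenate-then-chunk.
print(best_fit_pack([5000, 1000, 600, 500, 400, 100], max_len=2048))
# [[2048], [2048], [1000, 904, 100], [600, 500, 400]]
```

The contrast with the common concatenation approach is the point: concatenating all documents and slicing the stream every `max_len` tokens truncates documents wherever the slice boundary happens to fall, whereas packing only ever splits documents that are longer than the context itself.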