21 Feb 2024 | Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini
This paper investigates the impact of sequence composition on language model pre-training, focusing on how packed training chunks are attended to (standard causal masking across document boundaries versus intra-document causal masking) and how documents are grouped into chunks. The authors find that standard causal masking over packed chunks exposes the model to distracting information from preceding, unrelated documents, which hurts performance on downstream tasks. Intra-document causal masking, which conditions the likelihood of each token only on the previous tokens within the same document, significantly improves performance at the cost of roughly 4% additional runtime. The paper also introduces Bm25Chunk, an efficient retrieval-based packing method that groups related documents into the same chunk and improves language modeling, in-context learning, knowledge memorisation, and context utilisation without sacrificing efficiency. Experimental results across various datasets and models demonstrate that both techniques reduce distraction from unrelated context and enhance model performance.
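To make the intra-document causal masking idea concrete, here is a minimal sketch (not the authors' implementation) of how such a mask can be built for a packed chunk: given the lengths of the documents packed into one sequence, each token is allowed to attend only to earlier tokens of the same document, rather than to everything earlier in the chunk. The function name and the use of PyTorch are illustrative choices.

```python
import torch

def intra_document_causal_mask(doc_lengths: list[int]) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True means attention is allowed."""
    seq_len = sum(doc_lengths)
    # doc_ids[i] = index of the document that token i belongs to
    doc_ids = torch.repeat_interleave(
        torch.arange(len(doc_lengths)), torch.tensor(doc_lengths)
    )
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Example: a 6-token chunk packing two documents of lengths 4 and 2.
# The first token of the second document attends only to itself,
# not to the four tokens of the preceding document.
print(intra_document_causal_mask([4, 2]).int())
```

Similarly, the following sketch illustrates the general idea behind retrieval-based packing in the spirit of Bm25Chunk; the greedy loop, the `bm25_pack` function, and the use of the `rank_bm25` package are assumptions for illustration, not the paper's exact algorithm. Each chunk is seeded with an unused document and filled with its most BM25-similar unused neighbours, so that documents packed together tend to be topically related.

```python
from rank_bm25 import BM25Okapi

def bm25_pack(documents: list[str], docs_per_chunk: int = 4) -> list[list[int]]:
    """Greedily group document indices into chunks of BM25-similar documents."""
    tokenized = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized)
    unused = set(range(len(documents)))
    chunks = []
    while unused:
        seed = unused.pop()
        chunk = [seed]
        scores = bm25.get_scores(tokenized[seed])
        # Fill the chunk with the highest-scoring unused documents.
        # sorted() snapshots `unused`, so removing items inside the loop is safe.
        for idx in sorted(unused, key=lambda i: scores[i], reverse=True):
            if len(chunk) >= docs_per_chunk:
                break
            chunk.append(idx)
            unused.discard(idx)
        chunks.append(chunk)
    return chunks
```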