Analysing The Impact of Sequence Composition on Language Model Pre-Training

21 Feb 2024 | Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini
This paper investigates how sequence composition strategies affect language model pre-training. The authors find that applying causal masking across document boundaries lets distracting information from previous documents leak into the context, which hurts both language modelling and downstream performance. In contrast, intra-document causal masking, which conditions the likelihood of each token only on the previous tokens of the same document, significantly improves performance, at the cost of extra runtime.

The authors also propose BM25Chunk, an efficient retrieval-based sequence construction method that improves in-context learning, knowledge memorisation, and context utilisation without sacrificing training efficiency. Comparing packing strategies for pre-training sequences (MixChunk, UniChunk, and BM25Chunk), they show that BM25Chunk achieves the lowest perplexity among the causal-masking models and yields consistent gains on downstream tasks.
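The paper's exact BM25Chunk procedure is not reproduced here, but the general idea of retrieval-based chunk construction can be sketched as follows: start from a seed document, greedily append the most BM25-similar remaining documents until a target length is reached, then split the result into fixed-length training sequences. The `rank_bm25` dependency, the whitespace tokenisation, and the helper name `build_related_chunks` are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of retrieval-based chunk construction in the spirit of
# BM25Chunk: greedily pack lexically related documents together, then cut
# the packed token stream into fixed-length training sequences.
from rank_bm25 import BM25Okapi


def build_related_chunks(documents, chunk_len=2048):
    """Greedily pack BM25-related documents into training chunks."""
    tokenized = [doc.split() for doc in documents]   # crude whitespace tokenisation
    bm25 = BM25Okapi(tokenized)
    remaining = set(range(len(documents)))
    chunks = []

    while remaining:
        seed = remaining.pop()                        # arbitrary seed document
        chunk_tokens = list(tokenized[seed])
        while len(chunk_tokens) < chunk_len and remaining:
            # Score the remaining documents against the current chunk
            # content and append the most similar one.
            scores = bm25.get_scores(chunk_tokens)
            best = max(remaining, key=lambda i: scores[i])
            remaining.remove(best)
            chunk_tokens.extend(tokenized[best])
        # Split the packed token stream into fixed-length sequences.
        for start in range(0, len(chunk_tokens), chunk_len):
            chunks.append(chunk_tokens[start:start + chunk_len])

    return chunks
```

Any packing order that raises within-chunk relatedness serves the same purpose; the point of a retrieval-based construction like the one sketched above is that it does so while remaining cheap enough to run over a full pre-training corpus.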
Increasing the relatedness of the documents packed into a pre-training chunk reduces potential distraction and improves the performance of causal-masking models. On evaluations of in-context learning, knowledge memorisation, and context utilisation, BM25Chunk and IntraDoc obtain the best results. An analysis of attention distributions during language modelling further shows that models trained with BM25Chunk and IntraDoc are more robust to irrelevant context and make better use of relevant information.

The authors conclude that sequence composition has a substantial impact on pre-training: intra-document causal masking and efficient retrieval-based sequence construction can significantly improve language models without sacrificing efficiency, and the relatedness of documents within a pre-training chunk should be taken into account to reduce distraction.
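To make the masking scheme concrete, the sketch below builds an intra-document causal attention mask for a packed sequence, assuming each token is labelled with the ID of the document it came from. The PyTorch code, the boolean-mask convention, and the function name are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of intra-document causal masking: given per-token
# document IDs for a packed sequence, allow each position to attend only
# to earlier positions that belong to the same document.
import torch


def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) integer document ID for every token in the
    packed sequence. Returns a (seq_len, seq_len) boolean mask where
    True means attention is allowed."""
    seq_len = doc_ids.size(0)
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()       # standard causal mask
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)        # same-document pairs
    return causal & same_doc


# Example: two documents of lengths 3 and 2 packed into one sequence.
doc_ids = torch.tensor([0, 0, 0, 1, 1])
mask = intra_document_causal_mask(doc_ids)
# The first token of the second document attends only to itself, not to
# the three tokens of the previous document.
assert mask[3].tolist() == [False, False, False, True, False]
```

Restricting attention to this block-diagonal structure is what prevents tokens of one document from attending to, and being distracted by, the preceding documents in the same packed sequence.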