2024 | Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme
Large Language Models (LLMs) are often released with a reported cutoff date, the time at which their training data was collected. This reported cutoff, however, may not accurately reflect the effective cutoff for every resource the model has seen. This paper investigates effective cutoffs by analyzing model behavior on time-versioned training data: the effective cutoff for a resource is defined as the date whose version of that resource the model's knowledge aligns with most closely. The paper proposes estimating it by measuring perplexity across successive dated versions of documents.
The analysis finds that effective cutoffs often diverge from reported ones, driven by temporal misalignment in CommonCrawl data and by complications in deduplication pipelines. In particular, models trained on CommonCrawl may absorb substantial amounts of outdated information, shifting their effective cutoff for some resources well before the reported date. Effective cutoffs can also vary significantly between models and between resources within a single model. These results underscore the need for careful attention to the temporal alignment of data sources and training processes when using LLMs, and for greater transparency in how knowledge cutoffs are reported.
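The estimation procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `perplexity` and `effective_cutoff` are hypothetical helper names, and the per-token log-probabilities in `scores` are made-up numbers standing in for a real model's scores on dated versions of a document collection.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def effective_cutoff(versioned_logprobs):
    """Given {version_date: [per-token log-probs on that dated version]},
    return the date whose version the model scores as most probable
    (lowest perplexity) -- the estimated effective cutoff."""
    ppl = {date: perplexity(lps) for date, lps in versioned_logprobs.items()}
    return min(ppl, key=ppl.get)

# Hypothetical scores: the model assigns the highest probability
# (lowest perplexity) to the 2022-03 version of the documents.
scores = {
    "2021-12": [-2.1, -2.3, -2.0],
    "2022-03": [-1.2, -1.1, -1.3],
    "2022-06": [-1.9, -2.2, -2.4],
}
print(effective_cutoff(scores))  # → 2022-03
```

In practice the log-probabilities would come from scoring each dated snapshot of a resource (e.g. monthly Wikipedia dumps) with the model under test, and the argmin over versions gives that resource's effective cutoff.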