17 Sep 2024 | Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme
The paper "Dated Data: Tracing Knowledge Cutoffs in Large Language Models" by Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme from Johns Hopkins University explores the discrepancies between the reported cutoff dates and the effective cutoff dates of knowledge in large language models (LLMs). The authors define the *effective cutoff* as the date when the model's knowledge of a resource aligns with the most recent version of that resource, rather than the reported cutoff date. They propose a method to estimate these effective cutoffs by probing LLMs with varying versions of resources and analyzing the resulting perplexity curves.
Key findings include:
1. **Effective Cutoffs Differ from Reported Cutoffs**: Many LLMs have effective cutoffs that differ significantly from their reported cutoffs, particularly for newer models.
2. **Complications in Deduplication Pipelines**: LLM training datasets often contain near-duplicate and exact duplicate documents, which are not effectively removed by deduplication pipelines (a toy illustration follows after this list).
3. **Temporal Misalignments in CommonCrawl Dumps**: CommonCrawl dumps often contain page versions crawled long before the dump date, so training on a recent dump does not guarantee correspondingly recent knowledge.
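Finding 2 is easiest to see with a small, self-contained illustration (not the paper's pipeline): two near-identical versions of a page defeat exact-hash deduplication, while a simple shingle-overlap check flags them as near-duplicates. The example texts and the 5-gram shingle size are assumptions chosen for brevity.

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    """Set of word n-grams ("shingles") used for near-duplicate comparison."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

# Two versions of the same page that differ by a single edited token.
old = "The tower is 300 metres tall and it first opened to visitors in 1889"
new = "The tower is 330 metres tall and it first opened to visitors in 1889"

# Exact-hash dedup sees two distinct documents, so both copies survive.
print(hashlib.md5(old.encode()).hexdigest()
      == hashlib.md5(new.encode()).hexdigest())  # False

# Shingle overlap is far above what unrelated documents score (near 0.0),
# flagging the pair as near-duplicates that exact matching missed.
print(round(jaccard(shingles(old), shingles(new)), 2))
```

Production pipelines use scalable approximations of this overlap test (e.g., MinHash), but the failure mode is the same: exact hashing alone lets stale near-copies of a document persist in the training set.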
The authors provide a detailed analysis of pre-training datasets and conclude that the reported cutoffs are not always accurate, highlighting the need for more transparent and precise reporting of knowledge cutoffs in LLMs. They also discuss the implications for both LLM creators and users, emphasizing the importance of understanding the temporal boundaries of LLMs' knowledge. The results and code are available at <https://github.com/nevron/cdated_data/>.