28 Jun 2024 | Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C. Wallace
This paper introduces syntactic templates as a way to analyze repetitive patterns in text generated by large language models (LLMs). A syntactic template is a sequence of part-of-speech (POS) tags that recurs across generated texts. The study finds that models tend to produce more templated text than human-written references. Furthermore, 76% of the templates found in model-generated text also appear in the pre-training data, compared to 35% of the templates in human-authored text. Templates are not overwritten during fine-tuning processes such as RLHF, and their presence in pre-training data allows syntactic patterns in models to be analyzed without direct access to the training data.
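To make the definition concrete, the sketch below extracts candidate syntactic templates as POS-tag n-grams that repeat across texts, using spaCy for tagging. The n-gram length, repetition threshold, and corpus handling are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: candidate syntactic templates as POS-tag n-grams that recur
# across texts. The n-gram length and the "appears in >= min_texts texts"
# threshold are illustrative assumptions, not the paper's exact settings.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a POS tagger

def pos_ngrams(text: str, n: int = 6):
    """Return all length-n sequences of POS tags for a text."""
    tags = [tok.pos_ for tok in nlp(text)]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def find_templates(texts, n: int = 6, min_texts: int = 2):
    """Candidate templates: POS n-grams occurring in at least `min_texts` texts."""
    doc_counts = Counter()
    for text in texts:
        doc_counts.update(set(pos_ngrams(text, n)))  # count each n-gram once per text
    return {gram: count for gram, count in doc_counts.items() if count >= min_texts}

generations = [
    "The study provides a comprehensive overview of the proposed method.",
    "The paper presents a detailed analysis of the experimental results.",
]
print(find_templates(generations, n=5))  # prints the shared POS 5-grams
```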
The study evaluates eight models across three tasks and finds that templates can differentiate between models, tasks, and domains. Templates are useful for qualitatively evaluating common model constructions and for analyzing style memorization of training data in LLMs. The research also shows that templates can be used to detect data memorization, with models memorizing between 0.8% and 3.1% of texts, often by replacing numbers and substituting synonyms.
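One way such approximate memorization could be operationalized is to normalize numbers before matching generations against reference texts, so that a copy that only swaps numeric values still counts. The digit-masking rule and exact-match criterion below are illustrative assumptions, not the paper's detection procedure.

```python
# Sketch of an approximate-match memorization check: mask digits so that a
# generation that copies a reference verbatim except for numbers still counts.
# The masking rule and the exact-match criterion are illustrative assumptions.
import re

def normalize(text: str) -> str:
    text = text.lower().strip()
    return re.sub(r"\d+(\.\d+)?", "<NUM>", text)  # replace numbers with a placeholder

def memorization_rate(generations, references) -> float:
    ref_set = {normalize(r) for r in references}
    hits = sum(normalize(g) in ref_set for g in generations)
    return hits / len(generations)

refs = ["The model was trained on 1.2 billion tokens over 40 epochs."]
gens = ["The model was trained on 3 billion tokens over 12 epochs."]
print(memorization_rate(gens, refs))  # 1.0: memorized up to number substitution
```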
The study demonstrates that templates are learned early in pre-training rather than during fine-tuning, and that templates occur more often in pre-training data than random n-grams do. Templates are also more frequent in model-generated text than in human-written references, especially longer templates. The study further shows that templates can be used to measure memorization of training data, with models memorizing between 5.3% and 6.4% of texts depending on the definition of memorization used.
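A rough way to compare how templated two collections are is to measure the fraction of texts that contain at least one known template. The metric below is an illustrative choice rather than the paper's exact measure, and the template set would come from a procedure like the earlier sketch.

```python
# Sketch: compare how "templated" model outputs are vs. human references by the
# fraction of texts containing at least one known template (a set of POS n-grams).
# The metric is an illustrative assumption, not the paper's exact measure.
import spacy

nlp = spacy.load("en_core_web_sm")

def contains_template(text: str, templates: set, n: int) -> bool:
    tags = [tok.pos_ for tok in nlp(text)]
    return any(tuple(tags[i:i + n]) in templates for i in range(len(tags) - n + 1))

def template_rate(texts, templates, n: int = 6) -> float:
    return sum(contains_template(t, templates, n) for t in texts) / len(texts)

# Hypothetical usage, with placeholder corpora:
# templates = find_templates(pretraining_sample, n=6)  # from the earlier sketch
# print(template_rate(model_outputs, templates), template_rate(human_refs, templates))
```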
The research highlights the importance of analyzing syntactic patterns in LLMs and provides a framework for understanding how models learn and reproduce patterns from their training data. The findings suggest that models may be more likely to produce repetitive structures in downstream tasks, and that templates can be used to assess output diversity and training-data memorization in LLMs. The study also shows that larger models do not necessarily produce less templated output, and that templates can be used to evaluate the quality of generated text across a range of tasks.