28 Jun 2024 | Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C. Wallace
This paper introduces *syntactic templates*, sequences of part-of-speech (POS) tags that appear frequently in text, as a lens for analyzing repetitive patterns in the output of large language models (LLMs). The authors find that these templates are more prevalent in model-generated text than in human-written references, and that most of them can be traced back to the pre-training data. The study evaluates eight models on three tasks (open-ended generation, synthetic data generation, and summarization) to assess the incidence and characteristics of templates; a minimal extraction sketch appears after the findings list. Key findings include:
1. **Template Incidence**: Model-generated text shows a higher rate of templates compared to human-written references, especially in summarization tasks.
2. **Pre-training Data Analysis**: Most templates found in model-generated text are also present in the pre-training data, indicating that models learn these patterns early in the training process.
3. **Template Memorization**: The analysis of style memorization reveals that models can memorize specific syntactic patterns from the pre-training data, even when exact token matches are not observed.
4. **Model Size Impact**: Contrary to expectations, larger models do not necessarily produce less templated output.
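To make the template notion concrete, here is a minimal sketch of POS-template extraction, assuming spaCy's `en_core_web_sm` tagger and an illustrative template length of 4; the paper's exact extraction procedure, tagger, and frequency thresholds may differ.

```python
from collections import Counter

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def pos_ngrams(text, n=4):
    """Yield every length-n sequence of coarse POS tags, per sentence."""
    for sent in nlp(text).sents:
        tags = [tok.pos_ for tok in sent if not tok.is_space]
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i : i + n])

def top_templates(texts, n=4, k=10):
    """Return the k most frequent POS n-grams across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(pos_ngrams(text, n))
    return counts.most_common(k)

# Toy usage: two stylistically similar generations share a template
# even though they share almost no tokens.
generations = [
    "The study highlights the importance of careful evaluation.",
    "The paper underscores the importance of robust benchmarks.",
]
for template, freq in top_templates(generations, n=4, k=5):
    print(" ".join(template), freq)
```

Comparing the frequency of such templates in model outputs against human-written references is the kind of template-incidence comparison described in finding 1.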
The paper concludes by suggesting that syntactic templates can be useful for characterizing the stylistic patterns learned by LLMs and for detecting subtle forms of data memorization. The work aims to inspire further research into the origins of stylistic patterns in LLM outputs.
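As a hypothetical sketch of the style-memorization check the conclusion alludes to, the function below tests whether frequent templates in model output also occur in a sample of pre-training text, even when no token sequences match. It reuses `pos_ngrams` from the sketch above; the corpus sample and frequency threshold are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def template_overlap(model_texts, corpus_texts, n=4, min_freq=2):
    """Fraction of frequent model templates that also appear in the corpus sample.

    High overlap without verbatim token matches suggests syntactic (style)
    memorization rather than exact-text memorization.
    """
    model_counts = Counter()
    for text in model_texts:
        model_counts.update(pos_ngrams(text, n))  # pos_ngrams from the sketch above
    frequent = {t for t, c in model_counts.items() if c >= min_freq}
    if not frequent:
        return 0.0

    corpus_templates = set()
    for text in corpus_texts:
        corpus_templates.update(pos_ngrams(text, n))

    return len(frequent & corpus_templates) / len(frequent)
```

At pre-training scale one would index POS n-grams over the corpus offline rather than re-tagging it per query; the set intersection here is just the smallest illustration of the idea.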