This paper explores the representational capacity of transformer language models (LMs) in relation to n-gram LMs, a historically significant and simple class of probabilistic language models. The authors argue that focusing on language acceptance is not an appropriate approach for studying language models, which are inherently probability distributions over strings. Instead, they demonstrate that transformer LMs, using either hard or sparse attention mechanisms, can exactly represent any n-gram LM, providing a concrete lower bound on their probabilistic representational capacity. This establishes a foundational step towards understanding how transformer LMs can implement formal models of computation.
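For concreteness (using the standard definition of an \( n \)-gram LM rather than the paper's exact notation, and assuming out-of-range positions are padded with a distinguished BOS symbol), an \( n \)-gram LM conditions each next-symbol distribution only on the preceding \( n-1 \) symbols, and "exactly represent" means matching the induced distribution over strings:

\[
p(y_t \mid y_{<t}) = p\bigl(y_t \mid y_{t-n+1} \cdots y_{t-1}\bigr),
\qquad
p(\boldsymbol{y}) = \prod_{t=1}^{|\boldsymbol{y}|} p\bigl(y_t \mid y_{t-n+1} \cdots y_{t-1}\bigr),
\]

so a transformer LM \( p_{\mathrm{T}} \) represents an \( n \)-gram LM \( p \) exactly if and only if \( p_{\mathrm{T}}(\boldsymbol{y}) = p(\boldsymbol{y}) \) for every string \( \boldsymbol{y} \).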
The paper presents several key findings:
1. **Hard Attention Transformer LMs**: Transformer LMs with hard attention can represent any n-gram LM using \( n-1 \) heads or, alternatively, \( n-1 \) layers. Even a single head and a single layer suffice, at the cost of more complex non-linear transformations and positional encodings (see the sketch after this list).
2. **Sparse Attention Transformer LMs**: Transformer LMs with sparse attention, which replace softmax with the sparsemax normalization function, can also represent n-gram LMs with \( n-1 \) heads. This construction relies on less standard components but still yields an exact simulation.
3. **Space Complexity**: The space complexity of simulating n-gram LMs is analyzed, showing that the sizes of the contextual and final representations scale logarithmically with the length of the generated string.
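The following is a minimal, illustrative sketch of the \( n-1 \)-head idea, not the paper's actual construction: each hard-attention head is assumed to attend to a fixed relative offset (engineered here directly through positional scores rather than through learned query–key interactions), so that together the heads copy the previous \( n-1 \) symbols; a final feed-forward lookup, omitted here, would then map that context to the n-gram conditional distribution. All function names are hypothetical.

```python
import numpy as np

def hardmax(scores):
    """Hard attention: place all probability mass on the highest-scoring position."""
    out = np.zeros_like(scores)
    out[np.argmax(scores)] = 1.0
    return out

def ngram_context_via_heads(symbol_onehots, n):
    """Recover the (n-1)-symbol context of the last position with n-1 hard-attention heads.

    Head k (k = 1, ..., n-1) scores positions so that position t-k wins under hardmax,
    and therefore copies the one-hot encoding of the symbol at t-k. Concatenating the
    head outputs yields exactly the information an n-gram LM conditions on. (The paper's
    construction achieves the selection via positional encodings; the scores are written
    down directly here for clarity. Prefixes shorter than n-1 symbols would be handled
    by BOS padding, which this toy sketch omits.)
    """
    T = len(symbol_onehots)
    t = T - 1                                      # current (query) position
    values = np.stack(symbol_onehots)              # shape (T, |vocab|)
    head_outputs = []
    for k in range(1, n):
        scores = -np.abs(np.arange(T) - (t - k))   # peaks at position t-k
        attn = hardmax(scores.astype(float))
        head_outputs.append(attn @ values)         # one-hot of the symbol at t-k
    return np.concatenate(head_outputs)

# Toy usage with n = 3 over the alphabet {a, b}: for the prefix "abba", the two heads
# retrieve the symbols at positions t-1 and t-2, i.e. "b" and "b". A feed-forward
# output layer (here, a lookup over the four possible contexts) would map this
# concatenation to the conditional distribution p(y_t | y_{t-2} y_{t-1}).
vocab = {"a": 0, "b": 1}
prefix = [np.eye(2)[vocab[s]] for s in "abba"]
print(ngram_context_via_heads(prefix, n=3))        # -> [0. 1. 0. 1.]
```

In the sparse-attention variant, sparsemax would play the role of `hardmax` above, and the logarithmic space complexity noted in point 3 plausibly corresponds to the number of bits needed to encode position indices such as \( t-k \) as the string grows.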
The authors discuss the relevance of n-gram LMs to modern LMs, the limitations and abilities of hard attention transformer LMs, and the probabilistic representational capacity of transformers. They also highlight the connection between transformer LMs and sub-regular languages, noting that while n-gram LMs are simple, they provide a useful theoretical foundation for understanding more complex models.
The paper concludes by emphasizing the utility of non-sequential models of computation for studying transformers, particularly in the context of language modeling. However, it acknowledges the limitations of the n-gram model and suggests that future work should aim to tighten the established lower bounds and explore the upper bounds of transformer LMs' representational capacity.