This paper investigates the representational capacity of transformer language models (LMs) in relation to n-gram LMs, a simple and historically important class of language models. Its central result is that transformer LMs with either hard or sparse attention can exactly represent any n-gram LM, giving a concrete lower bound on their probabilistic representational capacity and a foundational account of how transformer LMs can simulate probability distributions over strings.
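Concretely, an n-gram LM conditions each next-symbol distribution only on the preceding n-1 symbols, and a transformer LM represents it exactly when it defines the same conditional distribution for every context. A standard rendering of this setup (our notation, not necessarily the paper's) is:

```latex
% An n-gram LM: the next symbol depends only on the previous (n - 1) symbols.
\[
  p(y_t \mid \boldsymbol{y}_{<t}) = p\!\left(y_t \mid y_{t-n+1} \cdots y_{t-1}\right),
  \qquad
  p(\boldsymbol{y}) = \prod_{t=1}^{T+1} p\!\left(y_t \mid y_{t-n+1} \cdots y_{t-1}\right),
\]
% where y_{T+1} = EOS and out-of-range positions are padded with BOS symbols.
```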
The paper also explores the role of the number of heads and layers in simulating n-gram LMs, illustrating a trade-off between these architectural resources and the complexity of the non-linear transformations required. These results contribute to understanding the mechanisms transformer LMs can use to represent probability distributions over strings.
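To make the head-per-position intuition concrete, here is a minimal Python sketch (with illustrative names and a toy bigram table of our own; the paper's actual construction works with positional encodings, attention weights, and an output MLP rather than dictionaries): each hard-attention head copies exactly one of the previous n-1 symbols, and a lookup then stands in for the non-linear transformation that produces the next-symbol distribution.

```python
# Minimal sketch, not the paper's construction verbatim: "hard attention" where
# head j at position t attends only to position t - j, so the n-1 heads jointly
# recover the n-gram context. The table `ngram_probs` is an illustrative assumption.
from collections import defaultdict

BOS, EOS = "<bos>", "<eos>"

def hard_attention_context(tokens, t, n):
    """Heads j = n-1, ..., 1: head j copies the symbol at position t - j (BOS if out of range)."""
    return tuple(tokens[t - j] if t - j >= 0 else BOS for j in range(n - 1, 0, -1))

def ngram_transformer_step(tokens, t, n, ngram_probs):
    """One decoding step: gather the (n-1)-symbol context via hard attention,
    then map it to a next-symbol distribution (standing in for the output MLP)."""
    context = hard_attention_context(tokens, t, n)
    return ngram_probs[context]  # dict: next symbol -> probability

# Toy bigram example (n = 2): distributions conditioned on the single previous symbol.
ngram_probs = defaultdict(lambda: {EOS: 1.0})
ngram_probs[(BOS,)] = {"a": 0.6, "b": 0.4}
ngram_probs[("a",)] = {"a": 0.3, "b": 0.3, EOS: 0.4}
ngram_probs[("b",)] = {"a": 0.5, EOS: 0.5}

tokens = ["a", "b"]
print(ngram_transformer_step(tokens, t=2, n=2, ngram_probs=ngram_probs))
# {'a': 0.5, '<eos>': 0.5}
```

Intuitively, richer non-linear transformations or deeper stacks can compensate for fewer heads, which is the trade-off the paragraph above refers to.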
The paper addresses the limitations of treating LMs as binary recognizers and advocates analyzing them instead as probabilistic formal languages, i.e., as distributions over strings. It also discusses the connection between transformer LMs and n-gram LMs, which are a special case of sub-regular LMs, and shows that transformer LMs can simulate n-gram LMs through different mechanisms, including hard and sparse attention, highlighting the importance of parallelizable processing in the transformer architecture.
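The distinction is that a recognizer only decides string membership, whereas an LM assigns a normalized probability to every string; in standard notation (again ours, not necessarily the paper's):

```latex
% Recognizer vs. language model over an alphabet Sigma.
\[
  \text{recognizer:}\quad L \subseteq \Sigma^{*}
  \qquad\text{vs.}\qquad
  \text{LM:}\quad p \colon \Sigma^{*} \to [0,1],
  \quad \sum_{\boldsymbol{y} \in \Sigma^{*}} p(\boldsymbol{y}) = 1 .
\]
```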
The paper provides theoretical insights into the representational capacity of transformer LMs, showing that they can represent a wide range of probability distributions, including n-gram LMs. It also discusses the implications of these findings for the interpretability and computational power of transformer LMs. The results suggest that transformer LMs can represent more than just n-gram LMs, aligning with their empirical success in language modeling tasks. The study concludes that non-sequential models of computation are valuable for understanding the capabilities of transformers.