The paper "Why are Sensitive Functions Hard for Transformers?" by Michael Hahn and Mark Rofin explores the theoretical underpinnings of why certain functions, particularly those with high sensitivity, are difficult for transformers to learn. The authors identify a key issue: the loss landscape of transformers is constrained by input-space sensitivity, leading to isolated points in parameter space where high-sensitivity functions are located. This results in a low-sensitivity bias in generalization, explaining phenomena such as the difficulty in learning the PARITY function and the bias towards low-degree functions.
The paper provides a rigorous theoretical framework to understand these empirical observations. It proves that for transformers, high sensitivity in input space corresponds to sharp minima in the loss landscape, making it challenging to find stable solutions. The authors show that transformers fitting sensitive functions must have either large parameter norms or large layer norm blowup, and that the required magnitude grows with input length. This trade-off between parameter norm and layer norm blowup is what makes representing high-sensitivity functions costly.
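To make the "layer norm blowup" notion concrete: LayerNorm rescales its input by roughly 1/std(x), so its Jacobian grows as the pre-normalization activations shrink toward a constant vector. The snippet below is a minimal numerical illustration of this 1/std scaling under a standard LayerNorm without learned affine parameters; it is not the paper's code, and `layer_norm` here is a hand-rolled stand-in.

```python
# Illustration: the same small input perturbation causes a much larger change in the
# LayerNorm output as the pre-normalization activations shrink, because LayerNorm's
# Jacobian scales like 1/std(x). This is the effect referred to as "LayerNorm blowup".
import numpy as np

def layer_norm(x, eps=0.0):
    # Standard LayerNorm without learned scale/offset parameters.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
delta = 1e-3 * rng.standard_normal(16)   # fixed small perturbation

for scale in [1.0, 1e-1, 1e-2, 1e-3]:
    x = scale * rng.standard_normal(16)  # pre-LayerNorm activations at varying magnitude
    change = np.linalg.norm(layer_norm(x + delta) - layer_norm(x))
    print(f"activation scale {scale:>6}: output change {change:.4f}")
```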
Empirical results support these theoretical findings: transformers trained to fit sensitive functions exhibit increased sharpness and require larger parameter norms or layer norm blowup. The paper also discusses what these results imply for transformers' generalization behavior, including the bias towards low-sensitivity functions and the difficulty of length generalization on functions like PARITY.
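One common way to make the sharpness claim concrete is a perturbation-based proxy: how much the loss can rise within a fixed-radius ball around a minimum. The toy sketch below uses two hand-made quadratic losses rather than trained transformers, so it illustrates the measurement idea only, not the paper's actual experimental protocol.

```python
# Illustration: a perturbation-based proxy for minimum sharpness, in the spirit of
# (but not identical to) the paper's empirical sharpness measurements.
import numpy as np

def sharpness(loss, w, rho=0.05, n_dirs=256, seed=0):
    """Largest loss increase over random parameter perturbations of norm rho around w."""
    rng = np.random.default_rng(seed)
    base = loss(w)
    worst = 0.0
    for _ in range(n_dirs):
        d = rng.standard_normal(w.shape)
        d *= rho / np.linalg.norm(d)
        worst = max(worst, loss(w + d) - base)
    return worst

# Two toy 2-D losses, both minimized at the origin: a flat bowl and a sharp one.
flat_loss  = lambda w: float(w @ w)
sharp_loss = lambda w: float(400.0 * (w @ w))

w_star = np.zeros(2)
print("flat minimum :", sharpness(flat_loss,  w_star))   # ~ rho^2
print("sharp minimum:", sharpness(sharp_loss, w_star))   # ~ 400 * rho^2
```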
Overall, the study highlights the importance of considering both the expressiveness and the loss landscape of transformers to fully understand their learning capabilities and limitations.