Why are Sensitive Functions Hard for Transformers?

27 May 2024 | Michael Hahn, Mark Rofin
Transformers struggle to learn highly sensitive functions because input-space sensitivity constrains their loss landscape. Theoretical analysis shows that parameter settings realizing high-sensitivity functions are isolated and correspond to sharp minima, which also makes generalization to longer inputs difficult. This explains both the known bias of transformers toward low-sensitivity, low-degree functions and their failures of length generalization on tasks such as PARITY.

The theory further shows that representing a high-sensitivity function requires large parameter norms and substantial layer-norm blowup, both of which grow with input length. Empirical studies confirm these predictions: transformers trained to compute sensitive functions on longer inputs sit at sharper minima, and using a scratchpad reduces the effective sensitivity of the task.

Together, these results unify a range of empirical observations about transformer learning biases. Transformers generalize better when the target function has bounded sensitivity, and understanding their capabilities requires studying not only their expressivity but also the shape of their loss landscape.
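To make the central notion concrete, here is a minimal sketch (in Python, not taken from the paper) of average sensitivity over the Boolean cube: the expected number of single-bit flips that change a function's output. PARITY attains the maximum value n, while a function like MAJORITY scores far lower. The helper name avg_sensitivity and the choice of comparison function are illustrative assumptions.

```python
from itertools import product

def avg_sensitivity(f, n):
    """Average sensitivity of a Boolean function f: {0,1}^n -> {0,1}.

    For each input x, count how many of the n single-bit flips change
    f's output, then average that count over all 2^n inputs.
    (Illustrative sketch; not code from the paper.)
    """
    total = 0
    for x in product((0, 1), repeat=n):
        y = f(x)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != y:
                total += 1
    return total / 2 ** n

parity = lambda x: sum(x) % 2                 # output flips under every single-bit change
majority = lambda x: int(sum(x) > len(x) / 2)  # a low-sensitivity comparison point

n = 7
print(avg_sensitivity(parity, n))    # 7.0 -> maximal sensitivity n
print(avg_sensitivity(majority, n))  # 2.1875 -> grows only on the order of sqrt(n)
```

On this view, the paper's claim is that the functions at the top of this sensitivity scale are exactly the ones whose transformer parameterizations sit at isolated, sharp minima.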
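The scratchpad observation can be sketched in the same spirit: decomposing PARITY into a chain of running parities turns one maximally sensitive prediction into n steps, each of which depends on only two values (the previous running parity and the next bit), so each step's sensitivity is bounded by 2 regardless of input length. The decomposition below is a hypothetical illustration of that idea, not the paper's experimental setup.

```python
def parity_with_scratchpad(bits):
    """Compute PARITY as a sequence of low-sensitivity steps.

    Rather than predicting the parity of all n bits in one shot, emit
    a running parity after each bit. Each step is an XOR of just two
    inputs, so its sensitivity is at most 2 for any n.
    (Illustrative sketch; not code from the paper.)
    """
    scratchpad = []
    acc = 0
    for b in bits:
        acc ^= b              # each step depends on only two values
        scratchpad.append(acc)
    return scratchpad          # the final entry is the parity of all bits

print(parity_with_scratchpad([1, 0, 1, 1]))  # [1, 1, 0, 1] -> overall parity 1
```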