March 6, 2024 | Angeliki Giannou*, Liu Yang†, Tianhao Wang‡, Dimitris Papailiopoulos§, Jason D. Lee♣
This paper explores the ability of Transformers to implement higher-order optimization methods, specifically Newton's method, for in-context learning. The authors establish that linear attention Transformers with ReLU layers can approximate second-order optimization algorithms for logistic regression, achieving ε error with only a logarithmic number of layers. They also demonstrate that linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with just two layers. These findings suggest that Transformers can implement algorithms more sophisticated than gradient descent. The paper provides concrete constructions with explicit bounds on the depth and width required for these tasks, and supports the theoretical results with experiments. The experiments show that trained Transformers outperform Newton's method over the initial layers and iterations, and that layer normalization significantly improves performance. The paper also compares the predictions of Newton's iteration, run for varying numbers of steps, against those of the Transformer architecture, along with the resulting losses.
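For context, below is a minimal NumPy sketch of the classical Newton (Newton–Schulz) iteration for matrix inversion that the two-layer construction refers to: X_{k+1} = X_k(2I − AX_k), which converges quadratically from a suitable initialization. The function name, initialization choice, and test matrix here are illustrative, not taken from the paper.

```python
import numpy as np

def newton_inverse(A, num_iters=15):
    """Approximate A^{-1} via the Newton-Schulz iteration X <- X(2I - AX)."""
    n = A.shape[0]
    # Standard initialization guaranteeing convergence: X0 = A^T / (||A||_1 ||A||_inf).
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(num_iters):
        X = X @ (2 * I - A @ X)  # one Newton step: two matrix products per iteration
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5)) + 5 * np.eye(5)  # well-conditioned test matrix
    X = newton_inverse(A)
    # Residual ||XA - I|| should be near machine precision after ~15 iterations.
    print(np.max(np.abs(X @ A - np.eye(5))))
```

Because the error contracts quadratically, reaching ε accuracy takes only O(log log(1/ε)) such steps, which is the source of the logarithmic depth bounds discussed in the paper.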