This paper provides the first theoretical analysis of how non-linear transformers learn and generalize in in-context learning (ICL). The study focuses on binary classification tasks and investigates how training a transformer with non-linear self-attention and MLP affects its ICL generalization capability. The paper also analyzes how model pruning impacts ICL performance, proving that magnitude-based pruning can reduce inference costs with minimal impact on ICL.
The paper shows that training a transformer on prompts drawn from a subset of binary classification tasks can yield a model that generalizes to the remaining tasks. It quantifies the amount of training data, the number of training iterations, and the prompt length required, and characterizes the resulting ICL generalization performance. The analysis is based on a simplified single-head, one-layer transformer with softmax self-attention and a ReLU MLP, but the theoretical insights are applicable to practical architectures (a sketch of this simplified architecture is given below).
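To make the analyzed setting concrete, the following is a minimal sketch of a single-head, one-layer transformer with softmax self-attention followed by a ReLU MLP. The class name, dimensions, and scalar readout are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    """Sketch of the simplified model analyzed in the paper:
    single-head softmax self-attention followed by a ReLU MLP."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # query projection
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # key projection
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # value projection
        self.mlp = nn.Sequential(                            # two-layer ReLU MLP
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),                          # scalar score for the binary label
        )

    def forward(self, prompt: torch.Tensor) -> torch.Tensor:
        # prompt: (batch, L + 1, d_model) -- L context examples (input plus
        # label embedding) followed by the query token whose label is unknown.
        q = self.W_q(prompt[:, -1:, :])                      # attend from the query token only
        k, v = self.W_k(prompt), self.W_v(prompt)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)
        context = attn @ v                                   # attention-weighted context summary
        return self.mlp(context).squeeze(-1)                 # predicted label score for the query
```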
The paper proves that when a properly trained transformer receives a prompt, the attention weights are concentrated on contexts that share the same relevant pattern as the query. The ReLU MLP layer then promotes the label embedding of these examples, leading to correct predictions for the query. The paper also provides theoretical justification for magnitude-based pruning in preserving ICL. It shows that pruning neurons with small magnitudes has little effect on generalization, while pruning the remaining neurons leads to a large generalization error that increases with the pruning rate.
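The pruning result can be illustrated with a short sketch of magnitude-based pruning applied to the MLP's hidden neurons. The function name and the use of the output-weight magnitude as the pruning score are assumptions for illustration; the paper's guarantee is that removing the small-magnitude neurons barely affects generalization, whereas removing the large-magnitude ones degrades it in proportion to the pruning rate.

```python
import torch

def prune_small_magnitude_neurons(W_out: torch.Tensor, prune_rate: float) -> torch.Tensor:
    """Zero out the MLP hidden neurons whose outgoing weights have the
    smallest magnitudes (illustrative magnitude-based pruning).

    W_out: (1, d_hidden) output-layer weights, one column per hidden neuron.
    prune_rate: fraction of neurons to remove, starting from the smallest magnitude.
    """
    scores = W_out.abs().sum(dim=0)              # per-neuron magnitude score
    k = int(prune_rate * scores.numel())         # number of neurons to drop
    if k == 0:
        return W_out
    threshold = scores.kthvalue(k).values        # k-th smallest magnitude
    mask = (scores > threshold).float()          # keep only large-magnitude neurons
    return W_out * mask                          # pruned output weights
```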
The paper also discusses related work on the expressive power of ICL, the optimization and generalization of transformers, and the theoretical analysis of pruning. It concludes that the study provides new insights into the training and generalization of transformers in ICL, and the impact of model pruning on ICL performance.