Prompting a Pretrained Transformer Can Be a Universal Approximator

22 Feb 2024 | Aleksandar Petrov, Philip H.S. Torr, Adel Bibi
This paper investigates whether prefix-tuning of a pretrained transformer can universally approximate sequence-to-sequence functions. The authors show that a single attention head can approximate any continuous function on a hypersphere, and that any sequence-to-sequence function can be approximated by prefixing a transformer whose depth is linear in the sequence length. They also provide Jackson-type bounds on the prefix length needed to approximate a function to a desired precision.

The key findings are:

1. A single attention head is sufficient to approximate any smooth continuous function on the hypersphere $ S^m $ to any desired precision $ \epsilon $.
2. The prompt length required to approximate a smooth target function to precision $ \epsilon $ is bounded.
3. Transformers of depth linear in the sequence length can approximate general sequence-to-sequence functions.
4. Prefix-tuning can result in elementwise functions that, when combined with the cross-element mixing already present in the pretrained model, may explain the success of prefix-tuning and prompting.

The paper also discusses the limitations of prefix-tuning, including the need for specific attention and value matrices, and the potential risks of prompting transformers on tasks that require new attention patterns. The authors conclude that prefix-tuning and prompting may be less efficient than training a transformer, but they provide a method for ensuring that a pretrained model can act as a token-wise universal approximator by including at least one attention head conforming to the structure in Lemma 3.
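To make the mechanism under study concrete, the sketch below shows single-head attention with a trainable prefix prepended to the key/value sequence while the pretrained weights stay frozen. This is a minimal illustrative NumPy example, not the authors' construction: the matrices `W_q`, `W_k`, `W_v`, the prefix length, and the random data are placeholder assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention_head(X, P, W_q, W_k, W_v):
    """Single-head attention over input X with a prepended prefix P.

    X: (n, d) input token embeddings.
    P: (p, d) trainable prefix (soft-prompt) embeddings; only P is tuned,
       the head weights W_q, W_k, W_v are frozen.
    """
    XP = np.concatenate([P, X], axis=0)     # prefix tokens are prepended
    Q = X @ W_q                             # queries come from the input only
    K = XP @ W_k                            # keys include the prefix
    V = XP @ W_v                            # values include the prefix
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))     # (n, p + n) attention weights
    return A @ V                            # each output mixes in prefix values

# Toy usage: a random frozen head, a 4-token prefix, and 3 input tokens.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
X = rng.standard_normal((3, d))
P = rng.standard_normal((4, d))             # in prefix-tuning, P would be optimized
print(prefix_attention_head(X, P, W_q, W_k, W_v).shape)   # (3, 8)
```

The sketch makes the paper's central constraint visible: the prefix can only add extra key-value pairs for the input tokens to attend to; it cannot change how the input tokens attend to one another, which is why prefix-tuning elicits token-wise (elementwise) functions whose expressiveness depends on the mixing already built into the pretrained model.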