10 May 2024 | Jean Mercat*, Igor Vasiljevic*, Sedrick Keh*, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar
This paper introduces SUPRA, a method to convert large pre-trained transformers into recurrent neural networks (RNNs) with minimal computational cost. The approach leverages existing strong pre-trained transformers, such as Llama2 and Mistral, and linearizes them into RNNs using a modest fraction of pre-training data. This allows the model to inherit the performance of the pre-trained transformer while reducing inference cost and memory usage. The linearization process involves replacing softmax attention with a linear kernel and a normalization strategy, enabling the model to operate as an RNN during inference.
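To make the recurrent formulation concrete, below is a minimal sketch of linear attention run as an RNN at inference time. The feature map `phi` and the sum-based normalization here are illustrative placeholders, not the paper's exact kernel or normalization strategy; the point is that each step only updates a fixed-size state instead of appending to a growing KV cache.

```python
# Minimal sketch of recurrent linear attention (illustrative assumptions:
# a simple positive feature map and sum-based normalization; SUPRA's actual
# learned kernel and normalization strategy differ in detail).
import torch

def phi(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical positive feature map standing in for the learned kernel."""
    return torch.nn.functional.elu(x) + 1.0

def linear_attention_rnn(queries, keys, values, eps=1e-6):
    """
    queries, keys: (seq_len, d_k); values: (seq_len, d_v).
    Processes tokens one at a time, carrying a fixed-size recurrent state
    (S: d_k x d_v matrix, z: d_k vector) instead of a growing KV cache.
    """
    d_k, d_v = keys.shape[-1], values.shape[-1]
    S = torch.zeros(d_k, d_v)   # running sum of phi(k_t) v_t^T
    z = torch.zeros(d_k)        # running sum of phi(k_t), used for normalization
    outputs = []
    for q_t, k_t, v_t in zip(queries, keys, values):
        fk = phi(k_t)
        S = S + torch.outer(fk, v_t)           # state update
        z = z + fk
        fq = phi(q_t)
        o_t = (fq @ S) / (fq @ z + eps)        # normalized linear-attention output
        outputs.append(o_t)
    return torch.stack(outputs)

# Usage: memory per step is constant regardless of sequence length.
q, k, v = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64)
out = linear_attention_rnn(q, k, v)
print(out.shape)  # torch.Size([16, 64])
```

Because the state (S, z) has fixed size, inference cost per token stays constant, which is the source of the memory and latency savings the paper targets.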
The paper evaluates the linearized models on standard language understanding benchmarks and on long-context tasks. The results show that they perform competitively with the best linear transformers on standard benchmarks but struggle with in-context learning and long-context tasks. The paper also compares against other linear models, including RWKV, Mamba, and RecurrentGemma, and finds that the linearized models do not match the strongest transformers on long-context tasks.
The paper also investigates the limitations of linear models, including their difficulty with long-context tasks and their weaker performance on in-context learning. The authors suggest that more sophisticated recurrent state update rules may be needed to improve performance on these tasks. The paper concludes that applying SUPRA to a strong pre-trained model is the best option given a limited training budget, as it allows the strengths and limitations of recurrent models to be studied with minimal computational cost.