Mechanistic Design and Scaling of Hybrid Architectures

19 Aug 2024 | Michael Poli*,1,7, Armin W Thomas*,2,7, Eric Nguyen*,2, Pragaash Ponnusamy1, Björn Deiseroth3, Kristian Kersting3, Taiji Suzuki4, Brian Hie2,5, Stefano Ermon2,6, Christopher Ré2, Ce Zhang1, Stefano Massaroli1,7
The paper introduces *mechanistic architecture design* (MAD), a methodology for streamlining the development of deep learning architectures. MAD designs and tests new architectures with a suite of small synthetic tasks that probe specific model capabilities, such as in-context recall and compression. Because these tasks are quick and inexpensive to run, they allow rapid prototyping and evaluation. The authors experiment with a range of computational primitives, including gated convolutions, recurrences, and mixtures of experts, and identify new hybrid architectures that outperform state-of-the-art Transformer, convolutional, and recurrent models in scaling efficiency. Key findings include:

1. **Hybridization and Sparsity**: New architectures based on hybridization and sparsity outperform traditional Transformer models.
2. **Scaling Laws**: An extensive scaling-law analysis of more than 500 language models shows that hybrid architectures improve scaling measures, including lower pretraining losses across compute budgets.
3. **State-Optimal Scaling**: The total state dimension of emerging convolutional and recurrent primitives plays a central role in both MAD and the scaling analysis, yielding insights on optimal hybridization ratios and model topology.
4. **New State-of-the-Art Architectures**: Hybrid architectures designed with MAD outperform existing baselines by up to 20% in perplexity at the same compute budget.
5. **Correlation with Scaling Performance**: MAD scores correlate with compute-optimal perplexity at scale, suggesting that small-scale synthetic tasks can predict large-scale performance.

The paper also discusses the ethical implications of improved deep learning models, emphasizing the potential for more efficient and accessible large-scale training and inference.
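To make the MAD workflow concrete, below is a minimal sketch of what a synthetic in-context recall probe might look like: the model sees random key-value pairs followed by a query key and must emit the associated value. The exact task specifications, vocabularies, and evaluation protocol in the paper differ; the helper name `make_recall_batch` and all parameters here are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's exact spec) of an in-context
# recall probe in the spirit of MAD's synthetic tasks.
import torch

def make_recall_batch(batch_size=32, num_pairs=16, vocab_size=256, seed=0):
    """Sequences of random key-value pairs followed by a query key.

    The model must output the value paired with the queried key earlier in
    the same sequence; accuracy on probes like this is the kind of cheap
    signal MAD uses to score candidate architectures before pretraining.
    """
    g = torch.Generator().manual_seed(seed)
    half = vocab_size // 2
    # Sample unique keys per sequence so each query has a single correct value.
    keys = torch.argsort(torch.rand(batch_size, half, generator=g), dim=-1)[:, :num_pairs]
    values = torch.randint(half, vocab_size, (batch_size, num_pairs), generator=g)
    # Interleave keys and values: k1 v1 k2 v2 ... kN vN
    context = torch.stack([keys, values], dim=-1).reshape(batch_size, 2 * num_pairs)
    # Query one previously seen key; the target is its associated value.
    idx = torch.randint(0, num_pairs, (batch_size,), generator=g)
    rows = torch.arange(batch_size)
    inputs = torch.cat([context, keys[rows, idx].unsqueeze(-1)], dim=-1)  # (B, 2N + 1)
    return inputs, values[rows, idx]                                      # targets: (B,)
```

Training a small candidate model to predict the targets from `inputs` (e.g., with cross-entropy on the final position) and reporting accuracy yields an architecture-agnostic score in minutes rather than GPU-days.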
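The hybridization idea itself can be sketched in a few lines: interleave attention layers with gated-convolution layers at a fixed ratio, which is roughly the kind of "striped" topology the scaling analysis sweeps over. The paper's actual operators (e.g., Hyena-style convolutions, state-space recurrences, mixtures of experts) and its best-performing topologies are more involved; every class and parameter name below (`GatedConvMixer`, `attn_every`, ...) is an assumption for illustration only, and channel mixers (MLPs/MoE) are omitted for brevity.

```python
# Illustrative PyTorch sketch of a striped hybrid: alternating a gated
# depthwise convolution mixer with multi-head attention. Not the paper's modules.
import torch
import torch.nn as nn

class GatedConvMixer(nn.Module):
    """Depthwise causal convolution with multiplicative gating (a simplified stand-in
    for gated-convolution/recurrent primitives)."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, L, D)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal crop
        return self.out_proj(u * torch.sigmoid(gate))

class AttentionMixer(nn.Module):
    """Standard causal multi-head self-attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        L = x.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class StripedHybrid(nn.Module):
    """One attention block per `attn_every` layers, gated convolutions elsewhere --
    a crude stand-in for the hybridization ratios the MAD/scaling analysis sweeps over."""
    def __init__(self, dim=256, depth=8, attn_every=4):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.mixers = nn.ModuleList([
            AttentionMixer(dim) if (i + 1) % attn_every == 0 else GatedConvMixer(dim)
            for i in range(depth)
        ])

    def forward(self, x):                       # x: (B, L, D)
        for norm, mixer in zip(self.norms, self.mixers):
            x = x + mixer(norm(x))              # pre-norm residual blocks
        return x
```

Varying `attn_every` changes the ratio of attention to convolution blocks, which is one of the design knobs the paper's analysis of hybridization ratios and state-optimal scaling relates to compute-optimal perplexity.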