19 Aug 2024 | Michael Poli*, Armin W Thomas*, Eric Nguyen*, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli
This paper introduces Mechanistic Architecture Design (MAD), a framework for accelerating the design of deep learning architectures by using small synthetic tasks to evaluate candidate models and predict how they will scale. The goal is to identify and validate new hybrid architectures that outperform existing ones in compute-optimal and state-optimal scaling. MAD consists of a set of synthetic tasks, such as recall, memorization, and compression, each designed to probe a specific model capability; architectures are evaluated on these tasks at small scale, and the results are used to anticipate their performance at larger scale.
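To make the synthetic-task idea concrete, below is a minimal sketch of a toy in-context recall task generator: the prompt lists key-value pairs and then repeats one key, and the model must produce the paired value. The function name, vocabulary split, and token layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def make_recall_batch(batch_size=32, num_pairs=16, vocab_size=256, seed=0):
    """Toy in-context recall task: prompt = k1 v1 k2 v2 ... kN vN kq, target = vq.
    Illustrative sketch only, not the MAD reference implementation."""
    rng = np.random.default_rng(seed)
    half = vocab_size // 2
    prompts, targets = [], []
    for _ in range(batch_size):
        keys = rng.choice(half, size=num_pairs, replace=False)    # keys from the first half of the vocab
        values = rng.integers(half, vocab_size, size=num_pairs)   # values from the second half
        q = rng.integers(num_pairs)                                # which key is queried at the end
        seq = np.stack([keys, values], axis=1).reshape(-1)        # interleave: k1 v1 k2 v2 ...
        prompts.append(np.concatenate([seq, [keys[q]]]))          # append the queried key
        targets.append(values[q])                                 # model should emit its value
    return np.stack(prompts), np.array(targets)

x, y = make_recall_batch()
print(x.shape, y.shape)  # (32, 33) (32,)
```

Accuracy on batches like this isolates a single capability (in-context recall), which is the kind of small-scale signal MAD aggregates across its task suite.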
The study evaluates a wide range of architectures, including hybrid models that combine computational primitives such as attention, convolution, and recurrence. The results show that hybrid architectures, especially those that also exploit sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent baselines in scaling performance. The study also introduces state-optimal scaling laws, which account for the impact of a model's state size on inference efficiency and memory cost.
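As a rough illustration of what hybridization means at the block level, the following sketch interleaves a gated short-convolution block (a stand-in for a fixed-state convolutional or recurrent primitive) with a standard self-attention block in a small decoder stack. The class names, layer ratio, and hyperparameters are assumptions for illustration, not the specific hybrids benchmarked in the paper.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Toy gated short-convolution block standing in for a fixed-state primitive."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (batch, seq, dim)
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal short conv
        return x + self.out_proj(u * torch.sigmoid(gate))     # gated output with residual

class AttentionBlock(nn.Module):
    """Standard pre-norm causal multi-head self-attention block."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        mask = torch.triu(torch.ones(x.shape[1], x.shape[1], dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(h, h, h, attn_mask=mask)           # True entries are masked (causal)
        return x + out

class HybridBlockStack(nn.Module):
    """Interleave convolutional and attention blocks: conv, attn, conv, attn, ..."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [GatedConvBlock(dim) if i % 2 == 0 else AttentionBlock(dim) for i in range(depth)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

x = torch.randn(2, 64, 256)
print(HybridBlockStack()(x).shape)  # torch.Size([2, 64, 256])
```

Swapping the interleaving ratio or the choice of primitive is exactly the kind of design decision the MAD tasks are meant to evaluate cheaply before any large-scale training.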
The paper presents a comprehensive scaling-law analysis of over 500 language models ranging from 70M to 7B parameters. The analysis reveals that hybrid architectures achieve better scaling performance, especially in the compute-optimal and overtrained regimes. The results also show that performance on the MAD synthetic tasks reliably predicts scaling performance, enabling faster and cheaper architecture design.
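For readers unfamiliar with how such curves are obtained, a scaling-law fit typically regresses evaluation loss against training compute with a saturating power law. The sketch below fits loss ≈ a·(C/10¹⁸)⁻ᵇ + c with scipy; the functional form, data points, and resulting coefficients are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (compute, loss) points; in practice these come from models trained
# at several scales, each at (or near) its compute-optimal token budget.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])   # training FLOPs
loss = np.array([3.10, 2.92, 2.75, 2.62, 2.50, 2.41])      # eval loss (made-up numbers)

def scaling_law(c_norm, a, b, irreducible):
    """Saturating power law: loss = a * (C / 1e18)^(-b) + irreducible."""
    return a * c_norm ** (-b) + irreducible

c_norm = compute / 1e18                                      # normalize compute for a well-conditioned fit
params, _ = curve_fit(scaling_law, c_norm, loss, p0=[1.0, 0.2, 2.0], maxfev=10000)
a, b, irreducible = params
print(f"loss ≈ {a:.2f} * (C/1e18)^(-{b:.3f}) + {irreducible:.2f}")
print(f"extrapolated loss at 1e21 FLOPs: {scaling_law(1e21 / 1e18, *params):.3f}")
```

Fitting the same form to each architecture and comparing the resulting exponents and offsets is what supports the scaling comparison described above.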
The study highlights the importance of hybridization and sparsity in improving model performance and efficiency. It also demonstrates that MAD can be used to identify and validate new architectures that outperform existing ones in scaling and performance. Together, the findings suggest that MAD is a valuable tool for accelerating the development of efficient and effective deep learning architectures.