Compute Better Spent: Replacing Dense Layers with Structured Matrices

10 Jun 2024 | Shikai Qiu*, Andres Potapczynski*, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson
This paper explores replacing dense layers in neural networks with structured matrices to improve computational efficiency. The authors systematically investigate several structured matrix families, including low-rank matrices, convolutions, Kronecker products, Monarch matrices, Tensor-Train (TT), and Block Tensor-Train (BTT) matrices. They find that structured matrices often require initialization scales and learning rates different from those of dense layers, and that setting these correctly is crucial for performance; using the Maximal Update Parameterization (μP), they derive the appropriate scaling for both. The study measures scaling laws for each structure to compare how performance improves with compute. The proposed BTT family, which includes Monarch matrices as a special case, outperforms dense matrices on multiple tasks, achieving significantly lower training loss on CIFAR-10/100 with data augmentation and matching the performance of a dense ViT-S/32 on ImageNet-1k with 3.8× less compute. The paper also discusses the importance of structure-aware learning rate scaling and the effect of compute per dimension on efficiency. Finally, it applies structured layers to larger transformer models, demonstrating improved compute efficiency on ImageNet and language modeling tasks.
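To make the idea concrete, below is a minimal PyTorch sketch of a Monarch-style structured layer of the kind studied in the paper: two block-diagonal factors separated by a permutation, which reduces the cost of a d×d matrix multiply from O(d²) to roughly O(d^{3/2}). The class name MonarchLinear and the per-factor 1/√b initialization are illustrative assumptions, not the authors' implementation or their exact μP prescription.

```python
# Minimal sketch (not the authors' code): a Monarch-style structured linear layer
# that replaces a dense d x d weight with two block-diagonal factors separated by
# a permutation, giving roughly O(d^{3/2}) compute instead of O(d^2).
import math
import torch
import torch.nn as nn

class MonarchLinear(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.b = math.isqrt(dim)  # number of blocks; this sketch assumes dim is a perfect square
        assert self.b * self.b == dim, "dim must be a perfect square in this sketch"
        # Two sets of b dense blocks of size b x b, analogous to the BTT/Monarch factors.
        # Structured layers need their own initialization scale; 1/sqrt(b) per factor
        # is a plausible choice here, not the paper's exact muP prescription.
        self.blocks1 = nn.Parameter(torch.randn(self.b, self.b, self.b) / math.sqrt(self.b))
        self.blocks2 = nn.Parameter(torch.randn(self.b, self.b, self.b) / math.sqrt(self.b))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim). Reshape to (..., b, b), apply the first block-diagonal factor,
        # transpose (the permutation between factors), apply the second factor, flatten back.
        shape = x.shape
        x = x.reshape(-1, self.b, self.b)                   # (n, b, b)
        x = torch.einsum("nij,ijk->nik", x, self.blocks1)   # first block-diagonal multiply
        x = x.transpose(1, 2)                               # permutation between factors
        x = torch.einsum("nij,ijk->nik", x, self.blocks2)   # second block-diagonal multiply
        return x.reshape(shape)
```

In spirit, such a layer could be dropped in wherever a square dense nn.Linear would otherwise appear (e.g., the MLP blocks of a ViT), with its learning rate scaled separately from the dense layers in line with the structure-aware μP analysis the paper describes.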