18 Jun 2024 | Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
This paper presents a scalable MatMul-free language model (MatMul-free LM) that eliminates matrix multiplication (MatMul) operations while maintaining strong performance at billion-parameter scales. The model achieves performance comparable to state-of-the-art Transformers while requiring significantly less memory and fewer computational resources. The key innovation is the use of ternary weights in dense layers and element-wise operations in self-attention, which together replace traditional MatMul operations. The model is implemented with hardware-efficient techniques, including a fused BitLinear layer and a custom FPGA accelerator, which reduce memory usage by up to 61% during training and by more than 10× during inference. On custom hardware, the model processes billion-parameter models at 13 W, approaching brain-like efficiency. The MatMul-free LM is shown to be more efficient than conventional Transformers, with the performance gap narrowing as model size increases. The paper also evaluates the model on a range of language tasks, demonstrating strong zero-shot performance. The results highlight the potential of MatMul-free architectures for efficient large language modeling, particularly in terms of memory usage and inference speed. The work challenges the assumption that MatMul operations are essential for high-performance language models and suggests that future accelerators should be optimized for lightweight operations. The code is available at https://github.com/ridgerchu/matmulfreellm.
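The ternary-weight dense layers are the core of the MatMul-free idea: once weights are constrained to {-1, 0, +1}, a matrix product reduces to signed additions that lightweight hardware can handle without multipliers. The sketch below is a minimal PyTorch illustration of a BitLinear-style ternary layer, assuming BitNet-style absmean quantization with a straight-through estimator; the class name `TernaryLinear` and its details are illustrative and are not the paper's fused BitLinear kernel or the repository's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernaryLinear(nn.Module):
    """Simplified sketch of a BitLinear-style dense layer with ternary weights.

    Illustrative only: the paper's fused BitLinear additionally fuses
    normalization and activation quantization into a single GPU kernel,
    which this sketch does not attempt to reproduce.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)

    def ternarize(self, w: torch.Tensor) -> torch.Tensor:
        # Absmean scaling, then round-and-clip to the ternary set {-1, 0, +1}.
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1)
        # Straight-through estimator: quantized values in the forward pass,
        # full-precision gradients in the backward pass.
        return w + (w_q * scale - w).detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With ternary weights the dot products reduce to signed additions,
        # so a dedicated kernel or FPGA can avoid multiplications entirely.
        return F.linear(x, self.ternarize(self.weight))
```

A quick check such as `y = TernaryLinear(512, 512)(torch.randn(4, 512))` exercises the layer. Note that `F.linear` here still runs a floating-point matmul; the MatMul-free benefit described in the paper is realized only when specialized kernels or the FPGA accelerator exploit the ternary weights as additions and subtractions.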