13 May 2024 | Andrei Tomut, Saeed S. Jahromi, Abhijoy Sarkar, Uygar Kurt, Sukhinder Singh, Faysal Ishtiaq, César Muñoz, Prabdeep Singh Bajaj, Ali Elborady, Gianni del Bimbo, Mehrazin Alizadeh, David Montero, Pablo Martín-Ramiro, Muhammad Ibrahim, Oussama Tahiri Alaoui, John Malcolm, Samuel Mugel, and Román Orús
This paper introduces CompactifAI, a novel method for compressing Large Language Models (LLMs) using quantum-inspired tensor networks (TNs). Unlike traditional compression techniques that focus on reducing the number of neurons or numerical precision, CompactifAI targets the correlation space within the model, enabling more controlled and interpretable compression. The method involves tensorizing key layers (self-attention and MLP) using Matrix Product Operators (MPOs), which effectively truncate the correlations in the model while maintaining accuracy. The bond dimension of the MPO controls the level of compression, with smaller dimensions leading to greater compression but potentially lower accuracy. A retraining phase, called "healing," is used to restore accuracy after compression.
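To make the MPO idea concrete, the sketch below factorizes a single weight matrix into two tensor cores via a truncated SVD, with the number of kept singular values playing the role of the bond dimension. The reshaping scheme, the two-core layout, and the `chi` parameter are illustrative assumptions, not the paper's exact decomposition.

```python
# Minimal sketch of MPO-style compression of one weight matrix via a truncated
# SVD. Shapes, the two-core layout, and chi are illustrative assumptions.
import numpy as np

def tensorize_weight(W, m_dims, n_dims, chi):
    """Split W (prod(m_dims) x prod(n_dims)) into a two-core MPO,
    truncating the internal bond to dimension chi."""
    m1, m2 = m_dims
    n1, n2 = n_dims
    # Reshape the matrix into a 4-index tensor and group a (row, col) leg pair per core.
    T = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3)   # (m1, n1, m2, n2)
    M = T.reshape(m1 * n1, m2 * n2)
    # Truncated SVD: keep only the chi largest singular values (the bond dimension).
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    chi = min(chi, S.size)
    A = (U[:, :chi] * S[:chi]).reshape(m1, n1, chi)        # first MPO core
    B = Vh[:chi, :].reshape(chi, m2, n2)                   # second MPO core
    return A, B

def contract_back(A, B):
    """Recombine the cores to check the approximation error."""
    m1, n1, chi = A.shape
    _, m2, n2 = B.shape
    T = np.einsum('abk,kcd->abcd', A, B)                   # (m1, n1, m2, n2)
    return T.transpose(0, 2, 1, 3).reshape(m1 * m2, n1 * n2)

# Example: compress a 64x64 weight matrix with bond dimension 8.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B = tensorize_weight(W, (8, 8), (8, 8), chi=8)
W_approx = contract_back(A, B)
print(f"params: {W.size} -> {A.size + B.size}, "
      f"rel. error: {np.linalg.norm(W - W_approx) / np.linalg.norm(W):.3f}")
```

In the paper's pipeline, the truncated cores would then be "healed" by a brief retraining pass so the model recovers the accuracy lost to truncation.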
The results show that combining CompactifAI with quantization reduces the memory size of the LLaMA-2 7B model by 93% and the number of parameters by 70%, while speeding up training by 50% and inference by 25%, with only a 2-3% accuracy drop. The method also enables layer sensitivity profiling, which reveals that deeper layers are more amenable to tensor network compression, in line with recent findings that deeper layers contribute comparatively little to LLM performance. These results suggest that standard LLMs are heavily overparametrized and need not be as large as they are.
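A hedged sketch of what such layer sensitivity profiling could look like: each layer of a toy model is compressed in isolation (here a plain rank truncation stands in for the MPO decomposition) and the change in a validation metric is recorded, so the least sensitive layers can be compressed most aggressively. The toy model, random data, and `eval_fn` metric are stand-ins, not the paper's setup.

```python
# Hedged sketch of layer sensitivity profiling on a toy model.
import copy
import torch
import torch.nn as nn

def low_rank_compress(linear, rank):
    """Replace a Linear layer by a rank-truncated SVD factorization,
    a stand-in for the MPO truncation used by CompactifAI."""
    W = linear.weight.data
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    r = min(rank, S.numel())
    seq = nn.Sequential(
        nn.Linear(W.shape[1], r, bias=False),
        nn.Linear(r, W.shape[0], bias=linear.bias is not None),
    )
    seq[0].weight.data = Vh[:r, :]
    seq[1].weight.data = U[:, :r] * S[:r]
    if linear.bias is not None:
        seq[1].bias.data = linear.bias.data.clone()
    return seq

def profile_sensitivity(model, eval_fn, rank):
    """Compress each layer in isolation and report the metric change."""
    baseline = eval_fn(model)
    report = {}
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Linear):
            trial = copy.deepcopy(model)
            setattr(trial, name, low_rank_compress(getattr(trial, name), rank))
            report[name] = eval_fn(trial) - baseline
    return report

# Toy usage: a stack of Linear layers with an MSE "metric" on random data.
torch.manual_seed(0)
model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(4)])
x, y = torch.randn(256, 128), torch.randn(256, 128)
eval_fn = lambda m: nn.functional.mse_loss(m(x), y).item()
for layer, delta in profile_sensitivity(model, eval_fn, rank=16).items():
    print(f"layer {layer}: metric change {delta:+.4f}")
```

Layers whose metric barely moves under truncation are candidates for smaller bond dimensions; in the paper's profiling this points to the deeper layers.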
The method is versatile and can be combined with other compression techniques. It is compatible with distributed training and inference, leading to significant speedups. The paper also shows that tensorized models are more efficient in both memory and computation, and that the compression can be tuned for an optimal accuracy-size trade-off. The results highlight the potential of tensor network compression for making LLMs more efficient and accessible, enabling smaller models that can be deployed on-premises without relying on cloud infrastructure. Overall, the work offers a more refined, controllable, and explainable approach to LLM compression than traditional methods.