13 May 2024 | Andrei Tomut, Saeed S. Jahromi, Abhijoy Sarkar, Uygar Kurt, Sukhinder Singh, Faysal Ishtiaq, César Muñoz, Prabdeep Singh Bajaj, Ali Elborady, Gianni del Bimbo, Mehrazin Alizadeh, David Montero, Pablo Martín-Ramiro, Muhammad Ibrahim, Oussama Tahiri Alaoui, John Malcolm, Samuel Mugel, and Román Orús
This paper introduces CompactifAI, a novel method for compressing Large Language Models (LLMs) using quantum-inspired tensor networks (TNs). Unlike traditional compression techniques that focus on reducing the number of neurons or numerical precision, CompactifAI targets the correlation space within the model, enabling more controlled and interpretable compression. The method involves tensorizing key layers (self-attention and MLP) using Matrix Product Operators (MPOs), which effectively truncate the correlations in the model while maintaining accuracy. The bond dimension of the MPO controls the level of compression, with smaller dimensions leading to greater compression but potentially lower accuracy. A retraining phase, called "healing," is used to restore accuracy after compression.
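To make the MPO idea concrete, the sketch below factorizes a single weight matrix into two tensor cores via a truncated SVD, with the number of kept singular values playing the role of the bond dimension. The reshaping scheme, the two-core layout, and the `chi` parameter are illustrative assumptions, not the paper's exact decomposition.

```python
# Minimal sketch of MPO-style compression of one weight matrix via a truncated
# SVD. Shapes, the two-core layout, and chi are illustrative assumptions.
import numpy as np

def tensorize_weight(W, m_dims, n_dims, chi):
    """Split W (prod(m_dims) x prod(n_dims)) into a two-core MPO,
    truncating the internal bond to dimension chi."""
    m1, m2 = m_dims
    n1, n2 = n_dims
    # Reshape the matrix into a 4-index tensor and group a (row, col) leg pair per core.
    T = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3)   # (m1, n1, m2, n2)
    M = T.reshape(m1 * n1, m2 * n2)
    # Truncated SVD: keep only the chi largest singular values (the bond dimension).
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    chi = min(chi, S.size)
    A = (U[:, :chi] * S[:chi]).reshape(m1, n1, chi)        # first MPO core
    B = Vh[:chi, :].reshape(chi, m2, n2)                   # second MPO core
    return A, B

def contract_back(A, B):
    """Recombine the cores to check the approximation error."""
    m1, n1, chi = A.shape
    _, m2, n2 = B.shape
    T = np.einsum('abk,kcd->abcd', A, B)                   # (m1, n1, m2, n2)
    return T.transpose(0, 2, 1, 3).reshape(m1 * m2, n1 * n2)

# Example: compress a 64x64 weight matrix with bond dimension 8.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
A, B = tensorize_weight(W, (8, 8), (8, 8), chi=8)
W_approx = contract_back(A, B)
print(f"params: {W.size} -> {A.size + B.size}, "
      f"rel. error: {np.linalg.norm(W - W_approx) / np.linalg.norm(W):.3f}")
```

In the paper's pipeline, the truncated cores would then be "healed" by a brief retraining pass so the model recovers the accuracy lost to truncation.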
The results show that combining CompactifAI with quantization reduces the memory size of the LLaMA-2 7B model by 93% and the number of parameters by 70%, while speeding up training by 50% and inference by 25%, with only a 2-3% accuracy drop. The method also enables layer sensitivity profiling, which reveals that deeper layers are more amenable to tensor network compression, in line with recent findings that deeper layers contribute comparatively little to LLM performance. These results suggest that standard LLMs are heavily overparametrized and need not be as large as they are.
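A hedged sketch of what such layer sensitivity profiling could look like: each layer of a toy model is compressed in isolation (here a plain rank truncation stands in for the MPO decomposition) and the change in a validation metric is recorded, so the least sensitive layers can be compressed most aggressively. The toy model, random data, and `eval_fn` metric are stand-ins, not the paper's setup.

```python
# Hedged sketch of layer sensitivity profiling on a toy model.
import copy
import torch
import torch.nn as nn

def low_rank_compress(linear, rank):
    """Replace a Linear layer by a rank-truncated SVD factorization,
    a stand-in for the MPO truncation used by CompactifAI."""
    W = linear.weight.data
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    r = min(rank, S.numel())
    seq = nn.Sequential(
        nn.Linear(W.shape[1], r, bias=False),
        nn.Linear(r, W.shape[0], bias=linear.bias is not None),
    )
    seq[0].weight.data = Vh[:r, :]
    seq[1].weight.data = U[:, :r] * S[:r]
    if linear.bias is not None:
        seq[1].bias.data = linear.bias.data.clone()
    return seq

def profile_sensitivity(model, eval_fn, rank):
    """Compress each layer in isolation and report the metric change."""
    baseline = eval_fn(model)
    report = {}
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Linear):
            trial = copy.deepcopy(model)
            setattr(trial, name, low_rank_compress(getattr(trial, name), rank))
            report[name] = eval_fn(trial) - baseline
    return report

# Toy usage: a stack of Linear layers with an MSE "metric" on random data.
torch.manual_seed(0)
model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(4)])
x, y = torch.randn(256, 128), torch.randn(256, 128)
eval_fn = lambda m: nn.functional.mse_loss(m(x), y).item()
for layer, delta in profile_sensitivity(model, eval_fn, rank=16).items():
    print(f"layer {layer}: metric change {delta:+.4f}")
```

Layers whose metric barely moves under truncation are candidates for smaller bond dimensions; in the paper's profiling this points to the deeper layers.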
The method is versatile and can be combined with other compression techniques. It is compatible with distributed training and inference, leading to significant speedups. The paper also shows that tensorized models are more efficient in both memory and computation, and that the compression can be tuned for an optimal accuracy-size trade-off. The results highlight the potential of tensor network compression for making LLMs more efficient and accessible, enabling smaller models that can be deployed on-premises without relying on cloud infrastructure. Overall, the work offers a more refined, controllable, and explainable approach to LLM compression than traditional methods.