30 May 2024 | Vladimir Malinovskii†, Denis Mazur†, Ivan Ilin†, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh‡, Peter Richtarik‡
The paper "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" addresses the challenge of compressing large language models (LLMs) to 1-2 bits per parameter, a task that allows these models to run efficiently on resource-constrained devices. The authors critique the use of straight-through estimators (STE) in fine-tuning compressed weights, which are known to be sub-optimal and can lead to poor performance. They propose PV-Tuning, a novel framework that generalizes and improves upon existing fine-tuning strategies, providing convergence guarantees in certain cases. PV-Tuning is designed to optimize both continuous and discrete parameters simultaneously, using a combination of gradient descent and discrete updates. The method is shown to outperform prior techniques in terms of accuracy and compression efficiency, achieving Pareto-optimal quantization for Llama-2 models at around 2 bits per parameter. The paper also includes a detailed analysis of different quantized representations and fine-tuning algorithms, demonstrating the effectiveness of PV-Tuning across various models and bitwidths.The paper "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" addresses the challenge of compressing large language models (LLMs) to 1-2 bits per parameter, a task that allows these models to run efficiently on resource-constrained devices. The authors critique the use of straight-through estimators (STE) in fine-tuning compressed weights, which are known to be sub-optimal and can lead to poor performance. They propose PV-Tuning, a novel framework that generalizes and improves upon existing fine-tuning strategies, providing convergence guarantees in certain cases. PV-Tuning is designed to optimize both continuous and discrete parameters simultaneously, using a combination of gradient descent and discrete updates. The method is shown to outperform prior techniques in terms of accuracy and compression efficiency, achieving Pareto-optimal quantization for Llama-2 models at around 2 bits per parameter. The paper also includes a detailed analysis of different quantized representations and fine-tuning algorithms, demonstrating the effectiveness of PV-Tuning across various models and bitwidths.