Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs


2024 | Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee
This paper introduces any-precision LLM, a method for deploying multiple, different-sized Large Language Models (LLMs) at low cost. It extends the idea of any-precision Deep Neural Networks (DNNs) to LLMs, allowing a single model to be served at varying bit-widths (e.g., 3, 4, ..., n bits). By quantizing the LLM to different bit-widths and overlaying the resulting models into a memory footprint comparable to that of a single n-bit LLM, the approach significantly reduces deployment costs.

The method relies on post-training quantization (PTQ) and a specialized software engine for efficient serving. Two implementation challenges are addressed: the need for a lightweight quantization method, and the need for a new GPU kernel for quantized matrix-vector multiplication. The resulting engine delivers high inference throughput, matching or outperforming existing quantized matrix-vector multiplication engines.

Evaluations across various models and tasks show that any-precision LLM achieves state-of-the-art model quality at each supported bit-width with minimal memory overhead. The approach is particularly effective for on-device inference, where memory and computational resources are limited. The paper concludes that any-precision LLM provides a cost-effective and memory-efficient way to deploy multiple, different-sized LLMs.
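To make the overlay idea concrete, here is a minimal sketch (not the authors' code) of how differently quantized models can share one memory footprint. It assumes a simple uniform quantizer and a bit-plane layout in which each weight's bits are stored most-significant-bit first, so reading only the top k planes reconstructs a k-bit model while all n planes reconstruct the n-bit parent; the paper itself uses a non-uniform, incrementally derived quantization and a dedicated GPU kernel, and all function names below are illustrative assumptions.

```python
# Illustrative sketch of the "overlay" idea behind any-precision LLMs.
# Assumptions (not from the paper): uniform quantization, MSB-first bit-plane
# storage, and toy helper names. Lower-precision models are simply prefixes
# of the n-bit parent's bit-planes, so no extra copies of the weights are kept.
import numpy as np

def quantize_uniform(w: np.ndarray, n_bits: int):
    """Uniformly quantize weights to unsigned n-bit codes (toy quantizer)."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**n_bits - 1)
    codes = np.round((w - lo) / scale).astype(np.uint32)
    return codes, scale, lo

def to_bitplanes(codes: np.ndarray, n_bits: int) -> np.ndarray:
    """Split n-bit codes into n bit-planes, MSB first (shape: [n_bits, ...])."""
    return np.stack([(codes >> (n_bits - 1 - b)) & 1 for b in range(n_bits)])

def from_bitplanes(planes: np.ndarray, k_bits: int) -> np.ndarray:
    """Rebuild k-bit codes by reading only the top k bit-planes (k <= n)."""
    codes = np.zeros(planes.shape[1:], dtype=np.uint32)
    for b in range(k_bits):
        codes = (codes << 1) | planes[b]
    return codes

def dequantize(codes: np.ndarray, scale: float, lo: float,
               n_bits: int, k_bits: int) -> np.ndarray:
    """Map k-bit codes back to floats on the parent n-bit grid."""
    # Each k-bit code spans 2**(n-k) parent levels; use the span's midpoint.
    step = 2 ** (n_bits - k_bits)
    return lo + (codes * step + (step - 1) / 2) * scale

# Demo: one 8-bit parent tensor serves 3-, 4-, and 8-bit "models" from the
# same storage; lower precisions just read fewer bit-planes.
w = np.random.randn(4, 4).astype(np.float32)
codes, scale, lo = quantize_uniform(w, n_bits=8)
planes = to_bitplanes(codes, n_bits=8)
for k in (3, 4, 8):
    w_k = dequantize(from_bitplanes(planes, k), scale, lo, 8, k)
    print(f"{k}-bit max abs error: {np.abs(w - w_k).max():.4f}")
```

The design point this illustrates is that memory cost is paid once for the n-bit parent, while serving at a lower bit-width is a matter of how many planes the kernel reads, which is what allows several "different-sized" models to coexist with roughly the footprint of one.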