Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

2024 | Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee
This paper introduces the concept of *any-precision LLM* (Large Language Model), which extends the idea of any-precision DNNs to LLMs. It addresses the high cost of deploying multiple, differently sized LLMs by proposing a lightweight method for any-precision quantization and a specialized software engine for serving. The quantization method is post-training: a low-bit model is generated first and then incrementally upscaled to higher bit-widths, so that models of all supported bit-widths share memory. The software engine enables efficient serving by changing the memory layout of the weights, allowing multiple LLMs with varying bit-widths to fit within a single memory footprint. Experimental results demonstrate that the proposed solution significantly reduces deployment costs, maintains state-of-the-art model quality, and achieves high inference throughput, making it a compelling option for deploying multiple, different-sized LLMs. The code is available at <https://github.com/SNU-ARC/any-precision-llm>.
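As a rough sketch of the incremental-upscaling idea, suppose each weight is quantized to an unsigned integer code and the codes are stored as MSB-first bit-planes: a k-bit model is then simply the first k planes of the stored bits, so every lower bit-width is a prefix of the same memory. The function names and the uniform 8-bit setup below are illustrative assumptions for clarity, not the paper's actual algorithm or memory layout.

```python
import numpy as np

MAX_BITS = 8  # bit-width of the "parent" model (assumption for this sketch)

def to_bitplanes(codes: np.ndarray) -> np.ndarray:
    """Split unsigned integer weight codes into MSB-first bit-planes.
    Serving a k-bit model then only needs to read the first k planes."""
    planes = [(codes >> (MAX_BITS - 1 - b)) & 1 for b in range(MAX_BITS)]
    return np.stack(planes).astype(np.uint8)  # shape: (MAX_BITS, *codes.shape)

def from_bitplanes(planes: np.ndarray, k: int) -> np.ndarray:
    """Reassemble the top-k bit-planes into k-bit codes."""
    codes = np.zeros(planes.shape[1:], dtype=np.uint8)
    for b in range(k):
        codes = (codes << 1) | planes[b]
    return codes

# An 8-bit parent and its 3-bit "child": the child is just the top 3 planes,
# so no separate copy of the 3-bit model needs to be stored.
parent = np.array([[200, 13], [77, 255]], dtype=np.uint8)
planes = to_bitplanes(parent)
child3 = from_bitplanes(planes, 3)  # [[6, 0], [2, 7]]
```

In an actual serving engine the planes would reside in GPU memory and be consumed directly by the quantized matmul kernels; the point of the sketch is only that all bit-widths are views over a single set of stored bits rather than separate model copies.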