FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

4 Mar 2024 | Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
This paper presents TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for various quantization bit-widths, enabling efficient inference of large language models (LLMs) with six-bit floating-point (FP6) quantization. FP6 quantization reduces the size of LLMs while preserving model quality, allowing inference on a single GPU at significantly higher throughput than FP16. Integrated into an existing inference system, TC-FPx provides end-to-end support for quantized LLM inference and achieves a better trade-off between inference cost and model quality. Experiments show that FP6-LLM can serve LLaMA-70b on a single GPU with 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.

The paper addresses two main challenges of supporting FP6 quantization on GPUs: memory-unfriendly access patterns for weights with irregular bit-widths, and the high runtime overhead of weight de-quantization. TC-FPx introduces a unified kernel solution that uses Tensor Cores for the compute-intensive matrix multiplications and SIMT cores for on-the-fly weight de-quantization.
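To make the SIMT-core de-quantization step concrete, the sketch below converts one 6-bit weight into an FP16 value with bit-wise operations. It assumes an E3M2 layout (1 sign, 3 exponent, 2 mantissa bits) and omits subnormal handling; it is a minimal illustration, not the paper's actual kernel code.

```cuda
// Minimal sketch of SIMT-core de-quantization (illustrative only, not the
// paper's actual kernel): convert one 6-bit weight, assumed to use an E3M2
// layout (1 sign, 3 exponent, 2 mantissa bits), into an FP16 value.
#include <cuda_fp16.h>
#include <cstdint>

__device__ __forceinline__ __half fp6_e3m2_to_fp16(uint32_t fp6)
{
    uint32_t sign     = (fp6 >> 5) & 0x1u;  // 1 sign bit
    uint32_t exponent = (fp6 >> 2) & 0x7u;  // 3 exponent bits, bias 3
    uint32_t mantissa =  fp6       & 0x3u;  // 2 mantissa bits

    // Re-bias the exponent for FP16 (bias 15) and widen the mantissa to
    // 10 bits. Zero/subnormal inputs are mapped to zero for brevity.
    uint32_t fp16_bits = (sign << 15)
                       | ((exponent ? exponent + (15u - 3u) : 0u) << 10)
                       | (exponent ? (mantissa << 8) : 0u);

    return __ushort_as_half(static_cast<unsigned short>(fp16_bits));
}
```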
TC-FPx also introduces Ahead-of-time Bit-level Pre-packing to resolve the memory-access problems of irregular bit-widths (a host-side packing sketch is given at the end of this summary) and a SIMT-Efficient GPU Runtime to minimize de-quantization overhead. Its software pipeline coordinates the SIMT cores, Tensor Cores, and GPU memory hierarchy so that they work together efficiently. Evaluations on a range of LLMs show that FP6-LLM substantially outperforms the FP16 baseline, and that FP6 quantization delivers better model quality than 4-bit quantization while still providing significant inference speedups. The paper also discusses the design choices behind supporting FP6 on GPUs, including the need for Tensor Core support and the difficulty of irregular bit-width memory access, and concludes that FP6 quantization is a practical option for deploying LLMs with minimal accuracy loss.
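As a simplified picture of what bit-level pre-packing can look like, the host-side sketch below packs 6-bit weights densely into 32-bit words offline, so that the GPU kernel can later fetch them with aligned 32-bit loads rather than unaligned 6-bit accesses. The actual weight layout used by TC-FPx is more elaborate; this is only an assumed illustrative scheme.

```cuda
// Host-side sketch of ahead-of-time bit-level pre-packing (illustrative
// layout only, not TC-FPx's actual format): pack 6-bit weights densely into
// 32-bit words offline, so the GPU kernel can later read them with aligned
// 32-bit loads instead of unaligned 6-bit accesses.
#include <cstdint>
#include <vector>

std::vector<uint32_t> prepack_fp6(const std::vector<uint8_t>& fp6_vals)
{
    std::vector<uint32_t> packed((fp6_vals.size() * 6 + 31) / 32, 0u);
    size_t bit_pos = 0;
    for (uint8_t v : fp6_vals) {
        uint32_t bits = v & 0x3Fu;              // keep the low 6 bits
        size_t word = bit_pos / 32;
        size_t off  = bit_pos % 32;
        packed[word] |= bits << off;            // low part of the value
        if (off > 26)                           // value straddles a word boundary
            packed[word + 1] |= bits >> (32 - off);
        bit_pos += 6;
    }
    return packed;
}
```

With this dense packing, every group of 16 FP6 weights occupies exactly three 32-bit words (16 × 6 = 96 bits), which keeps runtime loads word-aligned.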