FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

4 Mar 2024 | Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
This paper presents TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for various quantization bit-widths, enabling efficient inference of large language models (LLMs) with six-bit floating-point (FP6) quantization. FP6 quantization reduces the size of LLMs while preserving model quality, allowing inference on a single GPU at significantly higher throughput than FP16. Integrated into an existing inference system, TC-FPx provides end-to-end support for quantized LLM inference and achieves a better trade-off between inference cost and model quality. Experiments show that FP6-LLM can serve LLaMA-70b on a single GPU with 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.

The paper addresses two main challenges of supporting FP6 quantization on GPUs: memory-unfriendly access patterns for weights with irregular bit-widths, and the high runtime overhead of weight de-quantization. TC-FPx introduces a unified kernel solution that uses Tensor Cores for the compute-intensive matrix multiplications and SIMT cores for on-the-fly weight de-quantization.
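To make the SIMT-core de-quantization step concrete, the sketch below converts one 6-bit weight into an FP16 value with bit-wise operations. It assumes an E3M2 layout (1 sign, 3 exponent, 2 mantissa bits) and omits subnormal handling; it is a minimal illustration, not the paper's actual kernel code.

```cuda
// Minimal sketch of SIMT-core de-quantization (illustrative only, not the
// paper's actual kernel): convert one 6-bit weight, assumed to use an E3M2
// layout (1 sign, 3 exponent, 2 mantissa bits), into an FP16 value.
#include <cuda_fp16.h>
#include <cstdint>

__device__ __forceinline__ __half fp6_e3m2_to_fp16(uint32_t fp6)
{
    uint32_t sign     = (fp6 >> 5) & 0x1u;  // 1 sign bit
    uint32_t exponent = (fp6 >> 2) & 0x7u;  // 3 exponent bits, bias 3
    uint32_t mantissa =  fp6       & 0x3u;  // 2 mantissa bits

    // Re-bias the exponent for FP16 (bias 15) and widen the mantissa to
    // 10 bits. Zero/subnormal inputs are mapped to zero for brevity.
    uint32_t fp16_bits = (sign << 15)
                       | ((exponent ? exponent + (15u - 3u) : 0u) << 10)
                       | (exponent ? (mantissa << 8) : 0u);

    return __ushort_as_half(static_cast<unsigned short>(fp16_bits));
}
```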
TC-FPx also introduces Ahead-of-time Bit-level Pre-packing to resolve the memory-access problems of irregular bit-widths (a host-side packing sketch is given at the end of this summary) and a SIMT-Efficient GPU Runtime to minimize de-quantization overhead. Its software pipeline coordinates the SIMT cores, Tensor Cores, and GPU memory hierarchy so that they work together efficiently. Evaluations on a range of LLMs show that FP6-LLM substantially outperforms the FP16 baseline, and that FP6 quantization delivers better model quality than 4-bit quantization while still providing significant inference speedups. The paper also discusses the design choices behind supporting FP6 on GPUs, including the need for Tensor Core support and the difficulty of irregular bit-width memory access, and concludes that FP6 quantization is a practical option for deploying LLMs with minimal accuracy loss.
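As a simplified picture of what bit-level pre-packing can look like, the host-side sketch below packs 6-bit weights densely into 32-bit words offline, so that the GPU kernel can later fetch them with aligned 32-bit loads rather than unaligned 6-bit accesses. The actual weight layout used by TC-FPx is more elaborate; this is only an assumed illustrative scheme.

```cuda
// Host-side sketch of ahead-of-time bit-level pre-packing (illustrative
// layout only, not TC-FPx's actual format): pack 6-bit weights densely into
// 32-bit words offline, so the GPU kernel can later read them with aligned
// 32-bit loads instead of unaligned 6-bit accesses.
#include <cstdint>
#include <vector>

std::vector<uint32_t> prepack_fp6(const std::vector<uint8_t>& fp6_vals)
{
    std::vector<uint32_t> packed((fp6_vals.size() * 6 + 31) / 32, 0u);
    size_t bit_pos = 0;
    for (uint8_t v : fp6_vals) {
        uint32_t bits = v & 0x3Fu;              // keep the low 6 bits
        size_t word = bit_pos / 32;
        size_t off  = bit_pos % 32;
        packed[word] |= bits << off;            // low part of the value
        if (off > 26)                           // value straddles a word boundary
            packed[word + 1] |= bits >> (32 - off);
        bit_pos += 6;
    }
    return packed;
}
```

With this dense packing, every group of 16 FP6 weights occupies exactly three 32-bit words (16 × 6 = 96 bits), which keeps runtime loads word-aligned.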