27 Aug 2024 | Han Guo*, William Brandon*, Radostin Cholakov†, Jonathan Ragan-Kelley*, Eric P. Xing‡, Yoon Kim*
FLUTE is a flexible lookup table engine for LUT-quantized large language models (LLMs) that speeds up inference in settings where memory bandwidth, not compute, is the bottleneck. The paper addresses the challenges of weight-only quantization, especially non-uniform, lookup table (LUT) quantization, which is common in low-bit (e.g., 3-bit) settings. FLUTE restructures quantized weight matrices offline to minimize bit manipulations during unpacking, and it vectorizes and duplicates the lookup table to mitigate shared-memory bandwidth constraints. It also employs Stream-K partitioning for a more even workload distribution across the GPU.

FLUTE outperforms existing non-uniform quantization kernels and in some cases matches simpler uniform-quantization kernels. Applied to LLaMA3, it achieves a 1.5 to 2 times increase in end-to-end throughput. The paper also explores a simple extension of LUT-based NormalFloat quantization that is competitive with strong quantization baselines. FLUTE is designed for memory-bound scenarios, such as LLM decoding, and is optimized for Ampere-generation GPUs; however, its performance on A100s falls short of the peak performance of kernels specialized for uniformly quantized matrices. The work highlights the importance of flexible kernels that can support a wide range of quantization settings, especially non-uniform and low-bit quantization.
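To make the LUT-quantization idea concrete, here is a minimal NumPy sketch of the scheme such a kernel has to support: each low-bit index selects an entry from a small lookup table (a NormalFloat-style codebook), which is then rescaled by a per-group scale. The codebook values, group size, and function names below are illustrative assumptions, not FLUTE's actual implementation; in FLUTE the indices are bit-packed and the lookup is fused into a CUDA matmul kernel rather than materialized like this.

```python
import numpy as np

# Illustrative 4-bit lookup table (16 entries). A real NormalFloat-4
# codebook is built from quantiles of a standard normal distribution;
# sorted random values stand in for it here.
LUT = np.sort(np.random.randn(16)).astype(np.float32)

GROUP_SIZE = 64  # per-group scaling granularity (hypothetical choice)


def quantize(w):
    """Quantize a 1-D weight vector to LUT indices plus per-group scales."""
    groups = w.reshape(-1, GROUP_SIZE)
    # One scale per group so the group's values fit the LUT's range.
    scales = np.abs(groups).max(axis=1, keepdims=True) / np.abs(LUT).max()
    # Nearest-entry search: index of the closest LUT value for each weight.
    idx = np.abs(groups[..., None] / scales[..., None] - LUT).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize(idx, scales):
    """Reconstruct approximate weights: table lookup, then rescale."""
    return (LUT[idx] * scales).reshape(-1)


# Round-trip a small weight vector through the codebook.
w = np.random.randn(4 * GROUP_SIZE).astype(np.float32)
idx, scales = quantize(w)
w_hat = dequantize(idx, scales)
print("max abs error:", np.abs(w - w_hat).max())
```

Because the dequantized values come from a table rather than a linear formula, they cannot be recovered with the cheap shift-and-multiply tricks used by uniform-quantization kernels, which is why FLUTE's offline weight restructuring and shared-memory table duplication matter for keeping the lookup off the critical path.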