27 Aug 2024 | Han Guo*, William Brandon*, Radostin Cholakov†, Jonathan Ragan-Kelley*, Eric P. Xing‡, Yoon Kim*
FLUTE is a flexible lookup table engine for LUT-quantized large language models (LLMs) that speeds up inference in settings where memory bandwidth, not compute, is the bottleneck. The paper addresses the challenges of weight-only quantization, especially non-uniform, lookup table (LUT) quantization, which is common in low-bit (e.g., 3-bit) settings. FLUTE restructures quantized weight matrices offline to minimize bit manipulations during unpacking, and it vectorizes and duplicates the lookup table to mitigate shared-memory bandwidth constraints. It also employs Stream-K partitioning for a more even workload distribution across the GPU.

FLUTE outperforms existing non-uniform quantization kernels and in some cases matches simpler uniform-quantization kernels. Applied to LLaMA3, it achieves a 1.5 to 2 times increase in end-to-end throughput. The paper also explores a simple extension of LUT-based NormalFloat quantization that is competitive with strong quantization baselines. FLUTE is designed for memory-bound scenarios, such as LLM decoding, and is optimized for Ampere-generation GPUs; however, its performance on A100s falls short of the peak performance of kernels specialized for uniformly quantized matrices. The work highlights the importance of flexible kernels that can support a wide range of quantization settings, especially non-uniform and low-bit quantization.
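To make the LUT-quantization idea concrete, here is a minimal NumPy sketch of the scheme such a kernel has to support: each low-bit index selects an entry from a small lookup table (a NormalFloat-style codebook), which is then rescaled by a per-group scale. The codebook values, group size, and function names below are illustrative assumptions, not FLUTE's actual implementation; in FLUTE the indices are bit-packed and the lookup is fused into a CUDA matmul kernel rather than materialized like this.

```python
import numpy as np

# Illustrative 4-bit lookup table (16 entries). A real NormalFloat-4
# codebook is built from quantiles of a standard normal distribution;
# sorted random values stand in for it here.
LUT = np.sort(np.random.randn(16)).astype(np.float32)

GROUP_SIZE = 64  # per-group scaling granularity (hypothetical choice)


def quantize(w):
    """Quantize a 1-D weight vector to LUT indices plus per-group scales."""
    groups = w.reshape(-1, GROUP_SIZE)
    # One scale per group so the group's values fit the LUT's range.
    scales = np.abs(groups).max(axis=1, keepdims=True) / np.abs(LUT).max()
    # Nearest-entry search: index of the closest LUT value for each weight.
    idx = np.abs(groups[..., None] / scales[..., None] - LUT).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize(idx, scales):
    """Reconstruct approximate weights: table lookup, then rescale."""
    return (LUT[idx] * scales).reshape(-1)


# Round-trip a small weight vector through the codebook.
w = np.random.randn(4 * GROUP_SIZE).astype(np.float32)
idx, scales = quantize(w)
w_hat = dequantize(idx, scales)
print("max abs error:", np.abs(w - w_hat).max())
```

Because the dequantized values come from a table rather than a linear formula, they cannot be recovered with the cheap shift-and-multiply tricks used by uniform-quantization kernels, which is why FLUTE's offline weight restructuring and shared-memory table duplication matter for keeping the lookup off the critical path.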