2024 | Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa
QuIP# is a post-training quantization (PTQ) method that achieves state-of-the-art results in extreme compression regimes, particularly for large language models (LLMs). It introduces three key techniques: incoherence processing with the randomized Hadamard transform (RHT), lattice-based codebooks derived from the E₈ lattice, and inter-layer fine-tuning. Incoherence processing makes weight matrices approximately incoherent, suppressing outliers and enabling more efficient quantization. The E₈-based codebook exploits the lattice's high packing density in eight dimensions while remaining fast to decode at inference time. Inter-layer fine-tuning further improves quantization fidelity. QuIP# outperforms existing PTQ methods, including OmniQuant and AQLM, while enabling faster inference, and its 3-bit models scale better than theoretically lossless 4-bit models, challenging previous claims about 4-bit optimality. QuIP# is designed for fast inference, achieving over 50% of peak memory bandwidth on an NVIDIA RTX 4090. The method is implemented in Python and available at https://github.com/Cornell-RelaxML/quip-sharp. The paper also analyzes how the individual techniques affect model quality and inference speed, highlighting the benefits of structured codebooks and efficient inference strategies.
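To make the incoherence-processing step concrete, here is a minimal NumPy sketch of the RHT idea, not the paper's implementation: it conjugates a weight matrix with random-sign Hadamard matrices so that outlier weights are spread evenly across all entries. The function name `randomized_hadamard_transform` is hypothetical, the dense Sylvester construction assumes power-of-two dimensions, and the real kernels would instead use a fast O(n log n) Walsh-Hadamard transform.

```python
import numpy as np

def randomized_hadamard_transform(W, seed=0):
    """Sketch of RHT incoherence processing: W' = (H_m S_m) W (S_n H_n).

    H is a normalized (orthogonal) Hadamard matrix and S is a random
    diagonal +/-1 sign matrix, so the transform is exactly invertible
    and can be undone around the quantized matmul at inference time.
    """
    rng = np.random.default_rng(seed)
    m, n = W.shape  # assumed to be powers of two in this sketch

    def hadamard(k):
        # Sylvester construction, normalized so H is orthogonal.
        H = np.array([[1.0]])
        while H.shape[0] < k:
            H = np.block([[H, H], [H, -H]])
        return H / np.sqrt(k)

    s_left = rng.choice([-1.0, 1.0], size=m)   # diagonal of S_m
    s_right = rng.choice([-1.0, 1.0], size=n)  # diagonal of S_n
    H_m, H_n = hadamard(m), hadamard(n)
    # H_m * s_left scales column j by s_left[j], i.e. H_m @ diag(s_left);
    # s_right[:, None] * H_n scales row i, i.e. diag(s_right) @ H_n.
    W_inc = (H_m * s_left) @ W @ (s_right[:, None] * H_n)
    return W_inc, s_left, s_right
```

Because the transform is orthogonal, quantizing `W_inc` and inverting the transform at inference introduces no extra error beyond the quantization itself; only the sign vectors and the (fixed) Hadamard structure need to be stored.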
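For the codebook side, the sketch below rounds an 8-dimensional vector to its nearest E₈ lattice point using the standard decomposition E₈ = D₈ ∪ (D₈ + ½·1), where D₈ is the set of integer vectors with even coordinate sum. The helper names `nearest_D8` and `nearest_E8` are hypothetical; the paper's actual E8P codebook quantizes against a structured 2¹⁶-entry subset built from E₈ points rather than the full lattice, which is what makes 2-bit storage and fast decoding possible.

```python
import numpy as np

def nearest_D8(x):
    """Nearest point of D8 (integer vectors with an even coordinate sum)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Flip the coordinate with the largest rounding error to its
        # second-nearest integer, restoring an even coordinate sum.
        i = int(np.argmax(np.abs(x - f)))
        step = np.sign(x[i] - f[i])
        f[i] += step if step != 0 else 1.0
    return f

def nearest_E8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2), picked by distance."""
    half = np.full(8, 0.5)
    c0 = nearest_D8(x)
    c1 = nearest_D8(x - half) + half
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=8)
    scale = 1.0  # a per-group scale would be chosen in practice
    print(w)
    print(nearest_E8(w / scale) * scale)
```

Quantizing eight weights jointly against E₈, rather than rounding each scalar independently, is what yields the packing-density advantage the summary refers to: E₈ achieves the densest sphere packing in eight dimensions, so a codebook built on it covers the weight distribution with lower distortion at the same bit budget.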