QLoRA: Efficient Finetuning of Quantized LLMs

23 May 2023 | Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
QLoRA is an efficient fine-tuning approach that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance. It backpropagates gradients through a frozen, 4-bit quantized pretrained model into Low-Rank Adapters (LoRA), and introduces three innovations to manage memory without sacrificing performance: 4-bit NormalFloat (NF4), a data type that is information-theoretically optimal for normally distributed weights; Double Quantization, which quantizes the quantization constants themselves to further reduce the memory footprint; and Paged Optimizers, which handle memory spikes during training.

The best model family, named Guanaco, outperforms all previously released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance with only 24 hours of fine-tuning on a single GPU. The authors trained over 1,000 models across various instruction datasets, model architectures, and scales, demonstrating that QLoRA can achieve state-of-the-art results with smaller models. They also provide a detailed analysis of chatbot performance using both human and GPT-4 evaluations, highlighting the limitations of current chatbot benchmarks. The paper releases all models and code, including CUDA kernels for 4-bit training, to facilitate further research.
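To make the three techniques concrete, here is a minimal sketch of a QLoRA-style setup using the Hugging Face transformers/peft/bitsandbytes stack, which implements the paper's methods. The model id, LoRA hyperparameters, and training arguments are illustrative assumptions, not the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # illustrative; any causal LM on the Hub works

# 4-bit NF4 quantization with Double Quantization, per the paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-Rank Adapters: the only trainable parameters; the 4-bit base stays frozen.
lora_config = LoraConfig(
    r=64,                                # hyperparameters here are assumptions
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Paged Optimizer: pages optimizer state between GPU and CPU to absorb memory spikes.
args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_32bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
)
```

The key design point is that quantization error from the frozen 4-bit weights can be compensated during fine-tuning, since the LoRA adapters are trained in 16-bit precision on top of the quantized base.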