QLoRA: Efficient Finetuning of Quantized LLMs


23 May 2023 | Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
QLoRA is an efficient method for fine-tuning large language models (LLMs) with reduced memory usage, enabling the fine-tuning of a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance. It combines 4-bit NormalFloat (NF4) quantization, double quantization, and paged optimizers to cut memory without sacrificing accuracy, reducing the average memory required to fine-tune a 65B model from over 780GB to less than 48GB.

The Guanaco model family, trained with QLoRA, outperforms previous openly released models on the Vicuna benchmark, with Guanaco 65B reaching 99.3% of ChatGPT's performance after only 24 hours of fine-tuning on a single GPU. QLoRA's efficiency also enabled the training of more than 1,000 models across a range of sizes and architectures, providing insights into instruction following and chatbot performance at scales that were previously infeasible due to memory constraints. The results show that QLoRA can recover 16-bit performance using a small, high-quality dataset, even with models smaller than the previous state of the art.

QLoRA's performance is validated through both human and GPT-4 evaluations, which indicate that GPT-4 is a reliable alternative to human evaluation. The study also highlights the limitations of current chatbot benchmarks and the importance of diverse evaluation methods. Guanaco models perform strongly on both the Vicuna and OA benchmarks, making QLoRA a promising approach for training state-of-the-art chatbots on large-scale data.
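As a rough back-of-envelope check on the memory numbers: full 16-bit fine-tuning needs on the order of 12 bytes per parameter (2 for weights, 2 for gradients, 8 for Adam optimizer state), which for 65B parameters is roughly 780GB, whereas 4-bit weights take about 0.5 bytes per parameter, roughly 33GB, leaving room for the small LoRA adapters and activations on a single 48GB GPU. The sketch below illustrates how the pieces fit together in practice; it is a minimal example assuming the Hugging Face transformers, peft, accelerate, and bitsandbytes libraries, and the checkpoint name is a placeholder, not the authors' exact training setup.

```python
# Minimal QLoRA-style fine-tuning setup (sketch, not the authors' exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

model_id = "huggyllama/llama-7b"  # placeholder base checkpoint

# 4-bit NormalFloat (NF4) weights with double quantization of the
# quantization constants; forward/backward compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate
)

# Freeze the 4-bit base weights and train only low-rank adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Paged AdamW spills optimizer state to CPU RAM on memory spikes,
# so only the trainable LoRA parameters are passed to it.
optimizer = bnb.optim.PagedAdamW32bit(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)
```

Only the LoRA adapter weights receive gradients and optimizer state; the quantized base model stays frozen, which is what keeps the whole fine-tuning run within a single-GPU memory budget.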