28 Feb 2024 | James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
BitDelta is a novel method that quantizes the weight delta between a fine-tuned model and its base model to just 1 bit, significantly reducing storage and GPU memory requirements. By decomposing the fine-tuned model's weights into the pre-trained base weights plus an additional *delta*, BitDelta incurs minimal performance degradation while letting a single high-precision base model be shared across multiple 1-bit deltas. This reduces GPU memory consumption by more than 10×, which in turn improves generation latency in multi-tenant serving. Experiments on Llama-2 and Mistral models validate the approach across model sizes and fine-tuning techniques, making BitDelta a promising route to efficient, scalable deployment of many fine-tuned models at low inference cost.
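To make the decomposition concrete, here is a minimal PyTorch sketch of the core compression step: the delta is approximated as a sign matrix times a per-matrix scale, with the scale initialized to the mean absolute value of the delta (the L2-optimal scale for a sign-based approximation). This omits the paper's subsequent distillation-based calibration of the scales, and the function names `quantize_delta` / `apply_delta` are illustrative, not from the BitDelta codebase.

```python
import torch

def quantize_delta(base_weight: torch.Tensor, finetuned_weight: torch.Tensor):
    """Compress the fine-tune delta to 1 bit per parameter plus one scale.

    Approximates delta ~= alpha * sign(delta), where alpha is the mean
    absolute value of the delta (minimizes L2 error for a fixed sign matrix).
    """
    delta = finetuned_weight - base_weight
    sign = torch.sign(delta)       # 1-bit component: +1 / -1 per entry
    alpha = delta.abs().mean()     # per-matrix high-precision scale
    return sign, alpha

def apply_delta(base_weight: torch.Tensor,
                sign: torch.Tensor,
                alpha: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate fine-tuned weight: base + alpha * sign."""
    return base_weight + alpha * sign

# Toy usage: a base matrix and a lightly fine-tuned variant.
base = torch.randn(1024, 1024)
finetuned = base + 0.01 * torch.randn(1024, 1024)
sign, alpha = quantize_delta(base, finetuned)
approx = apply_delta(base, sign, alpha)
print(f"scale alpha = {alpha:.5f}, "
      f"reconstruction error = {(approx - finetuned).norm() / finetuned.norm():.4f}")
```

Since `sign` packs to 1 bit per parameter, each fine-tuned variant costs roughly 1/16 the memory of a 16-bit delta, which is what enables serving one shared base model alongside many compressed deltas.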