An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

19 Jul 2024 | Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno
This study evaluates the performance of LLaMA3 under low-bit quantization, covering both large language models (LLMs) and multimodal large language models (MLLMs). LLaMA3, released by Meta and trained on over 15 trillion tokens, is a powerful open-source LLM with strong performance across a wide range of tasks.

The study examines two families of low-bit quantization for LLaMA3: post-training quantization (PTQ) and LoRA fine-tuning (LoRA-FT), assessing their impact on model performance, especially in resource-constrained scenarios. Multiple quantization methods are evaluated at bit-widths from 1 to 8 bits on diverse benchmarks spanning language and visual-language tasks, and LLaVA-Next-8B, a multimodal model built on LLaMA3, is tested at ultra-low bit-widths.

The results show that while quantized LLaMA3 remains usable, performance degrades significantly, particularly at ultra-low bit-widths, and the degradation appears in both language and visual-language settings, highlighting the challenges of compressing LLaMA3 for practical deployment. Among the PTQ methods compared, SliM-LLM and BiLLM achieve relatively high accuracy at low bit-widths. LoRA-FT methods such as QLoRA and IR-QLoRA, however, show limited effectiveness in compensating for quantization errors, especially when fine-tuned on smaller datasets like Alpaca. The results also indicate that LLaMA3-70B is more robust to quantization than LLaMA3-8B.

For MLLMs, the study finds that even advanced PTQ methods suffer significant degradation at ultra-low bit-widths, with some models experiencing complete functional collapse. These findings underscore the need for better quantization techniques, especially for MLLMs: although LLaMA3 is a powerful model, its behavior under low-bit quantization requires further optimization before practical deployment. The study provides valuable insights for future research on LLM and MLLM quantization aimed at improving model efficiency while preserving accuracy.
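To make the bit-width discussion concrete, the sketch below shows round-to-nearest (RTN) asymmetric uniform quantization, the simplest PTQ baseline. This is not the paper's evaluated methods (GPTQ, AWQ, SliM-LLM, BiLLM, and others add error compensation, activation awareness, or mixed precision on top of this idea); it is a minimal illustration of why quantization error, and hence accuracy loss, grows as the bit-width shrinks. The tensor shape and bit-widths are illustrative choices, not values from the paper.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int):
    """Round-to-nearest asymmetric uniform quantization of a weight tensor.

    A minimal PTQ baseline for illustration only; the methods in the
    paper build more sophisticated machinery on top of this idea.
    """
    qmax = 2**bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax          # step size of the uniform grid
    zero_point = np.round(-w_min / scale)   # integer code that maps back to ~w_min
    q = np.clip(np.round(w / scale + zero_point), 0, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Map integer codes back to approximate float weights."""
    return scale * (q.astype(np.float32) - zero_point)

# Quantization error grows sharply as the bit-width shrinks -- the
# effect the study measures on LLaMA3 from 8 bits down to 1 bit.
w = np.random.randn(4096).astype(np.float32)
for bits in (8, 4, 3, 2):
    q, s, z = quantize_rtn(w, bits)
    err = np.abs(w - dequantize(q, s, z)).mean()
    print(f"{bits}-bit RTN: mean abs error = {err:.4f}")
```

Running this shows the mean reconstruction error roughly doubling with each bit removed, which mirrors the pattern the study reports: mild degradation at 8 and 4 bits, severe degradation at 3 bits and below.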
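The LoRA-FT methods discussed above (QLoRA, IR-QLoRA) freeze a quantized base model and train only small low-rank adapters. The sketch below shows the core LoRA idea under simplifying assumptions: the frozen base weight is kept in float here (QLoRA would store it in 4-bit NF4 and dequantize on the fly), and the layer sizes, rank, and scaling are hypothetical choices, not the papers' recipes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen (conceptually quantized) linear layer with a trainable
    low-rank update: y = W x + (alpha / r) * B A x.

    Simplified sketch of the LoRA-FT idea; the base weight stays in
    float here instead of a 4-bit format for clarity.
    """
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen base weights
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Only the adapter parameters receive gradients, so fine-tuning needs far
# less memory than full fine-tuning of the quantized model.
layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter params: {trainable}")  # 2 * 8 * 4096 = 65,536
```

The adapter adds only tens of thousands of trainable parameters per layer, which explains both its appeal and the study's finding: with so few degrees of freedom and a small fine-tuning set like Alpaca, the adapters cannot fully recover the capability lost to low-bit quantization.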