EfficientQAT is a novel quantization-aware training (QAT) method designed to compress large language models (LLMs) efficiently. It introduces two phases: Block-wise Training of All Parameters (Block-AP) and End-to-End Training of Quantization Parameters (E2E-QP). Block-AP trains all parameters within each transformer block via block-wise reconstruction, which keeps memory usage low and training efficient. E2E-QP then trains only the quantization parameters (step sizes) end to end on the full model, further improving quantized performance. Across various models and quantization levels, EfficientQAT outperforms existing methods in both accuracy and efficiency: it produces a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours with less than 3% accuracy degradation. EfficientQAT also excels at instruction tuning, surpassing existing quantized parameter-efficient fine-tuning (Q-PEFT) methods. The method is memory-efficient and hardware-friendly, making it suitable for deployment on memory-limited platforms. Overall, the results show that EfficientQAT substantially improves the performance of quantized LLMs while keeping memory usage and training time low.
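To make the two phases concrete, below is a minimal PyTorch sketch of Block-AP and E2E-QP on a toy stack of linear "blocks" with a uniform fake quantizer and learnable per-channel step sizes. The names (`FakeQuantLinear`, `block_ap`, `e2e_qp`), the LSQ-style straight-through estimator, and all hyperparameters are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantLinear(nn.Module):
    """Linear layer whose weight is fake-quantized with a learnable per-channel step size."""

    def __init__(self, in_features, out_features, n_bits=2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.qmax = 2 ** (n_bits - 1) - 1          # e.g. +1 for signed 2-bit
        self.qmin = -self.qmax - 1                 # e.g. -2 for signed 2-bit
        # Per-output-channel step size: the only "quantization parameter" in this sketch.
        self.step = nn.Parameter(self.weight.abs().mean(dim=1, keepdim=True) / self.qmax)

    def forward(self, x):
        # Straight-through estimator on round(), so gradients reach both the
        # weight (needed for Block-AP) and the step size (needed for E2E-QP).
        w_scaled = torch.clamp(self.weight / self.step, self.qmin, self.qmax)
        w_int = w_scaled + (torch.round(w_scaled) - w_scaled).detach()
        return F.linear(x, w_int * self.step, self.bias)


def block_ap(fp_blocks, q_blocks, calib_x, steps=100, lr=1e-4):
    """Phase 1 (Block-AP): train ALL parameters of each quantized block so that
    its output reconstructs the matching full-precision block's output."""
    x_fp, x_q = calib_x, calib_x
    for fp_blk, q_blk in zip(fp_blocks, q_blocks):
        opt = torch.optim.AdamW(q_blk.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.mse_loss(q_blk(x_q), fp_blk(x_fp).detach())
            loss.backward()
            opt.step()
        with torch.no_grad():                      # feed each block's output to the next one
            x_fp, x_q = fp_blk(x_fp), q_blk(x_q)


def e2e_qp(q_blocks, inputs, targets, steps=100, lr=1e-5):
    """Phase 2 (E2E-QP): freeze weights and biases, train only the step sizes
    end-to-end on a task loss (a toy regression loss here)."""
    step_params = []
    for blk in q_blocks:
        for name, p in blk.named_parameters():
            p.requires_grad_(name == "step")
            if name == "step":
                step_params.append(p)
    model = nn.Sequential(*q_blocks)
    opt = torch.optim.AdamW(step_params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(inputs), targets)
        loss.backward()
        opt.step()


if __name__ == "__main__":
    torch.manual_seed(0)
    fp_blocks = [nn.Linear(64, 64) for _ in range(4)]              # stand-in for transformer blocks
    q_blocks = [FakeQuantLinear(64, 64, n_bits=2) for _ in range(4)]
    for fp_blk, q_blk in zip(fp_blocks, q_blocks):                 # start from the "pretrained" weights
        q_blk.weight.data.copy_(fp_blk.weight)
        q_blk.bias.data.copy_(fp_blk.bias)
    x = torch.randn(256, 64)
    block_ap(fp_blocks, q_blocks, x, steps=50)
    e2e_qp(q_blocks, x, nn.Sequential(*fp_blocks)(x).detach(), steps=50)
```

The sketch mirrors the division of labor described above: Block-AP optimizes every parameter of one block at a time against a local reconstruction target, which bounds peak memory, while E2E-QP only updates the far smaller set of step sizes across the whole model, so the end-to-end stage stays cheap even at 2-bit precision.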