Extending Llama-3's Context Ten-Fold Overnight

2024-04-30 | Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou
This paper presents a method to extend the context length of Llama-3-8B-Instruct from 8K to 80K tokens through efficient fine-tuning with QLoRA. Training is highly efficient, taking only 8 hours on a single 8xA800 (80G) GPU machine.

The model is trained on 3.5K synthetic samples generated by GPT-4, covering three long-context tasks: single-detail QA, multi-detail QA, and biography summarization. The training data is synthesized by slicing long contexts and prompting GPT-4 to generate question-answer pairs about each slice. Fine-tuning uses QLoRA, applying LoRA to all Q, K, V, O projections while also training the embedding layer.

The resulting model achieves 100% accuracy on the Needle-In-A-Haystack task and performs well on long-context benchmarks such as LongBench and InfBench. It outperforms the original Llama-3-8B-Instruct and other baselines on most tasks, though it slightly underperforms on code completion, and it retains strong performance on zero-shot tasks such as MMLU.

The team has made the entire set of training resources publicly available to facilitate future research. The results demonstrate that extending the context length of LLMs is feasible with relatively few resources, highlighting the inherent potential of large language models to operate beyond their original context length.
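To make the data-synthesis step concrete, here is a minimal sketch of slicing a long document and prompting GPT-4 for question-answer pairs. The chunk size, prompt wording, function name, and model identifier are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch: slice a long document into chunks and ask GPT-4 to
# write one question-answer pair grounded in each slice.
# Chunk size, prompt wording, and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_qa_pairs(long_text: str, chunk_chars: int = 16000) -> list[dict]:
    """Slice `long_text` and prompt GPT-4 for one QA pair per slice."""
    samples = []
    for start in range(0, len(long_text), chunk_chars):
        chunk = long_text[start:start + chunk_chars]
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": (
                    "Read the following passage and write one question that "
                    "can only be answered from it, followed by the answer.\n\n"
                    f"{chunk}"
                ),
            }],
        )
        samples.append({
            "context": chunk,
            "qa": response.choices[0].message.content,
        })
    return samples
```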
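And here is a minimal sketch of the fine-tuning setup described above, using the Hugging Face transformers and peft libraries: 4-bit QLoRA with LoRA adapters on the Q, K, V, O projections and the embedding layer kept fully trainable. The rank, alpha, and quantization settings are assumed values, not the authors' reported hyperparameters.

```python
# Sketch of QLoRA fine-tuning: LoRA on the Q, K, V, O projections plus a fully
# trainable embedding layer. Rank, alpha, and 4-bit settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=32,                      # rank (assumed value)
    lora_alpha=16,             # scaling factor (assumed value)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens"],  # train the embedding layer in full
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Because only the low-rank adapters and the embedding weights receive gradients while the base model stays quantized in 4-bit, this kind of setup is what makes an 8-hour run on a single 8xA800 node plausible.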