Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

6 May 2024 | Abhinav Agarwalla*1, Abhay Gupta*2, Alexandre Marques*1, Shubhra Pandit*1, Michael Goin1, Eldar Kurtic1, Kevin Leong2, Tuan Nguyen1, Mahmoud Salem2, Dan Alistarh1,3, Sean Lie2, Mark Kurtz1
This paper introduces a novel approach to creating accurate, sparse foundational versions of large language models (LLMs) that achieve full accuracy recovery on fine-tuning tasks at up to 70% sparsity. The authors combine the SparseGPT one-shot pruning method with sparse pretraining on a subset of the SlimPajama dataset and a Python subset of The Stack dataset. They demonstrate training acceleration on Cerebras CS-3 chips and inference acceleration on CPUs and GPUs using Neural Magic’s DeepSparse and nm-vllm engines, respectively. The results show significant speedups while maintaining high accuracy, with further gains from quantization. The sparse foundational models are evaluated on diverse tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization, demonstrating their generality and effectiveness. The work paves the way for smaller, faster, and more accessible LLMs without sacrificing accuracy. The code and documentation are open-sourced to promote reproducibility and further research.
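To make the sparsity scheme concrete, the sketch below applies one-shot unstructured pruning at 70% to a single linear layer and returns the binary mask that sparse pretraining would keep fixed. It uses simple magnitude pruning for brevity; the paper's SparseGPT method instead performs second-order (Hessian-aware) pruning with weight updates. The `magnitude_prune_` helper and the 4096-dimensional layer are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.7) -> torch.Tensor:
    """One-shot unstructured pruning: zero out the smallest-magnitude weights.

    Returns the binary mask so it can be re-applied during sparse (pre)training
    to keep pruned weights at zero. Simplified stand-in for SparseGPT.
    """
    w = linear.weight.data
    k = int(sparsity * w.numel())                       # number of weights to remove
    threshold = w.abs().flatten().kthvalue(k).values    # k-th smallest magnitude
    mask = (w.abs() > threshold).to(w.dtype)
    linear.weight.data.mul_(mask)                       # apply the sparsity mask in place
    return mask

layer = nn.Linear(4096, 4096)
mask = magnitude_prune_(layer, sparsity=0.7)
print(f"sparsity: {1 - mask.mean().item():.2%}")        # ~70% of weights are zero
```

In the paper's recipe this pruning step is followed by continued pretraining with the mask held fixed, so the remaining weights can recover the accuracy lost at pruning time before task-specific fine-tuning.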