This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by rising cloud costs and latency concerns. The authors focus on designing high-quality LLMs with fewer than one billion parameters, a practical size for mobile deployment. Contrary to the prevailing belief that data and parameter quantity are the dominant determinants of model quality, the study shows that architecture matters greatly at the sub-billion scale: for small LLMs, depth is more important than width, and deep-and-thin models are better at capturing abstract concepts.

Building on these findings, the authors combine a deep-and-thin architecture with embedding sharing (reusing the input embedding matrix as the output classifier) and grouped-query attention to establish a strong baseline network, MobileLLM, which achieves a significant accuracy gain over preceding state-of-the-art models of comparable size.
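To make those two techniques concrete, here is a minimal PyTorch sketch, not the authors' code; the class names and sizes are illustrative placeholders. It shows grouped-query attention, where several query heads share each key/value head, together with input/output embedding sharing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    """Attention in which groups of query heads share one key/value head."""

    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Fewer key/value heads than query heads: this is where the
        # parameter (and KV-cache) saving comes from.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Duplicate each KV head so every query-head group can attend to it.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


class TinyLM(nn.Module):
    """One-block toy model showing input/output embedding sharing."""

    def __init__(self, vocab: int = 32000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = GroupedQueryAttention(dim, n_heads=8, n_kv_heads=2)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.attn(self.embed(ids))
        # Embedding sharing: reuse the embedding matrix as the output
        # classifier instead of storing a separate lm_head weight.
        return h @ self.embed.weight.t()


logits = TinyLM()(torch.randint(0, 32000, (1, 16)))  # -> (1, 16, 32000)
```

With n_heads=8 and n_kv_heads=2, the key/value projections hold a quarter of the parameters of standard multi-head attention, and tying the output classifier to the embedding matrix removes an entire vocab-by-dim weight matrix. Both savings matter most at sub-billion scale, where embeddings account for a large fraction of total parameters.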
The authors further propose an immediate block-wise weight-sharing approach, in which a transformer block is executed again right after its first pass, adding effective depth without increasing model size and with little added latency, since the reused weights need not be fetched from memory a second time (a minimal sketch appears at the end of this summary). The resulting MobileLLM-LS models boost accuracy over MobileLLM without additional memory overhead. The study also ablates other architectural choices, including the feed-forward network variant, depth versus width, embedding sharing, and the number of attention heads and key-value heads, offering guidance for optimizing accuracy under tight storage constraints.

Evaluated on zero-shot common-sense reasoning, question answering, and reading comprehension benchmarks, the MobileLLM family outperforms previous state-of-the-art models of comparable size. It also shows significant improvements on chat benchmarks and achieves correctness close to LLaMA-v2 7B on API-calling tasks, underscoring the capability of well-designed small models for common on-device use cases.
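Finally, here is a minimal sketch of the block-wise weight-sharing idea described above; it is an illustration under assumed block types and sizes, not the paper's implementation. Each stored block is run twice back-to-back, so an 8-block model computes like a 16-layer network while storing only 8 blocks' worth of weights:

```python
import torch
import torch.nn as nn


class RepeatedStack(nn.Module):
    """Run each block `repeats` times in a row, sharing its weights."""

    def __init__(self, blocks: nn.ModuleList, repeats: int = 2):
        super().__init__()
        self.blocks = blocks
        self.repeats = repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Immediate reuse: the block's weights are applied again right
            # away, while they are still resident in fast memory.
            for _ in range(self.repeats):
                x = block(x)
        return x


# Usage: 8 stored blocks behave like a 16-layer network.
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
     for _ in range(8)]
)
stack = RepeatedStack(blocks, repeats=2)
out = stack(torch.randn(1, 16, 512))  # -> (1, 16, 512)
```

Because each block's second application follows immediately after the first, the reuse happens while the weights are still hot in cache, which is why this scheme adds depth at little latency cost on memory-bound mobile hardware.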