MobileLLM is a compact language model designed for efficient on-device use, achieving significant performance improvements over existing sub-billion-parameter models. The paper addresses the need for efficient large language models (LLMs) on mobile devices, motivated by growing cloud costs and latency concerns. MobileLLM, with fewer than a billion parameters, is optimized for mobile deployment by leveraging deep and thin architectures, embedding sharing, and grouped-query attention. It achieves a 2.7%/4.3% accuracy boost over the previous 125M/350M state-of-the-art models, and a variant with immediate block-wise weight sharing, MobileLLM-LS, further improves accuracy by 0.7%/0.8% at those scales. MobileLLM outperforms previous sub-billion models on chat benchmarks and matches LLaMA-v2 7B in API calling correctness, demonstrating the capability of small models for common on-device use cases.
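As a rough illustration of the grouped-query attention mechanism mentioned above, the sketch below shares each key/value head across several query heads, shrinking the KV projections and cache. The dimensions, head counts, and module names are illustrative assumptions, not MobileLLM's published configuration.

```python
# Minimal grouped-query attention (GQA) sketch in PyTorch.
# All sizes below are placeholders chosen for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=576, n_heads=9, n_kv_heads=3):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # K/V are projected to fewer heads than Q, which is the core of GQA.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each K/V head serves n_heads // n_kv_heads query heads, so repeat
        # K and V along the head axis before standard causal attention.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```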
The paper argues that, at sub-billion scale, model architecture matters as much as parameter count. It adopts SwiGLU feed-forward networks, deep and thin architectures, embedding sharing, and grouped-query attention to improve accuracy, and introduces immediate block-wise weight (layer) sharing to increase the number of layers without increasing the number of stored weights, as sketched below. MobileLLM-LS achieves a 0.7/0.8-point accuracy improvement over MobileLLM-125M/350M, respectively. The model family shows significant improvements on chat and API calling tasks, with MobileLLM-350M achieving API-calling correctness comparable to LLaMA-v2 7B.
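The sketch below combines the two weight-reuse ideas just described, embedding sharing and immediate block-wise weight sharing, assuming a hypothetical block_cls placeholder for a standard transformer block (GQA plus a SwiGLU feed-forward network); vocabulary size, width, and depth are illustrative rather than the paper's exact settings.

```python
# Sketch of embedding sharing plus immediate block-wise weight sharing.
import torch.nn as nn

class SharedBlockLM(nn.Module):
    def __init__(self, block_cls, vocab_size=32000, dim=576, n_blocks=15):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # n_blocks unique blocks; each one is executed twice in a row below,
        # so the effective depth is 2 * n_blocks with no extra stored weights.
        self.blocks = nn.ModuleList(block_cls() for _ in range(n_blocks))
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        # Embedding sharing: the output projection reuses the input embedding.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx):
        x = self.tok_emb(idx)
        for block in self.blocks:
            x = block(x)   # first pass through the block
            x = block(x)   # immediate repeat of the same weights
        return self.lm_head(x)
```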
MobileLLM is compatible with quantization and demonstrates strong performance in on-device applications. It is evaluated on zero-shot common sense reasoning, question answering, and reading comprehension. MobileLLM-125M outperforms prior models of comparable size on these tasks and, in several cases, exceeds the accuracy of larger models. The models are also effective in chat and API calling tasks, with MobileLLM-LS-350M achieving a 48.2% win rate on chat benchmarks.
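The paper's exact quantization recipe is not reproduced here; the snippet below only illustrates the general idea of post-training int8 quantization of the linear layers, using PyTorch's dynamic-quantization API on a stand-in model.

```python
import torch
import torch.nn as nn

# Stand-in model: in practice this would be the trained sub-billion LM.
model = nn.Sequential(nn.Linear(576, 1536), nn.SiLU(), nn.Linear(1536, 576))

# Quantize all Linear layers to int8 weights after training.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```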
The paper also examines the effect of model shape, showing that at a fixed parameter budget, deeper and thinner models outperform wider and shallower ones. MobileLLM-LS, with layer sharing, incurs only a 2.2% increase in loading and initialization time and a 2.6% overhead in execution time. The models are memory-efficient and can be deployed on mobile devices with limited memory.
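A back-of-the-envelope calculation, under assumed layer configurations rather than the paper's, shows how a deeper-thinner layout and a wider-shallower one can land at roughly the same non-embedding parameter count, which is the setting in which the depth-versus-width comparison is made.

```python
# Rough per-block parameter count for a block with GQA (K/V at 1/3 width)
# and a SwiGLU FFN; embeddings are ignored. Configs are illustrative only.
def block_params(dim, ffn_mult=2.67, kv_frac=1/3):
    attn = dim * dim * (2 + 2 * kv_frac)      # Q and O full width, K and V reduced
    ffn = 3 * dim * int(ffn_mult * dim)       # SwiGLU: gate, up, and down projections
    return attn + ffn

deep_thin = 30 * block_params(576)     # many thin layers
wide_shallow = 12 * block_params(912)  # few wide layers
print(f"deep-thin ~{deep_thin/1e6:.0f}M params, wide-shallow ~{wide_shallow/1e6:.0f}M params")
```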
The study highlights the effectiveness of MobileLLM in on-device applications, demonstrating that smaller models can achieve performance comparable to larger ones. The paper contributes to the field of model compression and efficient LLM design, showing that sub-billion parameter models can be optimized for mobile deployment without sacrificing performance. MobileLLM is a significant advancement in the development of compact and efficient language models for on-device use.