LLM as a System Service on Mobile Devices

18 Mar 2024 | Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu
This paper introduces a new paradigm for mobile AI: Large Language Models (LLMs) as a system service on mobile devices (LLMaaS). Unlike traditional DNNs, LLMs carry persistent state, primarily the Key-Value (KV) cache, which must be managed across multiple invocations. To address this challenge, the authors propose LLMS, a system that decouples the memory management of app and LLM contexts and centers on fine-grained, chunk-wise, globally optimized KV cache compression and swapping. Key contributions include:

1. **Tolerance-Aware Compression**: compresses each chunk according to its measured accuracy tolerance to compression.
2. **IO-Recompute Pipelined Loading**: introduces recomputation, overlapped with disk I/O, to accelerate loading swapped-out chunks.
3. **Chunk Lifecycle Management**: optimizes memory activity with an ahead-of-time swapping-out approach and an LCTRU queue-based eviction policy.

Evaluations on various edge devices and datasets show that LLMS reduces context-switching latency by up to two orders of magnitude compared with baseline solutions, demonstrating its effectiveness in managing LLM contexts efficiently. The paper also discusses the motivation behind LLMaaS, the challenges in managing LLM contexts, and the design and implementation details of LLMS.
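
As a rough illustration of the chunk-wise design summarized above, the Python sketch below combines two of the ideas in toy form: each KV chunk is quantized to a bit-width chosen from its own measured tolerance, and eviction prefers cold chunks that tolerate compression well. All names (`KVChunk`, `ChunkCacheManager`, `swap_out_to_disk`), the tolerance threshold, and the eviction heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from collections import OrderedDict


class KVChunk:
    """One fixed-size slice of a context's KV cache (illustrative)."""

    def __init__(self, chunk_id, keys, values, tolerance):
        self.chunk_id = chunk_id
        self.keys = keys            # float32 array, e.g. [tokens, heads, head_dim]
        self.values = values
        self.tolerance = tolerance  # measured accuracy tolerance in [0, 1]
        self.bits = 16              # current precision of the stored tensors

    def compress(self):
        # Tolerance-aware compression: a chunk that tolerates more error
        # is quantized more aggressively (fewer bits). The 0.7 threshold
        # is an arbitrary placeholder, not a value from the paper.
        self.bits = 4 if self.tolerance > 0.7 else 8
        qmax = 2 ** (self.bits - 1) - 1
        self.k_scale = float(np.abs(self.keys).max()) or 1.0
        self.v_scale = float(np.abs(self.values).max()) or 1.0
        self.keys = np.round(self.keys / self.k_scale * qmax).astype(np.int8)
        self.values = np.round(self.values / self.v_scale * qmax).astype(np.int8)


def swap_out_to_disk(chunk):
    """Stub for (ahead-of-time, asynchronous) swap-out of a compressed chunk."""
    pass


class ChunkCacheManager:
    """Keeps at most `budget_chunks` chunks resident; evicts cold chunks
    that tolerate compression well, a rough stand-in for an LCTRU-style
    eviction queue."""

    def __init__(self, budget_chunks):
        self.budget = budget_chunks
        self.resident = OrderedDict()  # chunk_id -> KVChunk, in LRU order

    def access(self, chunk):
        # Touching a chunk moves it to the most-recently-used end.
        self.resident.pop(chunk.chunk_id, None)
        self.resident[chunk.chunk_id] = chunk
        while len(self.resident) > self.budget:
            self._evict_one()

    def _evict_one(self):
        # Among the colder half of the queue, pick the chunk with the
        # highest tolerance: it loses the least accuracy when stored in
        # compressed form on disk.
        cold = list(self.resident.values())[: max(1, len(self.resident) // 2)]
        victim = max(cold, key=lambda c: c.tolerance)
        victim.compress()
        del self.resident[victim.chunk_id]
        swap_out_to_disk(victim)
```

A real system would also need the IO-recompute pipelined loading path, which overlaps disk reads of compressed chunks with recomputation of others when a context is swapped back in; that part is omitted from this sketch.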