LLM as a System Service on Mobile Devices

2024 | Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu
This paper proposes LLMS, a system service that runs Large Language Models (LLMs) as a shared service on mobile devices, reducing context switching latency and improving memory efficiency. Unlike traditional DNNs, LLMs carry persistent state across invocations, most notably the Key-Value (KV) cache, which must be preserved to maintain an app's context.

LLMS addresses this challenge with three key techniques: (1) Tolerance-Aware Compression, which compresses KV cache chunks according to how much accuracy loss each chunk can tolerate; (2) a Swapping-Recompute Pipeline, which overlaps I/O with recomputation to accelerate context switching; and (3) Chunk Lifecycle Management, which bounds memory usage through an eviction strategy based on an LCTRU (Least Compression-Tolerable and Recently-Used) queue. By decoupling the memory management of LLM contexts from apps and splitting each KV cache into chunks, LLMS can compress and swap chunks independently, reducing memory overhead and improving context switching performance.
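To make the chunk-based design concrete, here is a minimal sketch of how a service might track per-chunk compression tolerance and evict chunks under a fixed memory budget. The names (KVChunk, ChunkManager, touch) and the exact eviction ordering are assumptions for illustration; the sketch evicts the chunk that best tolerates compression and was used longest ago, which approximates, but is not a faithful reproduction of, the LCTRU policy described in the paper.

```python
from dataclasses import dataclass, field
import time

@dataclass
class KVChunk:
    """One slice of an app's KV cache (illustrative; real chunks hold K/V tensors)."""
    app_id: str
    index: int                    # position of the chunk within the app's context
    tolerance: float              # estimated accuracy tolerance to compression
    compressed: bool = False
    swapped_out: bool = False
    last_used: float = field(default_factory=time.monotonic)

class ChunkManager:
    """Illustrative chunk lifecycle manager with an LCTRU-style eviction order."""

    def __init__(self, budget_chunks: int):
        self.budget = budget_chunks
        self.resident: dict[tuple[str, int], KVChunk] = {}

    def touch(self, chunk: KVChunk) -> None:
        """Mark a chunk as used by the current invocation and admit it to memory."""
        chunk.last_used = time.monotonic()
        chunk.swapped_out = False
        self.resident[(chunk.app_id, chunk.index)] = chunk
        self._enforce_budget()

    def _enforce_budget(self) -> None:
        # Evict until the resident set fits the budget. The victim here is the chunk
        # that best tolerates compression and was used longest ago (an assumption
        # about the intended ordering, not the paper's exact policy).
        while len(self.resident) > self.budget:
            victim_key = max(
                self.resident,
                key=lambda k: (self.resident[k].tolerance, -self.resident[k].last_used),
            )
            self._compress_and_swap_out(self.resident.pop(victim_key))

    def _compress_and_swap_out(self, chunk: KVChunk) -> None:
        # Placeholder: a real service would quantize the chunk according to its
        # tolerance and write it to flash storage.
        chunk.compressed = True
        chunk.swapped_out = True
```

On each LLM invocation the service would touch() the active app's chunks, pushing colder, more compressible chunks of other apps toward compression and swap-out.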
LLMS is implemented on three representative mobile/edge devices and evaluated across different LLMs and context switching patterns. It reduces context switching latency by up to two orders of magnitude compared with competitive baselines while minimizing the memory footprint of LLM contexts, which is critical on memory-constrained mobile devices. These results show that LLMS significantly outperforms existing methods in both context switching latency and memory efficiency, making it a promising solution for on-device LLM execution.
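The swapping-recompute pipeline can be sketched in a similar spirit: when an app's context is switched back in, some KV chunks are reloaded from storage while others are recomputed from their tokens, and the two paths are overlapped so that I/O does not leave the compute unit idle. The helper callables (load_from_flash, recompute, should_recompute) are hypothetical placeholders; a real scheduler would choose between recomputation and I/O per chunk by comparing their estimated latencies, and LLMS pipelines this at a finer granularity than shown here.

```python
import queue
import threading

def restore_context(chunks, load_from_flash, recompute, should_recompute):
    """Restore an app's KV chunks by overlapping flash I/O with recomputation.

    All callables are placeholders supplied by the caller in this sketch.
    """
    restored = [None] * len(chunks)
    io_queue = queue.Queue()

    def io_worker():
        # Drain the I/O queue: load swapped-out chunks from flash while the
        # main thread recomputes the chunks that are cheaper to regenerate.
        while True:
            item = io_queue.get()
            if item is None:
                break
            idx, chunk = item
            restored[idx] = load_from_flash(chunk)
            io_queue.task_done()

    worker = threading.Thread(target=io_worker, daemon=True)
    worker.start()

    for idx, chunk in enumerate(chunks):
        if should_recompute(chunk):
            # Recompute K/V for this chunk on the compute unit (placeholder call).
            restored[idx] = recompute(chunk)
        else:
            io_queue.put((idx, chunk))

    io_queue.join()        # wait for all pending flash loads to finish
    io_queue.put(None)     # then tell the worker to exit
    worker.join()
    return restored
```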