8 Jul 2024 | Daliang Xu, Hao Zhang, Liming Yang, Ruqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu
The paper presents mllm-NPU, a novel system for efficient on-device large language model (LLM) inference, focused on reducing inference latency and energy consumption. On-device LLMs are gaining traction for their privacy-preserving capabilities, but they often suffer from long inference latency, particularly in the prefill stage, which is crucial for tasks that require long context. mllm-NPU leverages the Neural Processing Unit (NPU) available in modern mobile devices to offload LLM inference, addressing the gap between LLM architecture and NPU design. The system re-constructs prompts and models at three levels: prompt level, tensor level, and block level, to maximize integer computation on the NPU while keeping essential floating-point operations on the CPU/GPU. Key techniques include chunk-sharing graphs, shadow outlier execution, and out-of-order subgraph execution. Experiments on various mobile-sized LLMs and benchmarks show that mllm-NPU achieves significant improvements in prefill speed and energy efficiency, with up to a 32.8× speedup in end-to-end real-world applications. This work paves the way for practical on-device LLMs, enabling applications such as UI task automation and personalized email auto-reply.
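To make the NPU/CPU split concrete, below is a minimal NumPy sketch of the general idea behind "shadow outlier execution": the bulk of a linear layer runs as low-bit integer math (the NPU-friendly path), while the few activation channels with outlier values are handled in floating point on the side and added back in. This is an illustrative sketch under assumed simplifications (per-tensor symmetric quantization, a magnitude threshold for outliers); the function names, threshold, and quantization scheme are not the paper's actual implementation or API.

```python
import numpy as np

def quantize_per_tensor(x, n_bits=8):
    """Symmetric per-tensor quantization to signed integers (illustrative)."""
    qmax = 2 ** (n_bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def linear_with_shadow_outliers(x, w, outlier_threshold=6.0):
    """Compute y = x @ w, routing outlier activation channels to a float side path.

    x: (tokens, in_features) float activations
    w: (in_features, out_features) float weights
    """
    # 1. Identify outlier input channels by activation magnitude (hypothetical rule).
    channel_max = np.abs(x).max(axis=0)
    outlier_cols = channel_max > outlier_threshold
    normal_cols = ~outlier_cols

    # 2. Integer path: quantize the "normal" channels and multiply in int32,
    #    standing in for the NPU's fixed-point matmul.
    xq, sx = quantize_per_tensor(x[:, normal_cols])
    wq, sw = quantize_per_tensor(w[normal_cols, :])
    y_int = (xq @ wq).astype(np.float32) * (sx * sw)

    # 3. Shadow float path: the few outlier channels stay in floating point,
    #    standing in for the CPU/GPU, and the partial results are summed.
    y_fp = x[:, outlier_cols] @ w[outlier_cols, :]
    return y_int + y_fp

# Tiny usage example with one artificially large activation channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)
x[:, 3] *= 20.0  # inject an outlier channel
w = rng.normal(size=(16, 8)).astype(np.float32)
print(np.max(np.abs(linear_with_shadow_outliers(x, w) - x @ w)))  # small quantization error
```

The point of the decomposition is that the integer path keeps almost all of the compute in a form the NPU executes efficiently, while the float side path is tiny (only the outlier channels), so offloading it to the CPU/GPU costs little.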