12 Jun 2024 | Zhenliang Xue*, Yixin Song*, Zeyu Mi*, Le Chen, Yubin Xia, and Haibo Chen
This paper introduces *PowerInfer-2*, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference, introduces segmented neuron caching, and proposes fine-grained neuron-cluster-level pipelining to minimize I/O overhead. The implementation and evaluation demonstrate that PowerInfer-2 can support a wide array of LLM models on two smartphones, achieving up to a 29.2× speed increase compared to state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone. For models that fit entirely within memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.
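To make the core idea concrete, the sketch below illustrates (in Python/NumPy) what "decomposing a matrix computation into neuron cluster computations" can look like for a sparsely activated FFN layer: a small predictor guesses which neurons will be active, the active neurons are grouped into fixed-size clusters, and only those clusters' weight rows are touched. This is a simplified illustration of the general activation-sparsity technique, not PowerInfer-2's actual implementation; the names (`CLUSTER_SIZE`, `predict_active_neurons`, `sparse_ffn_forward`) and the thresholding predictor are hypothetical.

```python
import numpy as np

# Hypothetical illustration of neuron-cluster-granularity computation
# (not PowerInfer-2's code): compute only the FFN neurons predicted to
# fire, processed cluster by cluster so each cluster could in principle
# be cached or fetched from storage independently.

CLUSTER_SIZE = 8  # assumed cluster granularity, for illustration only

def predict_active_neurons(x, predictor_W):
    """Toy activation predictor: score every neuron and keep positive scores.
    Real systems use small learned predictors; this is just a stand-in."""
    scores = predictor_W @ x
    return np.nonzero(scores > 0)[0]  # indices of predicted-active neurons

def cluster_indices(active_idx):
    """Group active neuron indices by their cluster id."""
    clusters = {}
    for i in active_idx:
        clusters.setdefault(int(i) // CLUSTER_SIZE, []).append(int(i))
    return clusters

def sparse_ffn_forward(x, W_up, W_down, predictor_W):
    """y = relu(W_up @ x)[active] @ W_down[active], touching only active clusters."""
    active = predict_active_neurons(x, predictor_W)
    y = np.zeros(W_down.shape[1])
    for _, idx in cluster_indices(active).items():
        rows = np.array(idx)
        h = np.maximum(W_up[rows] @ x, 0.0)  # only the active rows of W_up
        y += h @ W_down[rows]                # and the matching rows of W_down
    return y

# Minimal usage example with random weights.
d, n = 16, 64
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W_up = rng.standard_normal((n, d))
W_down = rng.standard_normal((n, d))
predictor_W = rng.standard_normal((n, d))
y = sparse_ffn_forward(x, W_up, W_down, predictor_W)
```

Because each cluster is processed independently, a runtime could overlap loading one cluster's weights from flash with computing another already-resident cluster, which is the intuition behind the neuron-cluster-level pipelining the abstract describes.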