PowerInfer-2: Fast Large Language Model Inference on a Smartphone


12 Jun 2024 | Zhenliang Xue*, Yixin Song*, Zeyu Mi**, Le Chen, Yubin Xia, and Haibo Chen
PowerInfer-2 is a framework for high-speed inference of large language models (LLMs) on smartphones, particularly effective for models that exceed the device's memory capacity. The key idea is to exploit the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron-cluster computations. PowerInfer-2 features a polymorphic neuron engine that adapts its computational strategy to the different stages of LLM inference. It also introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining to minimize I/O overhead. Implementation and evaluation show that PowerInfer-2 supports a wide range of LLMs on two smartphones, achieving up to a 29.2× speedup over state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model on a smartphone, at a generation rate of 11.68 tokens per second. For models that fit entirely within memory, PowerInfer-2 reduces memory usage by approximately 40% while maintaining inference speeds comparable to llama.cpp and MLC-LLM.

PowerInfer-2 is designed for smartphones and leverages the dynamic sparse activation inherent in LLM inference. It breaks coarse-grained matrix computations down into fine-grained neuron-cluster computations, using a polymorphic neuron engine tailored to the prefill and decoding stages. It also introduces a segmented cache and a pipelining technique to reduce I/O overhead. PowerInfer-2 is implemented by extending PowerInfer with 12K lines of code and is deployed on two smartphones. It supports various LLMs, including Llama-2, TurboSparse-Mistral, and TurboSparse-Mixtral. Evaluation shows that PowerInfer-2 achieves average speedups of 3.94× (up to 4.38×) over LLM in a Flash and 25.4× (up to 29.2×) over llama.cpp. It is the first system to support the TurboSparse-Mixtral-47B model on mobile platforms, reaching a generation speed of 11.68 tokens/s, 21.2× faster than llama.cpp. PowerInfer-2 also reduces memory usage during inference, saving nearly 40% of memory for smaller models while matching the inference speed of llama.cpp and MLC-LLM.
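To make the neuron-cluster idea concrete, the sketch below contrasts a dense matrix-vector product with a computation restricted to the clusters that an activation predictor marks as likely active. This is a minimal illustration of the technique described above; the type and function names (NeuronCluster, cluster_matvec, and so on) are assumptions for exposition, not PowerInfer-2's actual interfaces.

```cpp
// Illustrative sketch only: dense mat-vec vs. computation over a few
// predicted-active "neuron clusters". Names and shapes are hypothetical.
#include <cstddef>
#include <vector>

struct NeuronCluster {
    std::size_t first_row;   // first FFN neuron (weight-matrix row) in this cluster
    std::size_t rows;        // cluster granularity: a small group of neurons
    const float* weights;    // rows * dim weights, possibly fetched from flash on demand
};

// Dense baseline: every row of W participates, regardless of activation sparsity.
void dense_matvec(const float* W, const float* x, float* y,
                  std::size_t rows, std::size_t dim) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (std::size_t c = 0; c < dim; ++c) acc += W[r * dim + c] * x[c];
        y[r] = acc;
    }
}

// Cluster-level computation: a lightweight predictor (not shown) selects the
// clusters expected to activate for this token; only those clusters touch the
// input, so cold clusters need neither compute nor I/O.
void cluster_matvec(const std::vector<NeuronCluster>& active_clusters,
                    const float* x, float* y, std::size_t dim) {
    for (const NeuronCluster& nc : active_clusters) {
        for (std::size_t r = 0; r < nc.rows; ++r) {
            const float* w_row = nc.weights + r * dim;
            float acc = 0.f;
            for (std::size_t c = 0; c < dim; ++c) acc += w_row[c] * x[c];
            y[nc.first_row + r] = acc;   // rows outside active clusters stay zero
        }
    }
}
```

Because only the predicted-active clusters are touched, cold weights need neither be computed nor read from flash, and each small cluster can be scheduled independently on whichever processor is free, which is the property the polymorphic neuron engine and the I/O pipelining below build on.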
PowerInfer-2 optimizes LLM inference on smartphones by leveraging heterogeneous computing resources, including the CPU, GPU, and NPU. It uses a polymorphic neuron engine, a segmented cache, and pipelining to reduce I/O overhead and improve inference speed. The framework is implemented on Android and is portable to other operating systems. PowerInfer-2 supports a diverse array of LLMs of varying sizes, including Llama-2, TurboSparse-Mistral, and TurboSparse-Mixtral.
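The neuron-cluster-level pipelining can be pictured as a producer/consumer pattern: one thread feeds clusters into a queue, either from the in-memory cache or from flash, while another consumes them for computation, so flash latency overlaps with useful work. The following sketch is a minimal illustration under assumed data structures (ClusterQueue, ClusterTask, a stubbed flash read), not the framework's implementation.

```cpp
// Illustrative sketch only: overlapping flash I/O with computation at
// neuron-cluster granularity. All names here are hypothetical.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct ClusterTask { int cluster_id; std::vector<float> weights; };

class ClusterQueue {
    std::deque<ClusterTask> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(ClusterTask t) {
        { std::lock_guard<std::mutex> g(m_); q_.push_back(std::move(t)); }
        cv_.notify_one();
    }
    void finish() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
    }
    bool pop(ClusterTask& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = std::move(q_.front()); q_.pop_front(); return true;
    }
};

// I/O thread: cache-resident clusters are queued immediately; missing ones
// would be read from flash (stubbed here) while the compute thread keeps working.
void io_thread(const std::vector<int>& active_ids, ClusterQueue& q) {
    for (int id : active_ids) {
        ClusterTask t{id, std::vector<float>(1024, 0.f)};  // stub for a flash read
        q.push(std::move(t));
    }
    q.finish();
}

// Compute thread: consumes clusters as they arrive, so the flash latency of one
// cluster hides behind the multiplication of another.
void compute_thread(ClusterQueue& q) {
    ClusterTask t;
    while (q.pop(t)) {
        // multiply this cluster against the current activation vector (omitted)
    }
}

int main() {
    ClusterQueue q;
    std::vector<int> active = {3, 17, 42};  // hypothetical predicted-active clusters
    std::thread io(io_thread, std::cref(active), std::ref(q));
    std::thread compute(compute_thread, std::ref(q));
    io.join(); compute.join();
}
```

In PowerInfer-2 the in-memory and flash-resident portions are further managed by the segmented neuron cache described above; the sketch collapses that distinction into a single stub for brevity.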
[slides and audio] PowerInfer-2: Fast Large Language Model Inference on a Smartphone