8 Jul 2024 | Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu
mllm-NPU is the first LLM inference system that efficiently leverages on-device Neural Processing Units (NPUs) for large language model (LLM) prefilling. It addresses the key challenges of on-device LLM inference: long inference latency and the limited parallel computing capacity of mobile CPUs/GPUs. mllm-NPU is an algorithm-system co-design that bridges the semantic gap between the LLM architecture and contemporary NPU designs by re-constructing the prompt and the model at three levels: (1) at the prompt level, variable-length prompts are divided into multiple fixed-size chunks (sketched below) while data dependencies across chunks are preserved; (2) at the tensor level, significant outliers are identified and extracted to run on the CPU/GPU in parallel with minimal overhead; (3) at the block level, Transformer blocks are scheduled to the CPU/GPU or the NPU according to their hardware affinity and sensitivity to accuracy.

Compared to competitive baselines, mllm-NPU achieves 22.4× faster prefill speed and 30.7× energy savings on average, and up to 32.8× speedup in an end-to-end real-world application. For the first time, mllm-NPU achieves more than 1,000 tokens/sec prefilling for a billion-parameter model (Qwen1.5-1.8B), paving the way towards practical on-device LLMs.
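To make the prompt-level idea concrete, here is a minimal sketch, not the authors' implementation: it assumes a hypothetical fixed chunk size and a placeholder run_chunk_on_npu callback, and shows how a variable-length prompt can be padded to a multiple of the chunk size so that one fixed-shape NPU graph can be reused per chunk while the growing KV cache carries the data dependencies between chunks.

```python
# Illustrative sketch only; CHUNK_SIZE, split_prompt, and run_chunk_on_npu are
# hypothetical names, not APIs from the paper.
CHUNK_SIZE = 256  # assumed fixed chunk length


def split_prompt(token_ids, chunk_size=CHUNK_SIZE, pad_id=0):
    """Pad the prompt and split it into fixed-size chunks."""
    padded_len = ((len(token_ids) + chunk_size - 1) // chunk_size) * chunk_size
    padded = list(token_ids) + [pad_id] * (padded_len - len(token_ids))
    return [padded[i:i + chunk_size] for i in range(0, padded_len, chunk_size)]


def prefill(token_ids, run_chunk_on_npu):
    """Prefill chunk by chunk; each chunk attends to the KV cache of earlier chunks."""
    kv_cache = []  # grows as chunks are processed, preserving data dependencies
    for chunk in split_prompt(token_ids):
        # run_chunk_on_npu stands in for executing the pre-built fixed-shape
        # graph on the NPU; it returns this chunk's key/value cache entries.
        kv_cache.extend(run_chunk_on_npu(chunk, kv_cache))
    return kv_cache
```

Fixing the chunk shape is presumably what lets a single prepared NPU graph be shared across chunks regardless of prompt length, rather than rebuilding a graph for every prompt size.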
The key idea is to maximize prefill execution on the mobile NPU, whose strength is integer computation, while keeping the essential floating-point operations on the CPU/GPU to maintain accuracy. To overcome these challenges and make NPU offloading efficient, mllm-NPU re-constructs the prompt and model at the three levels described above; the corresponding novel techniques are chunk-sharing graphs (prompt level), shadow outlier execution (tensor level), and out-of-order subgraph execution (block level).

Implementation and evaluations show that mllm-NPU significantly and consistently outperforms all baselines in prefill latency and energy consumption while preserving inference accuracy. With a prompt length of 1,024, it is 7.3×-18.4× faster than the CPU baselines and 1.3×-43.6× faster than the GPU baselines, and it reduces energy consumption by 1.9×-59.5×. mllm-NPU is the first system to achieve more than 1,000 tokens/sec of prefill speed on COTS mobile devices for billion-parameter LLMs. In end-to-end real-world applications, it delivers up to 32.8× speedup.
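As a rough illustration of the tensor-level split, the sketch below is not the paper's shadow-outlier-execution code: it zeroes out high-magnitude activation channels before an int8 matrix multiply that stands in for the NPU integer path, runs only those outlier channels through a float matrix multiply that stands in for the CPU/GPU path, and sums the two partial results. The threshold, function names, and NumPy stand-ins are all assumptions, and unlike mllm-NPU the two paths here run sequentially rather than in parallel.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (simplified for illustration)."""
    scale = np.abs(x).max() / 127.0 + 1e-8
    return np.round(x / scale).astype(np.int8), scale


def matmul_with_shadow_outliers(x, w, outlier_threshold=6.0):
    """Split activations into a quantized main part and float outlier channels."""
    # Channels whose peak magnitude exceeds the threshold are treated as outliers.
    outlier_cols = np.where(np.abs(x).max(axis=0) > outlier_threshold)[0]

    # Integer path (stand-in for the NPU): outlier channels are zeroed out,
    # then activations and weights are quantized and multiplied in int32.
    x_main = x.copy()
    x_main[:, outlier_cols] = 0.0
    xq, sx = quantize_int8(x_main)
    wq, sw = quantize_int8(w)
    y_int = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

    # Float "shadow" path (stand-in for the CPU/GPU): only the few outlier
    # channels are multiplied in full precision.
    x_out = np.zeros_like(x)
    x_out[:, outlier_cols] = x[:, outlier_cols]
    y_float = x_out @ w

    return y_int + y_float  # merge the two partial results


# Tiny usage example with random data.
x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)
y = matmul_with_shadow_outliers(x, w)
```

Because only a handful of channels take the float path, almost all of the arithmetic stays in the integer domain where the NPU is fast, which is the accuracy/throughput trade-off the abstract describes.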