April 27-May 1, 2024 | Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park
The paper introduces NeuPIMs, a heterogeneous acceleration system designed to improve the inference performance of batched Large Language Models (LLMs). Transformer-based LLM inference mixes two kinds of computation: General Matrix Multiplications (GEMMs) and General Matrix-Vector multiplications (GEMVs), as illustrated in the sketch after the list below. Neural Processing Units (NPUs) excel at GEMMs but are less efficient for GEMVs, while Processing-in-Memory (PIM) technology is optimized for GEMVs but struggles with GEMMs. NeuPIMs addresses this mismatch by integrating a conventional GEMM-focused NPU with a GEMV-optimized PIM system. The main contributions of NeuPIMs include:
1. **Dual Row Buffers**: A second row buffer is added per memory bank so that ordinary memory reads/writes and PIM commands can be issued concurrently, allowing the NPU and PIM to work simultaneously.
2. **Sub-Batch Interleaving**: A runtime technique that partitions a batch into two sub-batches so that the GEMM and GEMV stages of different sub-batches execute independently on the NPU and PIM, maximizing parallelism and resource utilization (see the scheduling sketch after this list).
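To make the GEMM/GEMV split concrete, here is a minimal NumPy sketch of why batched LLM inference produces both kinds of operation. The shapes and variable names are illustrative assumptions, not values from the paper:

```python
# Why batched LLM inference mixes GEMMs and GEMVs (illustrative shapes).
import numpy as np

batch, d_model, seq_len = 8, 4096, 512

# QKV/FFN projections: activations for the whole batch multiply a shared
# weight matrix -> one large GEMM, which suits a systolic-array NPU.
x = np.random.randn(batch, d_model)
w_qkv = np.random.randn(d_model, 3 * d_model)
qkv = x @ w_qkv                      # GEMM: (batch, d) x (d, 3d)

# Decode-phase attention: each request has its own KV cache, so the
# per-request score computation is a matrix-vector product (GEMV) that
# cannot be merged across the batch -- this is the PIM-friendly part.
q = np.random.randn(batch, d_model)
kv_cache = np.random.randn(batch, seq_len, d_model)  # one K cache per request
scores = np.einsum('bsd,bd->bs', kv_cache, q)        # a batch of GEMVs
```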
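Building on that split, the following is a hypothetical Python sketch of the sub-batch interleaving idea: one sub-batch runs its GEMM-heavy stage on the NPU while the other runs its GEMV-heavy attention on the PIM, then the two swap. The function names and two-phase structure are assumptions for illustration, not the paper's implementation:

```python
# Hypothetical scheduling sketch of sub-batch interleaving (not the
# paper's actual runtime): keep the NPU and PIM busy at the same time.
from concurrent.futures import ThreadPoolExecutor

def npu_gemm_stage(sub_batch):
    """Placeholder for QKV/FFN projection GEMMs executed on the NPU."""
    return [f"gemm({req})" for req in sub_batch]

def pim_gemv_stage(sub_batch):
    """Placeholder for KV-cache attention GEMVs executed on the PIM."""
    return [f"gemv({req})" for req in sub_batch]

def interleaved_step(batch):
    # Partition the batch into two sub-batches.
    half = len(batch) // 2
    a, b = batch[:half], batch[half:]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Phase 1: NPU runs GEMMs for A while PIM runs GEMVs for B.
        f1 = pool.submit(npu_gemm_stage, a)
        f2 = pool.submit(pim_gemv_stage, b)
        f1.result(); f2.result()
        # Phase 2: the sub-batches swap devices, so neither unit idles.
        f3 = pool.submit(pim_gemv_stage, a)
        f4 = pool.submit(npu_gemm_stage, b)
        f3.result(); f4.result()

interleaved_step(list(range(8)))
```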
The evaluation demonstrates that NeuPIMs achieves significant throughput improvements over GPU-only, NPU-only, and naive NPU+PIM integrated systems: 3×, 2.4×, and 1.6× higher throughput than the respective baselines. It also sustains higher resource utilization, reaching 65% NPU and 26% PIM utilization versus 28% and 17% for the baselines. The effectiveness of NeuPIMs is validated on real-world datasets and models, highlighting its potential for practical deployment in large-scale inference serving systems.