April 27-May 1, 2024 | Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park
NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
Modern transformer-based Large Language Models (LLMs) are composed of decoder blocks with three key components: QKV generation, multi-head attention (MHA), and feed-forward networks (FFNs). In batched processing, QKV generation and the FFNs involve compute-intensive matrix-matrix multiplications (GEMM), while MHA requires bandwidth-heavy matrix-vector multiplications (GEMV); batching cannot turn MHA into GEMM because each request attends over its own distinct KV cache. NPUs are efficient for GEMM but less so for GEMV, while processing-in-memory (PIM) is optimized for GEMV but lacks GEMM capability.
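To make the GEMM/GEMV split concrete, here is a minimal NumPy sketch (my illustration, not code from the paper; all shapes and names are toy assumptions): the shared QKV weights let a whole batch of decode-step activations be multiplied as one GEMM, while attention decomposes into per-request GEMVs over each request's private KV cache.

```python
import numpy as np

B, d, T = 8, 64, 128               # toy batch size, model dim, cached tokens

x = np.random.randn(B, d)          # one decode-step activation per request
W_qkv = np.random.randn(d, 3 * d)  # QKV projection weights, shared by all requests

# QKV generation: the weights are shared across the batch, so the B
# per-request vectors stack into a single compute-dense GEMM.
qkv = x @ W_qkv                    # (B, d) @ (d, 3d) -> GEMM

# MHA at decode time: each request attends over its own KV cache, so the
# work does not batch -- every request is a pair of GEMVs that stream its
# entire cache from memory, making the stage bandwidth-bound.
q = qkv[:, :d]
for b in range(B):
    K_b = np.random.randn(T, d)          # stand-in for request b's key cache
    V_b = np.random.randn(T, d)          # stand-in for request b's value cache
    scores = (K_b @ q[b]) / np.sqrt(d)   # (T, d) @ (d,) -> GEMV
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out_b = V_b.T @ probs                # (d, T) @ (T,) -> GEMV
```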
NeuPIMs is a heterogeneous acceleration system that pairs a GEMM-focused NPU with GEMV-optimized PIM devices to raise batched LLM inference throughput: the NPU handles QKV generation and the FFNs, while the PIM handles MHA. The main challenges are enabling the two devices, which share the same memory, to operate concurrently, and managing the GEMM-GEMV dependencies within each decoder block. NeuPIMs addresses the first with a modified PIM bank architecture that adds dual row buffers, allowing conventional memory accesses and PIM commands to proceed in parallel, and the second with a sub-batch interleaving technique that splits a batch into two independent sub-batches whose GEMM and GEMV phases are pipelined against each other. Evaluated on GPT3 variants with real-world datasets, NeuPIMs achieves 3×, 2.4×, and 1.6× throughput improvements over GPU-only, NPU-only, and naive NPU+PIM baselines, respectively, with correspondingly higher resource utilization.
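The dual-row-buffer mechanism can be illustrated with a toy bank model. This is a sketch under my own simplifying assumptions (the class, method names, and timing-free behavior are invented for illustration, not the paper's microarchitecture): one row buffer serves ordinary memory accesses on behalf of the NPU while a second row buffer feeds the in-bank MAC logic, so a PIM GEMV step does not evict the row that concurrent memory traffic is using.

```python
# Toy model of a PIM bank with dual row buffers (illustrative only).
class DualBufferBank:
    def __init__(self, rows, cols):
        self.array = [[0.0] * cols for _ in range(rows)]
        self.mem_buf = None   # row buffer for conventional memory accesses
        self.pim_buf = None   # row buffer dedicated to PIM commands

    def mem_read(self, row, col):
        # Conventional path: activate the row into the memory row buffer.
        self.mem_buf = self.array[row]
        return self.mem_buf[col]

    def pim_mac(self, row, vector):
        # PIM path: activate into the separate PIM row buffer and perform
        # a dot product in place; mem_buf (and the NPU's outstanding
        # traffic) is left untouched.
        self.pim_buf = self.array[row]
        return sum(w * x for w, x in zip(self.pim_buf, vector))

bank = DualBufferBank(rows=4, cols=8)
bank.array[0] = [1.0] * 8            # pretend row 0 holds one weight row
acc = bank.pim_mac(0, [0.5] * 8)     # in-bank GEMV step: acc == 4.0
val = bank.mem_read(1, 3)            # concurrent-path read; pim_buf intact
```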
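Sub-batch interleaving can likewise be sketched as a two-device pipeline. Assuming, purely for illustration, that phases take equal time slots and that each sub-batch alternates an NPU GEMM phase (QKV generation plus FFN) with a PIM GEMV phase (MHA), offsetting the two sub-batches by one phase keeps both devices busy in every slot:

```python
# Timeline sketch of NeuPIMs-style sub-batch interleaving (my simplification;
# stage granularity and durations are illustrative, not the paper's schedule).

def phases(num_blocks):
    """GEMM/GEMV phase sequence for one sub-batch across decoder blocks."""
    seq = []
    for blk in range(num_blocks):
        seq.append(f"NPU: GEMM (QKV+FFN, block {blk})")
        seq.append(f"PIM: GEMV (MHA, block {blk})")
    return seq

def interleave(num_blocks=3):
    a, b = phases(num_blocks), phases(num_blocks)
    b = ["idle"] + b  # offset sub-batch 1 by one slot
    a = a + ["idle"]
    # Whenever sub-batch 0 occupies the NPU, sub-batch 1 occupies the PIM,
    # and vice versa, so neither device sits idle mid-schedule.
    for t, (pa, pb) in enumerate(zip(a, b)):
        print(f"t={t}: sub-batch0 -> {pa:32s} sub-batch1 -> {pb}")

interleave()
```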