[slides and audio] FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

FlightLLM is a novel FPGA-based accelerator designed to efficiently process large language models (LLMs) by leveraging FPGA-specific resources such as DSP48 and heterogeneous memory hierarchies. The paper addresses the challenges of low computational efficiency, underutilized memory bandwidth, and large compilation overheads in current GPU and transformer-based accelerators. Key contributions include:

1. **Configurable Sparse DSP Chain**: A flexible cascaded DSP48 architecture supports different sparsity patterns, improving computation efficiency by 1.6× with block-wise and N:M sparsity.
2. **Always-On-Chip Decode Scheme**: This scheme boosts memory bandwidth by 35.6% to 65.9% by keeping activations in on-chip memory during the decode stage with mixed-precision support.
3. **Length Adaptive Compilation Method**: This method reduces instruction storage overhead by 500×, enabling real-world LLMs to be deployed on FPGAs.

FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency compared to commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) under batch size one. It also outperforms the NVIDIA A100 GPU with 1.2× higher throughput using the latest Versal VHK158 FPGA. The paper evaluates FlightLLM on Xilinx Alveo U280 and Versal VHK158 FPGAs.
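For readers unfamiliar with the N:M sparsity mentioned in the first contribution, the following is a minimal software sketch of the general idea: in every group of M consecutive weights, at most N are kept non-zero, so a multiply-accumulate chain can skip the rest. It is not FlightLLM's hardware design. The function names `prune_n_m` and `sparse_dot`, the magnitude-based selection rule, and the sample values are all assumptions made for illustration.

```python
from typing import List


def prune_n_m(weights: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude values in each group of m; zero the rest.

    Hypothetical helper: magnitude-based selection is one common pruning
    rule, assumed here for illustration only.
    """
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest-magnitude entries within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def sparse_dot(pruned: List[float], activations: List[float]) -> float:
    """Dot product that skips zeroed weights (the multiplications saved)."""
    return sum(w * a for w, a in zip(pruned, activations) if w != 0.0)


# Illustrative values only, not taken from the paper.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

pruned = prune_n_m(weights, n=2, m=4)  # 2:4 sparsity
result = sparse_dot(pruned, activations)
print(pruned)
print(result)
```

With a 2:4 pattern, half of the multiplications in each group are skipped. Because the pattern is regular, the position of each kept weight within its group can be stored as a small index, which is what makes this kind of sparsity friendlier to hardware than unstructured pruning.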

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

March 3–5, 2024 | Shulin Zeng*, Jun Liu*, Guohao Dai†, Xinshao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang†
