FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

March 3-5, 2024 | Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang
FlightLLM is an efficient large language model (LLM) inference system that leverages FPGA-specific resources to address the computational and memory challenges of LLMs. GPUs and conventional transformer accelerators struggle with compressed LLMs due to low computational efficiency, underutilized memory bandwidth, and high compilation overhead. FlightLLM overcomes these challenges with a complete mapping flow on FPGAs. Its key innovations are a configurable sparse DSP chain for efficient computation, an always-on-chip decode scheme that boosts effective memory bandwidth with mixed-precision support, and a length-adaptive compilation method that reduces compilation overhead.

Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency than commercial GPUs such as the NVIDIA V100S on modern LLMs like LLaMA2-7B. On the Versal VHK158 FPGA, it also delivers 1.2× higher throughput than the NVIDIA A100 GPU.

FlightLLM's hardware architecture comprises a task scheduler, a memory controller, and multiple computing cores. A unified Matrix Processing Engine (MPE) handles dense and sparse matrix multiplications, while a Special Function Unit (SFU) handles miscellaneous operations such as softmax and layer normalization. The system also supports mixed-precision quantization and optimizes memory access with a hybrid HBM+DDR memory system.

On the software side, FlightLLM provides an instruction set architecture (ISA) for efficient LLM inference, the length-adaptive compilation method to reduce instruction storage overhead, and an analytical model for RTL generation. Evaluated on models such as LLaMA2-7B and OPT-6.7B, FlightLLM demonstrates superior performance, energy efficiency, and cost efficiency compared with GPUs and other accelerators.
FlightLLM achieves significant improvements in latency, throughput, and energy efficiency, making it a promising solution for efficient LLM inference on FPGAs.