FlightLLM is a novel FPGA-based accelerator designed to efficiently process large language models (LLMs) by leveraging FPGA-specific resources such as DSP48 and heterogeneous memory hierarchies. The paper addresses the challenges of low computational efficiency, underutilized memory bandwidth, and large compilation overheads in current GPU and transformer-based accelerators. Key contributions include:
1. **Configurable Sparse DSP Chain**: A flexible cascaded DSP48 architecture supports different sparsity patterns, improving computation efficiency by 1.6× with block-wise and N:M sparsity.
2. **Always-On-Chip Decode Scheme**: This scheme raises memory bandwidth utilization from 35.6% to 65.9% by keeping activations in on-chip memory during the decode stage, combined with mixed-precision support.
3. **Length Adaptive Compilation Method**: This method reduces instruction storage overhead by 500×, enabling real-world LLMs to be deployed on FPGAs.
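
To make the N:M sparsity pattern mentioned in the first contribution concrete, here is a minimal sketch in plain Python. It assumes the common magnitude-based definition (in every group of M consecutive weights, keep the N with the largest absolute value and zero the rest). The function name `prune_n_m` and the sample weights are hypothetical and used only for illustration; this is not FlightLLM's pruning code, and it does not model how the DSP chain consumes the pattern.

```python
def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive
    weights and zero the others (illustrative N:M sparsity, not FlightLLM code)."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row with two groups of four values.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
print(prune_n_m(row, n=2, m=4))
```

Because every group of M values holds at most N non-zeros, hardware can skip a fixed fraction of multiplications per group, which is the kind of regularity a sparse compute unit can exploit.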

FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency than commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) at batch size one. With the latest Versal VHK158 FPGA, it also outperforms the NVIDIA A100 GPU, delivering 1.2× higher throughput. The paper evaluates FlightLLM on Xilinx Alveo U280 and Versal VHK158 FPGAs.