16 Jun 2024 | Jungi Lee*, Wonbeom Lee*, Jaewoong Sim
Tender is an algorithm-hardware co-design solution for efficient low-precision deployment of large language model (LLM) inference. The approach leverages tensor decomposition and runtime requantization to reduce quantization error and improve inference performance. By decomposing activation tensors into subtensors along the feature (channel) dimension, Tender isolates outlier channels so that a different scale factor can be applied to each subtensor. Because the scale factors are set with power-of-two relationships, rescaling during matrix multiplication reduces to simple shifter logic in the tensor compute units, minimizing the need for explicit requantization.
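To make the decomposition concrete, here is a minimal NumPy sketch (an illustration under assumptions, not the authors' code) of grouping activation channels so that each group's scale factor is a power-of-two multiple of a base scale. The function names, the group count, and the choice of the smallest channel maximum as the scale anchor are all assumptions for illustration:

```python
import numpy as np

def decompose_channels(x, n_groups=4, n_bits=8):
    """Group channels of activation x (tokens x channels) by magnitude.

    Each group k covers channels whose max magnitude fits within
    base * 2**k, so group scales form a power-of-two ladder.
    """
    qmax = 2 ** (n_bits - 1) - 1
    ch_max = np.abs(x).max(axis=0)        # per-channel max magnitude
    base = ch_max.min() + 1e-8            # scale anchor (assumption)
    shifts = np.clip(np.ceil(np.log2(ch_max / base)),
                     0, n_groups - 1).astype(int)
    scales = (base / qmax) * (2.0 ** np.arange(n_groups))
    return shifts, scales

def quantize(x, shifts, scales):
    """Quantize each channel with its group's scale."""
    qmax = 2 ** 7 - 1
    s = scales[shifts]                    # per-channel scale
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8)
    return q, s
```

Since adjacent group scales differ only by a factor of two, a value quantized in one group can be moved to another group's scale with a single bit shift, which is what makes the shifter-based rescaling in the compute units possible.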
Tender achieves higher accuracy and inference performance than state-of-the-art methods while being significantly less intrusive to existing accelerators. The algorithm is implemented in software and can be further accelerated with a custom accelerator design that supports implicit requantization. Evaluation on three representative LLMs shows that Tender achieves better model performance than the state of the art in INT8 quantization and outperforms other outlier-aware post-training quantization (PTQ) techniques in INT4 quantization. The Tender hardware achieves an average speedup of up to 2.63× over outlier-aware accelerators.
Tender's key contributions include a "power of 2" channel decomposition rule that reduces quantization error by aligning with the channel-wise outlier structure of LLM activation tensors, and a Tender accelerator design that enables implicit runtime requantization with minimal hardware extensions. The algorithm is implemented in software and can be extended to other bit widths with minimal modifications. The Tender hardware architecture comprises a multi-scale systolic array (MSA) and a vector processing unit (VPU), which are optimized for efficient matrix multiplication and rescaling. Evaluation shows that Tender achieves high performance and accuracy without the need for mixed-precision compute units or custom datatypes, making it a flexible and practical solution for LLM inference.
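The implicit requantization idea can be sketched in software as follows. This is an assumption-laden functional model, not the Tender RTL: partial products from activation channels quantized at different power-of-two scales are aligned to a common scale with integer left shifts inside the accumulator, so no floating-point rescale is needed until the final output:

```python
import numpy as np

def matmul_with_shift_requant(q_act, shifts, q_wgt, base_scale, w_scale):
    """Integer matmul where activation channel k carries scale
    base_scale * 2**shifts[k]; each partial product is shifted into
    the common scale before accumulation (a shift replaces a multiply).
    """
    acc = np.zeros((q_act.shape[0], q_wgt.shape[1]), dtype=np.int64)
    for k in range(q_act.shape[1]):           # reduce over channels
        partial = q_act[:, [k]].astype(np.int64) @ q_wgt[[k], :].astype(np.int64)
        acc += partial << int(shifts[k])      # align scale by shifting
    # One floating-point rescale at the very end, outside the hot loop.
    return acc.astype(np.float64) * base_scale * w_scale
```

Because the per-channel scales are exact power-of-two multiples of the base scale, this shift-and-accumulate result is bit-identical to dequantizing both operands and multiplying in floating point, which is why only lightweight shifter logic needs to be added to a conventional systolic array.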