16 Jun 2024 | Jungi Lee*, Wonbeom Lee*, Jaewoong Sim
Tender is an algorithm-hardware co-design solution for efficient low-precision deployment of large language model (LLM) inference. The approach leverages tensor decomposition and runtime requantization to reduce quantization error and improve inference performance. By decomposing activation tensors into subtensors along the feature (channel) dimension, Tender isolates outlier channels so that a different scale factor can be applied to each subtensor. Because the scale factors are set with power-of-two relationships, rescaling during matrix multiplication reduces to simple shifter logic in the tensor compute units, minimizing the need for explicit requantization.
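To make the decomposition concrete, here is a minimal NumPy sketch (an illustration under assumptions, not the authors' code) of grouping activation channels so that each group's scale factor is a power-of-two multiple of a base scale. The function names, the group count, and the choice of the smallest channel maximum as the scale anchor are all assumptions for illustration:

```python
import numpy as np

def decompose_channels(x, n_groups=4, n_bits=8):
    """Group channels of activation x (tokens x channels) by magnitude.

    Each group k covers channels whose max magnitude fits within
    base * 2**k, so group scales form a power-of-two ladder.
    """
    qmax = 2 ** (n_bits - 1) - 1
    ch_max = np.abs(x).max(axis=0)        # per-channel max magnitude
    base = ch_max.min() + 1e-8            # scale anchor (assumption)
    shifts = np.clip(np.ceil(np.log2(ch_max / base)),
                     0, n_groups - 1).astype(int)
    scales = (base / qmax) * (2.0 ** np.arange(n_groups))
    return shifts, scales

def quantize(x, shifts, scales):
    """Quantize each channel with its group's scale."""
    qmax = 2 ** 7 - 1
    s = scales[shifts]                    # per-channel scale
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8)
    return q, s
```

Since adjacent group scales differ only by a factor of two, a value quantized in one group can be moved to another group's scale with a single bit shift, which is what makes the shifter-based rescaling in the compute units possible.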
Tender achieves higher accuracy and inference performance than state-of-the-art methods while being significantly less intrusive to existing accelerators. The algorithm is implemented in software and can be further accelerated with a custom accelerator design that supports implicit requantization. Evaluation on three representative LLMs shows that Tender achieves better model performance than the state of the art in INT8 quantization and outperforms other outlier-aware post-training quantization (PTQ) techniques in INT4 quantization. The Tender hardware achieves an average speedup of up to 2.63× over outlier-aware accelerators.
Tender's key contributions include a "power of 2" channel decomposition rule that reduces quantization error by aligning with the channel-wise outlier structure of LLM activation tensors, and a Tender accelerator design that enables implicit runtime requantization with minimal hardware extensions. The algorithm is implemented in software and can be extended to other bit widths with minimal modifications. The Tender hardware architecture comprises a multi-scale systolic array (MSA) and a vector processing unit (VPU), which are optimized for efficient matrix multiplication and rescaling. Evaluation shows that Tender achieves high performance and accuracy without the need for mixed-precision compute units or custom datatypes, making it a flexible and practical solution for LLM inference.
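The implicit requantization idea can be sketched in software as follows. This is an assumption-laden functional model, not the Tender RTL: partial products from activation channels quantized at different power-of-two scales are aligned to a common scale with integer left shifts inside the accumulator, so no floating-point rescale is needed until the final output:

```python
import numpy as np

def matmul_with_shift_requant(q_act, shifts, q_wgt, base_scale, w_scale):
    """Integer matmul where activation channel k carries scale
    base_scale * 2**shifts[k]; each partial product is shifted into
    the common scale before accumulation (a shift replaces a multiply).
    """
    acc = np.zeros((q_act.shape[0], q_wgt.shape[1]), dtype=np.int64)
    for k in range(q_act.shape[1]):           # reduce over channels
        partial = q_act[:, [k]].astype(np.int64) @ q_wgt[[k], :].astype(np.int64)
        acc += partial << int(shifts[k])      # align scale by shifting
    # One floating-point rescale at the very end, outside the hot loop.
    return acc.astype(np.float64) * base_scale * w_scale
```

Because the per-channel scales are exact power-of-two multiples of the base scale, this shift-and-accumulate result is bit-identical to dequantizing both operands and multiplying in floating point, which is why only lightweight shifter logic needs to be added to a conventional systolic array.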