CLLMs: Consistency Large Language Models


2024 | Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang
CLLMs: Consistency Large Language Models aim to improve the efficiency of large language model (LLM) inference by refining the target LLM to consistently predict the fixed point of Jacobi decoding given any intermediate state as input. In Jacobi decoding, a block of future tokens is iteratively refined in parallel until it stops changing; this fixed point matches the output of greedy autoregressive decoding. Training the LLM to map any point on the Jacobi trajectory directly to the fixed point, analogous to consistency models in diffusion, enables fast convergence from any state and therefore faster generation while preserving quality on both domain-specific and open-domain benchmarks. On benchmarks such as GSM8K, CodeSearchNet Python, and Spider, CLLMs achieve a 2.4× to 3.4× improvement in generation speed with minimal performance degradation.
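To make the decoding loop concrete, here is a minimal, illustrative sketch of Jacobi decoding with a Hugging Face causal LM. The model name, block size, and block initialization are placeholder assumptions rather than the paper's implementation; the point is only that an n-token block is refined in parallel until it stops changing, and that this fixed point matches greedy autoregressive output.

```python
# Illustrative Jacobi decoding sketch (not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; CLLMs fine-tune larger target LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def jacobi_decode_block(prompt_ids: torch.Tensor, n_tokens: int, max_iters: int = 64) -> torch.Tensor:
    """Refine an n-token block in parallel until it reaches a fixed point."""
    fill_id = tok.pad_token_id or tok.eos_token_id  # arbitrary initial guess
    block = torch.full((1, n_tokens), fill_id, dtype=torch.long)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, block], dim=1)
        logits = model(seq).logits
        # Logits at position i predict token i+1, so the block's predictions
        # start one position before the block itself.
        start = prompt_ids.shape[1] - 1
        new_block = logits[:, start:start + n_tokens, :].argmax(dim=-1)
        if torch.equal(new_block, block):  # fixed point reached
            break
        block = new_block
    return block

prompt_ids = tok("Question: what is 7 * 8?\nAnswer:", return_tensors="pt").input_ids
print(tok.decode(jacobi_decode_block(prompt_ids, n_tokens=16)[0]))
```

A CLLM is fine-tuned so that this loop exits after far fewer iterations than the block length, whereas an ordinary autoregressive model typically corrects only about one token per iteration.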
CLLMs are trained with two loss terms: a consistency loss that teaches the model to map intermediate Jacobi states to the fixed point, and a standard autoregressive (AR) loss that maintains generation quality. Fine-tuning cost is moderate, and the approach can be combined with other techniques for efficient LLM inference. The acceleration comes from fast-forwarding, where several consecutive tokens are predicted correctly in a single iteration, and from stationary tokens, which remain unchanged in later iterations even when earlier tokens are still wrong; both shorten the Jacobi trajectory.

Because CLLMs require no additional model components or auxiliary modules, they are more memory-efficient than approaches such as speculative decoding and Medusa, which they outperform in speed and efficiency while adding minimal memory overhead. The method is effective across various tasks and domains, demonstrating its potential for improving LLM inference efficiency.
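As a closing illustration, here is a rough sketch of the two-term training objective described above. It is a simplification under assumptions: the function and variable names are hypothetical, and the consistency term is written as token-level cross-entropy against the fixed-point tokens, whereas the paper defines it as a distance between the model's output distributions on the intermediate state and on the fixed point.

```python
# Simplified sketch of a CLLM-style objective: consistency loss + AR loss.
import torch
import torch.nn.functional as F

def cllm_style_loss(model, prompt_ids, jacobi_state, fixed_point, gt_ids, ar_weight=1.0):
    # Consistency term: conditioned on (prompt + intermediate Jacobi state),
    # the model should predict the fixed-point tokens at the block positions.
    seq = torch.cat([prompt_ids, jacobi_state], dim=1)
    logits = model(seq).logits
    start = prompt_ids.shape[1] - 1
    n = jacobi_state.shape[1]
    block_logits = logits[:, start:start + n, :]
    consistency = F.cross_entropy(
        block_logits.reshape(-1, block_logits.size(-1)), fixed_point.reshape(-1)
    )

    # AR term: ordinary next-token prediction on ground-truth text, which
    # keeps generation quality from degrading during fine-tuning.
    ar_logits = model(gt_ids).logits[:, :-1, :]
    ar = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)), gt_ids[:, 1:].reshape(-1)
    )
    return consistency + ar_weight * ar
```

In practice the intermediate states and fixed points would be collected by running Jacobi decoding with the target model over the training prompts; the sketch assumes they are already available as token tensors.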