13 Jun 2024 | Siqi Kou*, Lanxiang Hu*, Zhezhi He, Zhijie Deng, Hao Zhang
The paper introduces Consistency Large Language Models (CLLMs), a novel approach to accelerating the inference of large language models (LLMs) via Jacobi decoding. Jacobi decoding, inspired by fixed-point iteration methods, breaks the sequential nature of LLM decoding by computing a block of tokens in parallel, but in practice it yields only marginal speedup over traditional autoregressive (AR) decoding because a standard LLM rarely gets more than one new token correct per iteration. To address this, CLLMs fine-tune the target LLM to map any intermediate state on a Jacobi trajectory directly to its fixed point, so the iteration converges in far fewer steps. Evaluated on a range of benchmarks, the method delivers a 2.4× to 3.4× improvement in generation speed while preserving output quality. Key contributions include the development of CLLMs, the identification of the fast-forwarding and stationary-token phenomena, and the demonstration of CLLMs' efficacy on both domain-specific and open-domain benchmarks. The paper also covers related work, methodology, experiments, and limitations, highlighting the adaptability and memory efficiency of CLLMs.
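To make the Jacobi decoding loop described above concrete, here is a minimal sketch of greedy Jacobi decoding for a causal LM. It is an illustration under stated assumptions, not the paper's implementation: `jacobi_decode`, `n_new_tokens`, and `pad_id` are hypothetical names, and `model` is assumed to be any callable mapping a `(1, seq_len)` tensor of token ids to logits of shape `(1, seq_len, vocab_size)` (e.g. a Hugging Face model wrapped as `lambda x: hf_model(x).logits`).

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prompt_ids, n_new_tokens, max_iters=64, pad_id=0):
    """Sketch of Jacobi (fixed-point) decoding for a causal LM.

    Starts from a guessed block of n_new_tokens and repeatedly replaces the
    whole block with the model's greedy predictions until the block stops
    changing (a fixed point).
    """
    # Initialize the n-token block with an arbitrary guess (here: pad tokens).
    block = torch.full((1, n_new_tokens), pad_id, dtype=torch.long)

    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, block], dim=1)   # (1, prompt_len + n)
        logits = model(seq)                           # (1, prompt_len + n, vocab)

        # Greedy prediction for every block position, computed in one parallel
        # forward pass: the token at block position i is predicted from the
        # logits at the position immediately before it (causal shift by one).
        preds = logits[:, prompt_ids.shape[1] - 1 : -1, :].argmax(dim=-1)

        if torch.equal(preds, block):   # fixed point reached
            break
        block = preds                   # Jacobi update: refresh all tokens at once

    return torch.cat([prompt_ids, block], dim=1)
```

The fixed point of this iteration is exactly the greedy AR output for the block, so quality is unchanged; the speedup depends entirely on how few iterations are needed. The paper's contribution is training the model (into a CLLM) so that this loop collapses many positions per iteration instead of advancing roughly one token at a time.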