24 Jun 2024 | Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
This survey explores inference-time algorithms for large language models (LLMs), focusing on three areas: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling one token at a time or constructing a token-level search space. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation.
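The token-at-a-time loop described above can be sketched in a few lines. This is a minimal illustration, not an algorithm from the survey itself: `next_token_probs` is a hypothetical toy stand-in for a language model's next-token distribution, and `greedy_decode` shows the simplest case of picking the most probable token at each step.

```python
# Toy stand-in for a language model: maps a context (sequence of tokens)
# to a next-token distribution. A real LLM would return logits over a
# large vocabulary here; this table is purely illustrative.
def next_token_probs(context):
    table = {
        ("<bos>",): {"the": 0.6, "a": 0.4},
        ("<bos>", "the"): {"cat": 0.7, "dog": 0.3},
        ("<bos>", "the", "cat"): {"<eos>": 1.0},
        ("<bos>", "the", "dog"): {"<eos>": 1.0},
    }
    # Unknown contexts fall back to ending the sequence.
    return table.get(tuple(context), {"<eos>": 1.0})

def greedy_decode(max_len=10):
    """Token-level generation: pick the most probable token each step."""
    context = ["<bos>"]
    while len(context) < max_len:
        probs = next_token_probs(context)
        token = max(probs, key=probs.get)
        if token == "<eos>":
            break
        context.append(token)
    return context[1:]  # drop the <bos> marker
```

Swapping the `max` for a draw from `probs` turns this same loop into a sampling algorithm, which is the family the survey covers next.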
The survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems. It presents a mathematical formalism that includes both classical generation algorithms and modern meta-generators. This unified view is particularly important as the field expands, providing insights into the historical context of generation algorithms and major algorithmic patterns.
Token-level generation algorithms include greedy decoding, beam search, and various sampling methods like nucleus sampling and temperature sampling. These methods aim to generate sequences by maximizing a score or sampling from a distribution. Meta-generation algorithms use multiple calls to generation algorithms with control flow and external information to produce text. Efficient generation methods focus on reducing token costs and improving generation speed, often leveraging ideas from machine learning systems.
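Temperature and nucleus (top-p) sampling, both named above, can be combined in one short routine. The sketch below assumes logits arrive as a token-to-score dictionary, which is a simplification for illustration; real implementations operate on vocabulary-sized tensors.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling.

    logits: dict mapping token -> unnormalized score.
    Keeps the smallest set of highest-probability tokens whose
    cumulative mass reaches top_p, renormalizes, and samples from it.
    """
    rng = rng or random.Random()
    # Temperature-scaled softmax (subtract max for numerical stability).
    scaled = {t: s / temperature for t, s in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()),
                   key=lambda kv: kv[1], reverse=True)
    # Nucleus: smallest prefix of tokens with cumulative mass >= top_p.
    nucleus, cum = [], 0.0
    for t, p in probs:
        nucleus.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the nucleus and sample.
    total = sum(p for _, p in nucleus)
    r = rng.random() * total
    for t, p in nucleus:
        r -= p
        if r <= 0:
            return t
    return nucleus[-1][0]
```

Lowering `temperature` sharpens the distribution toward greedy decoding; lowering `top_p` shrinks the candidate set, which is one concrete way the diversity–coherence trade-off discussed later is controlled.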
The survey discusses the importance of efficient generation, especially as LLMs are integrated into algorithms that call models many times. It covers various techniques for making generation fast and cost-effective, including methods that speed up generation from a systems perspective. The survey also addresses the trade-off between diversity and coherence in generation, and how different algorithms can balance these aspects.
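One common pattern of calling the model many times is best-of-N reranking: generate several candidates and keep the one preferred by an external scorer. The sketch below is a generic illustration of that meta-generation pattern; `generate` and `score` are hypothetical stand-ins (e.g. a sampling call and a reward model), not APIs defined by the survey.

```python
def best_of_n(generate, score, n=5):
    """Best-of-N meta-generation: call an underlying generation
    algorithm n times and return the highest-scoring candidate.

    generate: () -> str, one call to a generation algorithm.
    score: str -> float, an external quality signal.
    """
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

The n model calls make the token cost explicit: efficiency techniques matter precisely because meta-generators like this multiply the price of a single generation.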
Overall, the survey provides a comprehensive overview of inference-time algorithms for LLMs, highlighting the importance of these methods in improving the performance and efficiency of language models. It emphasizes the need for further research on inference-time approaches to enhance the capabilities of LLMs in various applications.