24 Jun 2024 | Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
This survey explores inference-time algorithms for large language models (LLMs), focusing on three areas: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling one token at a time or constructing a token-level search space. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation.
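The token-at-a-time loop described above can be sketched in a few lines. This is a minimal illustration, not an algorithm from the survey itself: `next_token_probs` is a hypothetical toy stand-in for a language model's next-token distribution, and `greedy_decode` shows the simplest case of picking the most probable token at each step.

```python
# Toy stand-in for a language model: maps a context (sequence of tokens)
# to a next-token distribution. A real LLM would return logits over a
# large vocabulary here; this table is purely illustrative.
def next_token_probs(context):
    table = {
        ("<bos>",): {"the": 0.6, "a": 0.4},
        ("<bos>", "the"): {"cat": 0.7, "dog": 0.3},
        ("<bos>", "the", "cat"): {"<eos>": 1.0},
        ("<bos>", "the", "dog"): {"<eos>": 1.0},
    }
    # Unknown contexts fall back to ending the sequence.
    return table.get(tuple(context), {"<eos>": 1.0})

def greedy_decode(max_len=10):
    """Token-level generation: pick the most probable token each step."""
    context = ["<bos>"]
    while len(context) < max_len:
        probs = next_token_probs(context)
        token = max(probs, key=probs.get)
        if token == "<eos>":
            break
        context.append(token)
    return context[1:]  # drop the <bos> marker
```

Swapping the `max` for a draw from `probs` turns this same loop into a sampling algorithm, which is the family the survey covers next.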
The survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems. It presents a mathematical formalism that includes both classical generation algorithms and modern meta-generators. This unified view is particularly important as the field expands, providing insights into the historical context of generation algorithms and major algorithmic patterns.
Token-level generation algorithms include greedy decoding, beam search, and various sampling methods like nucleus sampling and temperature sampling. These methods aim to generate sequences by maximizing a score or sampling from a distribution. Meta-generation algorithms use multiple calls to generation algorithms with control flow and external information to produce text. Efficient generation methods focus on reducing token costs and improving generation speed, often leveraging ideas from machine learning systems.
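Temperature and nucleus (top-p) sampling, both named above, can be combined in one short routine. The sketch below assumes logits arrive as a token-to-score dictionary, which is a simplification for illustration; real implementations operate on vocabulary-sized tensors.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling.

    logits: dict mapping token -> unnormalized score.
    Keeps the smallest set of highest-probability tokens whose
    cumulative mass reaches top_p, renormalizes, and samples from it.
    """
    rng = rng or random.Random()
    # Temperature-scaled softmax (subtract max for numerical stability).
    scaled = {t: s / temperature for t, s in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()),
                   key=lambda kv: kv[1], reverse=True)
    # Nucleus: smallest prefix of tokens with cumulative mass >= top_p.
    nucleus, cum = [], 0.0
    for t, p in probs:
        nucleus.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the nucleus and sample.
    total = sum(p for _, p in nucleus)
    r = rng.random() * total
    for t, p in nucleus:
        r -= p
        if r <= 0:
            return t
    return nucleus[-1][0]
```

Lowering `temperature` sharpens the distribution toward greedy decoding; lowering `top_p` shrinks the candidate set, which is one concrete way the diversity–coherence trade-off discussed later is controlled.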
The survey discusses the importance of efficient generation, especially as LLMs are integrated into algorithms that call models many times. It covers various techniques for making generation fast and cost-effective, including methods that speed up generation from a systems perspective. The survey also addresses the trade-off between diversity and coherence in generation, and how different algorithms can balance these aspects.
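One common pattern of calling the model many times is best-of-N reranking: generate several candidates and keep the one preferred by an external scorer. The sketch below is a generic illustration of that meta-generation pattern; `generate` and `score` are hypothetical stand-ins (e.g. a sampling call and a reward model), not APIs defined by the survey.

```python
def best_of_n(generate, score, n=5):
    """Best-of-N meta-generation: call an underlying generation
    algorithm n times and return the highest-scoring candidate.

    generate: () -> str, one call to a generation algorithm.
    score: str -> float, an external quality signal.
    """
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

The n model calls make the token cost explicit: efficiency techniques matter precisely because meta-generators like this multiply the price of a single generation.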
Overall, the survey provides a comprehensive overview of inference-time algorithms for LLMs, highlighting the importance of these methods in improving the performance and efficiency of language models. It emphasizes the need for further research on inference-time approaches to enhance the capabilities of LLMs in various applications.