17 Jun 2024 | Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam
This paper provides a comprehensive analysis of decoding methods for large language models (LLMs), evaluating their performance, robustness, and speed across a wide range of tasks, models, and deployment environments. The study finds that the optimal decoding method depends on the specific task, model, and priority (e.g., performance vs. robustness vs. speed). Deterministic methods generally perform better on closed-ended tasks, while stochastic methods are more effective on open-ended tasks. Hyperparameter choice significantly affects each method's performance, and some methods achieve superior results only at the cost of extensive hyperparameter tuning. Stochastic methods such as temperature sampling and top-$p$ sampling are generally faster than deterministic methods, but their performance is more sensitive to hyperparameters. The study also highlights the value of self-consistency for stochastic methods, which can improve task performance by sampling multiple generations and taking a majority vote over their final answers. Finally, the paper examines the impact of model size and quantization, showing that larger models and quantization techniques can narrow the performance gaps across decoding methods. The findings offer practical guidance for researchers and practitioners selecting and tuning decoding methods for LLMs.
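To make the mechanics concrete, here is a minimal sketch, not taken from the paper, of temperature plus top-$p$ (nucleus) sampling over a single logit vector, together with a simple self-consistency majority vote. The function names `sample_next_token` and `self_consistency` are illustrative assumptions; a real decoder would apply the sampling step autoregressively at every position.

```python
import numpy as np
from collections import Counter

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample one token index from a logit vector using temperature and top-p (nucleus) sampling."""
    rng = rng or np.random.default_rng()
    # Temperature scaling: higher temperature flattens the distribution,
    # lower temperature sharpens it toward the argmax (greedy) choice.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    probs /= probs.sum()
    # Nucleus filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=kept_probs)

def self_consistency(answers):
    """Majority vote over the final answers extracted from several sampled generations."""
    return Counter(answers).most_common(1)[0][0]
```

In this sketch, lowering `temperature` or `top_p` pushes sampling toward greedy (deterministic) decoding, which illustrates why the paper finds these hyperparameters so influential on closed- versus open-ended tasks; `self_consistency` simply aggregates multiple sampled answers by majority vote, as the summary describes.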