April 2024 | YUXI LI, YI LIU, GELEI DENG, YING ZHANG, WENJIA SONG, LING SHI, KAILONG WANG, YUEKANG LI, YANG LIU, HAOYU WANG
This paper introduces and systematically explores the phenomenon of "glitch tokens" in large language models (LLMs): anomalous tokens produced by established tokenizers that can degrade the quality of a model's responses. The study investigates the unexpected behaviors glitch tokens trigger in LLMs, categorizes the tokens, measures their frequency in real-world datasets, and proposes an efficient detection method.

The empirical study covers seven popular LLMs built on three distinct tokenizers, spanning 182,517 tokens in total. It identifies 7,895 glitch tokens and groups them into five types according to their characteristics and the unexpected behaviors they induce, such as spelling mistakes, hallucinatory completions, and random character outputs. The study also shows that glitch tokens are prevalent in real-world datasets, underscoring the need to address them to keep LLMs reliable and safe.

A key finding is that glitch tokens tend to cluster in the embedding space. Building on this observation, the paper proposes GLITCHHUNTER, a novel iterative clustering-based technique that detects glitch tokens efficiently, reducing the number of model queries and accelerating detection. In the evaluation, GLITCHHUNTER significantly outperforms three baseline methods on eight open-source LLMs. Overall, the work provides the first comprehensive analysis of glitch tokens, offers a practical solution for detecting them, and yields valuable insights into mitigating tokenization-related errors in LLMs.
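To make the clustering idea concrete, below is a minimal sketch in Python of how an iterative clustering-based detector might work. The summary does not spell out GLITCHHUNTER's actual algorithm, so this is an assumed design, not the paper's implementation: it clusters token embeddings, probes a sample from each cluster with a repetition task, and keeps probing only clusters where glitch tokens concentrate. The names repetition_probe, hit_threshold, and the Hugging Face-style model/tokenizer interface are hypothetical.

```python
# Hypothetical sketch of iterative clustering-based glitch-token detection.
# This is NOT the paper's GLITCHHUNTER algorithm, only an illustration of
# the general idea that glitch tokens cluster in embedding space.
import numpy as np
from sklearn.cluster import KMeans


def repetition_probe(model, tokenizer, token_id,
                     template="Please repeat the following string exactly: '{}'"):
    """Ask the model to echo a token back; failure to reproduce it verbatim
    is treated (as an assumption) as the glitch-token signal."""
    token_str = tokenizer.decode([token_id])
    inputs = tokenizer(template.format(token_str), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    reply = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return token_str not in reply  # True -> candidate glitch token


def detect_glitch_tokens(model, tokenizer, embeddings, n_clusters=64,
                         samples_per_cluster=8, hit_threshold=0.5, max_rounds=5):
    """Iteratively narrow probing to embedding-space clusters whose sampled
    tokens fail the repetition probe at a high rate."""
    rng = np.random.default_rng(0)
    candidates = np.arange(embeddings.shape[0])  # all token ids to start
    confirmed = set()
    for _ in range(max_rounds):
        k = min(n_clusters, len(candidates))
        if k < 2:
            break
        labels = KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings[candidates])
        survivors = []
        for c in range(k):
            members = candidates[labels == c]
            if len(members) == 0:
                continue
            sample = rng.choice(members, size=min(samples_per_cluster, len(members)),
                                replace=False)
            hits = [int(t) for t in sample if repetition_probe(model, tokenizer, int(t))]
            confirmed.update(hits)
            # Exploit the clustering observation: only clusters with a high
            # glitch-token density are carried into the next round.
            if len(hits) / len(sample) >= hit_threshold:
                survivors.append(members[~np.isin(members, list(confirmed))])
        if not survivors:
            break
        candidates = np.concatenate(survivors)
    return confirmed
```

The point of the pruning step mirrors the paper's key observation: because glitch tokens cluster in the embedding space, discarding low-density clusters early would cut the number of model queries dramatically compared with exhaustively probing all 182,517 tokens.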