Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection

19 Apr 2024

Yuxi Li* (Huazhong University of Science and Technology, China), Yi Liu* (Nanyang Technological University, Singapore), Gelei Deng (Nanyang Technological University, Singapore), Ying Zhang (Virginia Tech, USA), Wenjia Song (Virginia Tech, USA), Ling Shi (Nanyang Technological University, Singapore), Kailong Wang† (Huazhong University of Science and Technology, China), Yuekang Li (The University of New South Wales, Australia), Yang Liu (Nanyang Technological University, Singapore), Haoyu Wang (Huazhong University of Science and Technology, China)
This paper investigates the phenomenon of "glitch tokens" in large language models (LLMs): anomalous tokens that can lead to unpredictable or nonsensical outputs. The authors systematically explore the behavior of seven popular LLMs built on three distinct tokenizers, identifying and categorizing 7,895 glitch tokens. They propose GLITCHHUNTER, an iterative clustering-based technique for efficient glitch token detection, which significantly reduces the number of model queries and accelerates detection. In the evaluation, GLITCHHUNTER outperforms three baseline methods on eight open-source LLMs, achieving up to 99.44% precision and 63.20% recall. The work provides valuable insights into mitigating tokenization-related errors in LLMs and constitutes the first comprehensive study of glitch tokens.
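The iterative clustering idea behind GLITCHHUNTER can be illustrated with a small sketch: cluster candidate tokens by their embeddings, probe a few tokens per cluster with a repetition-style test, and refine only the clusters where probes revealed glitch tokens. Everything below is illustrative, not the paper's actual implementation: the embeddings, the simulated ground-truth glitch set, and the `repetition_test` stand-in (a real detector would query the LLM to repeat each token and compare the output).

```python
import random

random.seed(0)
VOCAB, DIM = 200, 8
# Hypothetical token embeddings; a real detector would read these
# from the model's embedding matrix.
embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
TRUE_GLITCH = set(random.sample(range(VOCAB), 10))  # simulated ground truth


def repetition_test(token_id):
    """Stand-in for an LLM query: ask the model to repeat the token
    and flag it when the output does not match. Simulated here."""
    return token_id in TRUE_GLITCH


def dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def detect(n_clusters=8, n_iters=3, probe_budget=5):
    """Iterative clustering sketch: partition candidates around random
    centers, probe a few tokens per cluster, and keep refining only the
    clusters in which a probe found a glitch token."""
    found = set()
    candidates = list(range(VOCAB))
    for _ in range(n_iters):
        if not candidates:
            break
        k = min(n_clusters, len(candidates))
        centers = random.sample(candidates, k)
        clusters = {c: [] for c in range(k)}
        for t in candidates:
            c = min(range(k),
                    key=lambda i: dist(embeddings[t], embeddings[centers[i]]))
            clusters[c].append(t)
        nxt = []
        for members in clusters.values():
            probes = members[:probe_budget]           # cheap probe budget
            hits = [t for t in probes if repetition_test(t)]
            found.update(hits)
            if hits:                                  # glitch-prone cluster
                nxt.extend(t for t in members if t not in found)
        candidates = nxt                              # refine next round
    return found
```

Concentrating queries on glitch-prone clusters is what lets this style of detection avoid exhaustively testing the whole vocabulary; every token the sketch reports has passed the repetition test, so (under these assumptions) precision is high while recall depends on the probe budget and iteration count.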