YUXI LI*, Huazhong University of Science and Technology, China
YI LIU*, Nanyang Technological University, Singapore
GELEI DENG, Nanyang Technological University, Singapore
YING ZHANG, Virginia Tech, USA
WENJIA SONG, Virginia Tech, USA
LING SHI, Nanyang Technological University, Singapore
KAILONG WANG†, Huazhong University of Science and Technology, China
YUEKANG LI, The University of New South Wales, Australia
YANG LIU, Nanyang Technological University, Singapore
HAOYU WANG, Huazhong University of Science and Technology, China
This paper investigates the phenomenon of "glitch tokens" in large language models (LLMs): anomalous tokens that can trigger unpredictable or nonsensical outputs. The authors systematically examine the behavior of seven popular LLMs built on three distinct tokenizers, identifying and categorizing 7,895 glitch tokens. They propose GLITCHHUNTER, an iterative clustering-based technique for efficient glitch token detection that substantially reduces the number of model queries and thereby accelerates detection. In an evaluation on eight open-source LLMs, GLITCHHUNTER outperforms three baseline methods, achieving up to 99.44% precision and 63.20% recall. The study offers practical insights into mitigating tokenization-related errors in LLMs and constitutes the first comprehensive study of glitch tokens.
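To make the abstract's core idea concrete, the following is a minimal sketch of iterative clustering-based glitch token detection in the spirit of the approach summarized above. It is not the authors' implementation: the toy embedding matrix, the cluster count, the per-cluster sampling fraction, and the repetition-test oracle `fails_repetition` are all illustrative assumptions. The key idea it demonstrates is that glitch tokens tend to cluster in embedding space, so probing a small sample per cluster and iterating only on "hot" clusters cuts the query count relative to testing every token.

```python
# Minimal sketch (assumptions throughout): iterative clustering-based
# glitch-token detection, inspired by the GLITCHHUNTER summary above.
import numpy as np
from sklearn.cluster import KMeans


def fails_repetition(token_id: int) -> bool:
    """Hypothetical oracle: ask the LLM to repeat the token verbatim and
    report whether the reply mismatches. Stubbed here with a fixed set."""
    KNOWN_GLITCHES = {3, 17}  # placeholder ground truth for this demo only
    return token_id in KNOWN_GLITCHES


def detect_glitch_tokens(embeddings: np.ndarray, n_clusters: int = 4,
                         rounds: int = 3) -> set[int]:
    candidates = np.arange(len(embeddings))  # start from the full vocabulary
    found: set[int] = set()
    for _ in range(rounds):
        if len(candidates) < n_clusters:
            break
        km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
        labels = km.fit_predict(embeddings[candidates])
        hot_clusters = []
        for c in range(n_clusters):
            members = candidates[labels == c]
            # Probe only a small sample per cluster instead of every token;
            # this sampling is what reduces the number of model queries.
            sample = members[: max(1, len(members) // 8)]
            hits = [int(t) for t in sample if fails_repetition(int(t))]
            found.update(hits)
            if hits:  # glitch tokens cluster together, so keep exploring
                hot_clusters.append(members)  # only clusters with hits
        if not hot_clusters:
            break
        candidates = np.concatenate(hot_clusters)
    return found


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(256, 16))  # toy 256-token vocabulary
    print(sorted(detect_glitch_tokens(emb)))
```

In a real setting, `embeddings` would be the LLM's input embedding matrix and the oracle would issue actual repetition prompts to the model; the 1/8 sampling fraction above is an arbitrary knob chosen for the demo.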