Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

27 May 2024 | Dixuan Wang, Yanda Li, Junyuan Jiang, Zepeng Ding, Guochao Jiang, Jiaqing Liang*, Deqing Yang*
This paper investigates the vulnerability of Large Language Models (LLMs) to tokenization errors, which can lead to inaccurate responses to specific queries. The authors construct an adversarial dataset, ADT (Adversarial Dataset for Tokenizer), to challenge the tokenization capabilities of various LLMs. ADT consists of two subsets: ADT-Human, which is manually constructed, and ADT-Auto, which is automatically generated. The dataset contains sentences with challenging tokens that disrupt conventional tokenization, causing LLMs to produce incorrect responses. The authors evaluate several leading LLMs, including GPT-4o, Llama-3, and Qwen2.5-max, on both subsets of ADT. The results show that ADT effectively challenges these models' tokenization, leading to high error rates in their responses. The paper also introduces an automatic data-generation framework that can be applied to any open-source LLM. The study highlights the importance of optimizing LLMs' tokenization processes and algorithms to improve their performance and address the limitations of current models.
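To make the core idea concrete, the sketch below shows one way to probe a tokenizer for "challenging" strings: tokenize a piece of text and check whether any produced token straddles an intended word boundary, which is the kind of misalignment the ADT sentences are built to trigger. This is only an illustrative sketch, not the authors' released pipeline; the "gpt2" tokenizer, the example string "expertsexchange", and the helper crosses_boundary are stand-ins chosen for demonstration.

```python
# Minimal sketch (not the paper's official code) of probing a BPE tokenizer
# for tokens that straddle an intended word boundary.
from transformers import AutoTokenizer

def crosses_boundary(tokenizer, text, boundary):
    """Return True if some token's character span straddles `boundary`,
    an index into `text` that separates two intended words."""
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    for start, end in enc["offset_mapping"]:
        if start < boundary < end:  # token covers characters on both sides
            return True
    return False

if __name__ == "__main__":
    # "gpt2" is just a stand-in BPE tokenizer for illustration.
    tok = AutoTokenizer.from_pretrained("gpt2")
    # "expertsexchange" is intended to read as "experts" + "exchange",
    # so the intended boundary sits after the first 7 characters.
    text = "expertsexchange"
    print(tok.tokenize(text))                       # inspect the actual split
    print(crosses_boundary(tok, text, boundary=7))  # did a token straddle it?
```

In the same spirit, an automatic framework could scan a model's vocabulary for tokens whose character spans cut across natural word boundaries and then embed them into fluent sentences, yielding adversarial inputs for any open-source LLM whose tokenizer is accessible.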