Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

2024 | Aaditya K. Singh, DJ Strouse
This paper investigates the impact of tokenization on arithmetic performance in frontier large language models (LLMs). Tokenization, the process of dividing input text into discrete tokens, is often overlooked but can significantly influence model performance. The study focuses on how different tokenization schemes affect numerical reasoning, particularly in GPT-3.5 and GPT-4 models.

The research finds that right-to-left (R2L) tokenization, enforced by using commas to separate digits, leads to significantly improved performance on arithmetic tasks compared to left-to-right (L2R) tokenization. This improvement is attributed to better alignment between the tokenization of numbers and the model's processing of arithmetic operations. The study also reveals that errors under L2R tokenization follow stereotyped patterns, suggesting that model computations are systematic rather than approximate.

The paper shows that models can convert between tokenization schemes, allowing chain-of-thought-inspired approaches to recover performance on L2R-tokenized inputs. The gap between R2L and L2R performance decreases as models scale, indicating that larger models are better at overcoming tokenization-dependent inductive biases.

The study highlights the importance of careful consideration of tokenization choices when developing models for numerical reasoning. It provides a thorough analysis of error patterns and demonstrates that tokenization can significantly affect model performance on arithmetic tasks. The findings suggest that tokenization-dependent inductive biases are a critical factor in the performance of large language models, and that practitioners should carefully evaluate and ablate these choices when working towards general models of numerical reasoning.
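The comma-based intervention described above can be sketched in a few lines: inserting a comma every three digits from the right forces a left-to-right BPE tokenizer to break the number into right-aligned three-digit chunks. The helper name `to_r2l` is illustrative, not from the paper; Python's built-in `,` format specifier happens to produce exactly this grouping.

```python
def to_r2l(num_str: str) -> str:
    """Insert commas every three digits from the right,
    e.g. "1234567" -> "1,234,567".

    Comma-separating digits this way nudges a left-to-right
    tokenizer into chunking the number right-to-left (R2L),
    the intervention the paper uses to improve arithmetic.
    """
    return f"{int(num_str):,}"


# Without commas, a BPE tokenizer typically chunks "1234567"
# left-to-right as e.g. [123][456][7]; with commas, the digit
# groups align right-to-left as [1][234][567].
print(to_r2l("1234567"))  # 1,234,567
print(to_r2l("42"))       # 42
```

A prompt can apply this transformation to the operands before asking the model to add them, which is the chain-of-thought-style conversion between tokenization schemes that the paper reports.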