This paper investigates the impact of tokenization on numerical reasoning in large language models (LLMs), focusing on arithmetic tasks. Tokenization, the process of dividing input text into discrete tokens, is often overlooked but can introduce inductive biases that affect model performance. The study compares left-to-right (L2R) and right-to-left (R2L) tokenization schemes for numbers, finding that R2L tokenization significantly improves arithmetic performance, especially in smaller models such as GPT-3.5. The errors made by models using L2R tokenization exhibit stereotyped patterns, suggesting systematic but flawed reasoning. The paper also demonstrates that models can convert between tokenizations to improve performance, and that larger models are better at overriding tokenization-induced biases. Overall, the findings highlight the importance of carefully considering tokenization choices in LLMs to ensure accurate numerical reasoning.
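To make the L2R/R2L distinction concrete, the sketch below contrasts the two schemes on a seven-digit number. This is an illustrative assumption rather than code from the paper: the three-digit chunk size mirrors common BPE number tokenizers, and the function names are hypothetical.

```python
# Minimal sketch (assumption: 3-digit chunks, as in common BPE tokenizers;
# function names are illustrative, not the paper's implementation).

def tokenize_l2r(number: str, chunk: int = 3) -> list[str]:
    """Group digits left-to-right: '1234567' -> ['123', '456', '7']."""
    return [number[i:i + chunk] for i in range(0, len(number), chunk)]

def tokenize_r2l(number: str, chunk: int = 3) -> list[str]:
    """Group digits right-to-left: '1234567' -> ['1', '234', '567'],
    so chunks align with place value, like thousands separators."""
    rev = number[::-1]
    groups = [rev[i:i + chunk][::-1] for i in range(0, len(rev), chunk)]
    return groups[::-1]

if __name__ == "__main__":
    n = "1234567"
    print(tokenize_l2r(n))  # ['123', '456', '7']  -- place values misaligned
    print(tokenize_r2l(n))  # ['1', '234', '567']  -- place values aligned
```

The example illustrates why the choice matters for arithmetic: under R2L grouping, the final token of each number always holds the ones-through-hundreds places, whereas under L2R grouping the place value of each token shifts with the number's length, which is one plausible source of the stereotyped error patterns the paper reports.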