RHO-1: Not All Tokens Are What You Need

23 May 2024 | Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
This paper introduces RHO-1, a language model pre-trained with Selective Language Modeling (SLM) to improve the efficiency and performance of pre-training. Unlike traditional language models, which apply a next-token prediction loss uniformly to every token, RHO-1 trains only on the tokens that are most beneficial for learning. SLM first scores pre-training tokens with a reference model, then trains the language model with a focused loss on the tokens that receive higher scores. This leads to significant improvements in few-shot accuracy and downstream-task performance.

The paper shows that not all tokens in a corpus are equally important for language model training. Analyzing token-level training dynamics, the authors find that many tokens are "easy tokens" the model has already learned, while others are "hard tokens" whose losses fluctuate and resist convergence; the latter cause numerous ineffective gradient updates. SLM is shown to substantially improve token efficiency during pre-training, and it effectively identifies tokens relevant to the target distribution: models trained on the selected tokens achieve improved perplexity on downstream benchmarks.

In experiments, RHO-1 achieves state-of-the-art results on the MATH dataset, with RHO-1-1B and RHO-1-7B reaching 40.6% and 51.8% accuracy, respectively. When pre-trained on 80B general tokens, RHO-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training. The paper also shows that SLM works in a self-referencing setup, yielding average downstream improvements of up to 3.3%. Together, these results demonstrate that SLM is an effective approach to more efficient and better-performing language model pre-training.
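To make the procedure concrete, below is a minimal PyTorch-style sketch of the selective loss based on the description above: each token is scored by its excess loss (current model loss minus reference model loss), and only the top-k% of tokens contribute to the training objective. The function and argument names (`selective_lm_loss`, `top_k_percent`) and the Hugging Face-style `.logits` access are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(model, reference_model, input_ids, top_k_percent=0.6):
    """Selective Language Modeling sketch: train only on tokens whose
    'excess loss' (current model loss minus frozen reference model loss)
    falls in the top k%. Hypothetical helper, not the paper's code."""
    # Shift so each position predicts the next token.
    inputs, targets = input_ids[:, :-1], input_ids[:, 1:]

    # Per-token cross-entropy under the model being trained.
    logits = model(inputs).logits  # assumes an HF-style causal-LM output
    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        reduction="none",
    )

    # Per-token loss under the frozen reference model (no gradients).
    with torch.no_grad():
        ref_logits = reference_model(inputs).logits
        ref_loss = F.cross_entropy(
            ref_logits.reshape(-1, ref_logits.size(-1)), targets.reshape(-1),
            reduction="none",
        )

    # Score = excess loss; high scores mark tokens the reference model
    # handles well but the current model has not yet learned.
    excess = token_loss.detach() - ref_loss
    k = max(1, int(top_k_percent * excess.numel()))
    threshold = torch.topk(excess, k).values.min()
    mask = (excess >= threshold).float()

    # Focused loss: average only over the selected tokens.
    return (token_loss * mask).sum() / mask.sum()
```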
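Two practical notes on the sketch: because the reference model is frozen, its per-token losses would normally be computed once over the corpus offline rather than re-evaluated at every step as done above for clarity; and the excess-loss scores are detached, so token selection acts purely as a data filter while only the focused loss on the selected tokens drives the gradient update.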