RHO-1: Not All Tokens Are What You Need

23 May 2024 | Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
The paper "RHO-1: Not All Tokens Are What You Need" by Zhenghao Lin et al. challenges the conventional approach of applying a next-token prediction loss uniformly to all training tokens during language model pretraining. Instead, it proposes Selective Language Modeling (SLM), which trains only on the useful tokens that align with the desired distribution. The authors first analyze token-level training dynamics and find that different tokens exhibit distinct loss patterns. They then introduce RHO-1, a language model trained with SLM: pretraining tokens are scored with a reference model, and the language model is trained with a focused loss on the higher-scoring tokens. Evaluated with continual pretraining on the 15B-token OpenWebMath corpus, the approach yields absolute few-shot accuracy improvements of up to 30% across 9 math tasks. After fine-tuning, RHO-1-1B and RHO-1-7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath while using only 3% of its pretraining tokens. When pretraining on 80B general-domain tokens, RHO-1 also achieves a 6.8% average improvement across 15 diverse tasks, demonstrating both the efficiency and the effectiveness of SLM. The paper includes detailed experimental setups, evaluation methods, and ablation studies that validate the approach.
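To make the SLM idea concrete, below is a minimal PyTorch sketch of a selective loss: a frozen reference model scores each token, tokens where the trained model's loss most exceeds the reference loss are kept, and the cross-entropy is averaged only over those tokens. The function name, the `keep_ratio` hyperparameter, and the top-k thresholding are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a Selective Language Modeling (SLM) style loss.
# Assumes student_logits and reference_logits are already aligned with target_ids
# for next-token prediction; details of RHO-1's actual scoring pipeline may differ.
import torch
import torch.nn.functional as F


def slm_loss(student_logits, reference_logits, target_ids, keep_ratio=0.6):
    """Cross-entropy averaged only over tokens selected by excess loss.

    student_logits:   (batch, seq_len, vocab) logits from the model being trained
    reference_logits: (batch, seq_len, vocab) logits from a frozen reference model
    target_ids:       (batch, seq_len) next-token targets
    keep_ratio:       fraction of tokens to train on (hypothetical hyperparameter)
    """
    def token_ce(logits):
        # Per-token cross-entropy, reshaped back to (batch, seq_len).
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            reduction="none",
        ).reshape(target_ids.shape)

    student_ce = token_ce(student_logits)
    with torch.no_grad():
        reference_ce = token_ce(reference_logits)
        # Excess loss: tokens where the student lags the reference are treated as useful.
        excess = student_ce.detach() - reference_ce
        # Keep roughly the top keep_ratio fraction of tokens in the batch.
        k = max(1, int(keep_ratio * excess.numel()))
        threshold = torch.topk(excess.flatten(), k).values.min()
        mask = (excess >= threshold).float()

    # Backpropagate only through the selected tokens.
    return (student_ce * mask).sum() / mask.sum().clamp(min=1.0)
```

In this sketch the reference model contributes only a score and receives no gradients, so the extra cost over standard pretraining is one additional forward pass per batch.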