Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

29 Jan 2024 | Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
**Summary:** This paper introduces Web Rephrase Augmented Pre-training (WRAP), a method for training large language models (LLMs) on a combination of real web data and synthetic data produced by rephrasing web documents. WRAP uses an off-the-shelf instruction-tuned model to rephrase documents in different styles, such as "like Wikipedia" or "in question-answer format," to improve pre-training efficiency and performance. The approach substantially reduces pre-training time while improving perplexity and zero-shot question-answering accuracy across multiple tasks. It also eases data curation by adding style diversity and reducing reliance on carefully filtered high-quality web data. The rephrased synthetic data is more effective than raw web text because it is of higher quality and its style diversity better matches the styles of downstream evaluations. Experiments show that WRAP outperforms models trained on real data alone, reaching better performance with less data and compute, and that synthetic data can be combined with real data to improve results on tasks including zero-shot question answering and language modeling. The paper emphasizes the importance of data diversity and the benefits of synthetic data in pre-training, while noting challenges such as data leakage and the need for careful data selection. Overall, WRAP provides a cost-effective and compute-efficient recipe for pre-training LLMs with synthetic data generated through rephrasing.
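To make the idea concrete, here is a minimal illustrative sketch of the rephrase-and-mix step, not taken from the paper: the prompt wording, the `rephrase` and `build_training_mix` helpers, and the 1:1 real-to-synthetic mixing are all assumptions for illustration; the actual model call is abstracted as a user-supplied `generate` callable standing in for an off-the-shelf instruction-tuned model.

```python
# Hypothetical sketch of the WRAP-style data pipeline (not the authors' code):
# rephrase each web document in one or more target styles with an
# instruction-tuned model, then mix real and synthetic documents for pre-training.

from typing import Callable, Iterable

# Assumed prompt templates; the paper's exact prompts may differ.
STYLE_PROMPTS = {
    "wikipedia": "Rewrite the following passage in a concise, encyclopedic style like Wikipedia:\n\n{doc}",
    "qa": "Convert the following passage into a question-and-answer format:\n\n{doc}",
}

def rephrase(doc: str, style: str, generate: Callable[[str], str]) -> str:
    """Rephrase one document in the given style via the instruction-tuned model wrapped by `generate`."""
    prompt = STYLE_PROMPTS[style].format(doc=doc)
    return generate(prompt)

def build_training_mix(web_docs: Iterable[str],
                       generate: Callable[[str], str],
                       styles: tuple[str, ...] = ("wikipedia", "qa")) -> list[str]:
    """Return a mix of real web documents and their synthetic rephrasings (assumed mixing scheme)."""
    mix: list[str] = []
    for doc in web_docs:
        mix.append(doc)                                 # keep the original web document
        for style in styles:
            mix.append(rephrase(doc, style, generate))  # add a synthetic rephrasing per style
    return mix
```

In use, `generate` would wrap whatever instruction-tuned model is available, and the resulting `mix` would be tokenized and shuffled into the pre-training corpus alongside untouched web data.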