29 Jan 2024 | Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
This paper introduces Web Rephrase Augmented Pre-training (WRAP), a method for enhancing the pre-training of large language models (LLMs) with synthetic data generated by rephrasing web documents. WRAP addresses the challenges of data curation, computational efficiency, and training-data quality. By rephrasing web documents into different styles, such as a Wikipedia-like style or a question-answer format, WRAP leverages the natural diversity of the web to produce high-quality synthetic data. Applied to the C4 dataset, the method yields a roughly 3x speed-up in pre-training and about a 10% improvement in perplexity. Models trained with WRAP also outperform models trained on synthetic data alone, demonstrating the benefit of combining real and synthetic data. The study highlights the importance of the rephrasing style and the impact of synthetic data on out-of-distribution performance, providing insights into how training data for LLMs should be composed.
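For intuition, below is a minimal sketch of what a WRAP-style rephrase-and-mix step could look like. The prompt wording, the rephrasing model checkpoint, and the 1:1 real/synthetic mixing ratio are illustrative assumptions, not the paper's exact setup.

```python
"""Sketch of a WRAP-style rephrase-and-mix pipeline (assumptions noted in comments)."""
import random
from transformers import pipeline

# Assumption: any instruction-tuned LM can serve as the rephraser;
# this checkpoint is a placeholder, not necessarily the one used in the paper.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

# Assumption: illustrative prompts for two of the rephrasing styles described above.
STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a diverse, Wikipedia-like style:\n\n",
    "qa": "Convert the following text into a question-answer format:\n\n",
}

def rephrase(doc: str, style: str = "wikipedia") -> str:
    """Generate one synthetic rephrasing of a web document in the chosen style."""
    prompt = STYLE_PROMPTS[style] + doc
    out = rephraser(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

def build_training_mix(web_docs: list[str]) -> list[str]:
    """Combine real web text with its rephrasings; WRAP pre-trains on both jointly."""
    synthetic = [rephrase(d) for d in web_docs]
    mixed = web_docs + synthetic  # assumption: a simple 1:1 real/synthetic mix
    random.shuffle(mixed)
    return mixed
```

The key design point the sketch tries to convey is that the synthetic documents are paraphrases of real web text rather than free-form generations, so the mixed corpus keeps the factual content and diversity of the web while adding cleaner, more structured styles.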