Better Synthetic Data by Retrieving and Transforming Existing Datasets

26 Apr 2024 | Saumya Gandhi*, Ritu Gala*, Vijay Viswanathan, Tongshuang Wu, Graham Neubig
The paper introduces DataTune, a method for improving synthetic dataset generation by transforming existing publicly available datasets so that they better align with the requirements of a target task. DataTune addresses a key limitation of current synthetic data generation methods, which often produce data that lacks complexity and diversity. The method involves two main steps: dataset retrieval and dataset transformation. Dataset retrieval uses a large language model (LLM) to find relevant datasets in a public repository and then reranks the candidates to select the most suitable one. Dataset transformation then uses the selected dataset and a detailed task description to generate a synthetic dataset that better matches the target task.

The paper evaluates DataTune on six challenging language tasks from the BIG-Bench benchmark, showing that it improves performance over few-shot prompting and over existing synthetic data generation methods. By increasing the diversity and complexity of generated data, DataTune provides a practical way to improve the quality of synthetic datasets used for fine-tuning NLP models. The system is open-sourced to facilitate further research and application.
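The retrieve-rerank-transform pipeline lends itself to a short sketch. The code below is a minimal, hypothetical illustration of that loop, not the authors' released implementation: the LLM call and the dataset-repository search are passed in as callables, and the prompts, the `Candidate` structure, and the helper names are illustrative assumptions.

```python
# Minimal sketch of a DataTune-style pipeline (illustrative, not the authors' code).
# The LLM and the dataset-repository search are injected as callables so the
# sketch stays self-contained.

import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    name: str          # dataset identifier (e.g., a repository ID)
    description: str   # dataset card text, used for reranking
    rows: list         # raw examples from the dataset


def datatune(
    task_description: str,
    llm: Callable[[str], str],                  # any text-in/text-out LLM call
    search: Callable[[str], list],              # any dataset-repository search backend
    n_examples: int = 200,
) -> list:
    # Stage 1a: dataset retrieval -- the LLM writes a search query for the task.
    query = llm(
        f"Write a short search query for datasets useful for this task:\n{task_description}"
    )
    candidates = search(query)

    # Stage 1b: reranking -- the LLM picks the most suitable candidate by description.
    listing = "\n".join(
        f"{i}: {c.name} - {c.description}" for i, c in enumerate(candidates)
    )
    choice = llm(
        f"Task: {task_description}\nCandidate datasets:\n{listing}\n"
        "Reply with only the index of the most suitable dataset."
    )
    best = candidates[int(choice.strip())]

    # Stage 2: dataset transformation -- each source row is rewritten into an
    # input/output pair matching the target task, guided by the task description.
    synthetic = []
    for row in best.rows[:n_examples]:
        example = llm(
            f"Task: {task_description}\nSource example: {json.dumps(row)}\n"
            'Rewrite this example for the task as JSON: {"input": ..., "output": ...}'
        )
        synthetic.append(json.loads(example))
    return synthetic
```

In this sketch, the quality of the final dataset depends on both stages: retrieval and reranking decide which source data is worth transforming, while the transformation step adapts each row to the task description, which is how the method gains complexity and diversity beyond purely generated examples.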