This paper introduces DataTune, a method that improves synthetic data generation by transforming existing, publicly available datasets to better align with specific task requirements. The method involves two main steps: dataset retrieval, in which DataTune identifies relevant datasets from a large public collection, and dataset transformation, in which it rewrites those datasets to match the target task's needs. This approach significantly increases the diversity and difficulty of the generated data across many tasks.
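As a concrete illustration, a minimal sketch of this retrieve-then-transform loop might look like the following. The helper names (`search`, `llm_complete`), the data layout, and the prompt wording are assumptions for illustration, not DataTune's actual API.

```python
from typing import Callable

# Hypothetical sketch of the two-step pipeline: retrieve candidate datasets,
# then transform each example toward the target task. The `search` and
# `llm_complete` callables are illustrative stand-ins, not DataTune's API.

def retrieve_datasets(task_description: str,
                      search: Callable[[str, int], list[dict]],
                      k: int = 5) -> list[dict]:
    """Step 1: rank a large dataset collection against the task description."""
    return search(task_description, k)

def transform_example(example: dict,
                      task_description: str,
                      llm_complete: Callable[[str], str]) -> dict:
    """Step 2: rewrite one retrieved example to fit the target task."""
    prompt = (
        f"Target task: {task_description}\n"
        f"Source example: {example}\n"
        "Rewrite this example so its input and output form a valid "
        "training example for the target task."
    )
    return {"source": example, "transformed": llm_complete(prompt)}

def build_training_data(task_description: str,
                        search: Callable[[str, int], list[dict]],
                        llm_complete: Callable[[str], str]) -> list[dict]:
    """Full pipeline: retrieval followed by per-example transformation."""
    transformed = []
    for dataset in retrieve_datasets(task_description, search):
        for example in dataset.get("examples", []):
            transformed.append(
                transform_example(example, task_description, llm_complete))
    return transformed
```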
DataTune was evaluated on six diverse language-based tasks from the BIG-Bench benchmark. Results show that DataTune outperforms both few-shot prompting and existing synthetic data generation methods: on average, it improves performance by 49% over a few-shot prompting baseline and by 34% over existing methods that use synthetic or retrieved training data. DataTune also produces more difficult and diverse examples without sacrificing data correctness.
The paper also discusses the limitations of DataTune, including high LLM query costs, dependence on the Planning Module, handling of non-English data, and reliance on instruction-following LLMs. The authors suggest that future work should focus on making the system more accessible and scalable, such as generating code to execute transformation plans rather than querying an LLM for each instance.
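To make this cost argument concrete, the sketch below contrasts the per-instance querying approach with the suggested alternative of asking the LLM once for an executable transformation plan. The prompt wording and the `transform(example)` function contract are hypothetical, not taken from the paper.

```python
from typing import Callable

def transform_per_instance(examples: list[dict],
                           llm_complete: Callable[[str], str]) -> list[str]:
    # Current approach: one LLM query per example, so cost grows linearly
    # with dataset size.
    return [llm_complete(f"Transform this example: {ex}") for ex in examples]

def transform_via_generated_code(examples: list[dict],
                                 llm_complete: Callable[[str], str]) -> list[dict]:
    # Suggested direction: a single LLM query produces a reusable Python
    # function, which is then applied locally to every example.
    code = llm_complete(
        "Write a Python function `transform(example: dict) -> dict` that "
        "maps a source example to the target task's format.")
    namespace: dict = {}
    exec(code, namespace)  # caution: LLM-generated code should be sandboxed
    return [namespace["transform"](ex) for ex in examples]
```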
The paper also discusses ethical considerations: while DataTune could make it easier for the general public to build custom language models, open-ended generation technology of this kind also raises concerns about misuse. The authors weigh these risks against the benefit of making NLP models accessible to people outside the NLP community or without the resources to manually collect labeled data, and they open-source the system to encourage community contributions and improve its capabilities.