**Genie: ACHIEVING HUMAN PARITY IN CONTENT-GROUNDED DATASETS GENERATION**
The paper introduces Genie, a novel method for automatically generating high-quality content-grounded data, addressing the lack of such data in content-grounded generation tasks. Genie consists of three stages: Content Preparation, Generation, and Filtering. The method is used to create three large-scale synthetic datasets for Long-Form Question-Answering (LFQA), summarization, and information extraction. Human evaluations show that the generated data is natural and of high quality. Models trained on Genie-generated data perform on par with or outperform models trained on human-written data, with better faithfulness scores. The method is also applied to create LFQA data in the medical domain, demonstrating its versatility.
**Key Contributions:**
1. **Genie Method:** A three-stage process for generating high-quality content-grounded data.
2. **Large-Scale Datasets:** Synthetic datasets for LFQA, summarization, and information extraction.
3. **Performance Comparison:** Models trained on Genie-generated data match or outperform models trained on human-written data, with better faithfulness scores.
4. **Domain Adaptation:** Genie can generate high-quality data for different domains, including the medical domain.
**Methodology:**
- **Content Preparation:** Extracting content from source data.
- **Generation:** Using a large language model to generate task-specific examples.
- **Filtering:** Ensuring data quality through format, faithfulness, and quality scoring.
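The three stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `Question:`/`Answer:` output format, the thresholds, and the token-overlap faithfulness proxy are all hypothetical, and the generator and scorers are passed in as plain callables so that any model could be plugged in.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    passage: str
    question: str
    answer: str


def prepare_content(raw_text: str, min_words: int = 5) -> List[str]:
    """Content Preparation: split raw text into passages and drop very short ones."""
    passages = [p.strip() for p in raw_text.split("\n\n")]
    return [p for p in passages if len(p.split()) >= min_words]


def parse_generation(passage: str, output: str) -> Optional[Example]:
    """Format filter: accept only 'Question: ...' followed by 'Answer: ...' (assumed format)."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    if len(lines) != 2:
        return None
    q_line, a_line = lines
    if not q_line.startswith("Question:") or not a_line.startswith("Answer:"):
        return None
    question = q_line[len("Question:"):].strip()
    answer = a_line[len("Answer:"):].strip()
    if not question or not answer:
        return None
    return Example(passage, question, answer)


def overlap_faithfulness(example: Example) -> float:
    """Toy faithfulness proxy (an assumption): share of answer tokens found in the passage."""
    passage_tokens = set(re.findall(r"\w+", example.passage.lower()))
    answer_tokens = re.findall(r"\w+", example.answer.lower())
    if not answer_tokens:
        return 0.0
    return sum(t in passage_tokens for t in answer_tokens) / len(answer_tokens)


def run_pipeline(
    raw_text: str,
    generate: Callable[[str], str],          # hypothetical LLM wrapper: passage -> raw output
    faithfulness: Callable[[Example], float],
    quality: Callable[[Example], float],
    min_faithfulness: float = 0.5,           # illustrative thresholds, not from the paper
    min_quality: float = 0.5,
) -> List[Example]:
    """Run Content Preparation, Generation, and Filtering in sequence."""
    kept = []
    for passage in prepare_content(raw_text):
        example = parse_generation(passage, generate(passage))  # Generation + format filter
        if example is None:
            continue
        if faithfulness(example) < min_faithfulness:             # faithfulness filter
            continue
        if quality(example) < min_quality:                       # quality filter
            continue
        kept.append(example)
    return kept
```

In practice the `faithfulness` and `quality` callables would wrap learned scoring models; the token-overlap function here only stands in so the sketch runs without external dependencies.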
**Experiments:**
- **LFQA:** Models trained on Genie-generated data match or outperform models trained on human-written data.
- **Summarization:** Similar results are observed, showing the method's generality.
- **Information Extraction:** Genie-generated data improves performance compared to the baseline.
**Discussion:**
- Genie is efficient and cost-effective, democratizing the creation of high-quality content-grounded datasets.
- The authors highlight the method's effectiveness on noisy crawled data and its potential for other tasks and domains.
**Related Work:**
- Genie differs from existing methods by focusing on content-grounded tasks and incorporating a filtering mechanism to ensure data quality.
**Conclusion:**
Genie provides a robust and flexible approach to generating high-quality content-grounded data, advancing the field of content-grounded generation tasks.