GENIE: ACHIEVING HUMAN PARITY IN CONTENT-GROUNDED DATASETS GENERATION

25 Jan 2024 | Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

The paper introduces Genie, a novel method for automatically generating high-quality content-grounded data, addressing the lack of such data for content-grounded generation tasks. Genie consists of three stages: Content Preparation, Generation, and Filtering. The method is used to create three large-scale synthetic datasets for Long-Form Question-Answering (LFQA), summarization, and information extraction. Human evaluations show that the generated data is natural and of high quality. Models trained on Genie-generated data perform on par with or outperform models trained on human-written data, and achieve better faithfulness scores. The method is also applied to create LFQA data in the medical domain, demonstrating its versatility.

**Key Contributions:**

1. **Genie Method:** A three-stage process for generating high-quality content-grounded data.
2. **Large-Scale Datasets:** Synthetic datasets for LFQA, summarization, and information extraction.
3. **Performance Comparison:** Models trained on Genie-generated data match or outperform models trained on human-generated data across various metrics.
4. **Domain Adaptation:** Genie can generate high-quality data for different domains, including the medical domain.

**Methodology:**

- **Content Preparation:** Extracting content from source data.
- **Generation:** Using a large language model to generate task-specific examples.
- **Filtering:** Ensuring data quality through format, faithfulness, and quality scoring.

**Experiments:**

- **LFQA:** Models trained on Genie-generated data match or outperform models trained on human-written data.
- **Summarization:** Similar results are observed, showing the method's generality.
- **Information Extraction:** Genie-generated data improves performance compared to the baseline.

**Discussion:**

- Genie is efficient and cost-effective, democratizing the creation of high-quality content-grounded datasets.
- The method's effectiveness on noisy crawled data and its potential for various tasks and domains are highlighted.

**Related Work:**

- Genie differs from existing methods by focusing on content-grounded tasks and incorporating a filtering mechanism to ensure data quality.

**Conclusion:** Genie provides a robust and flexible approach to generating high-quality content-grounded data, advancing the field of content-grounded generation tasks.
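
To make the three stages listed under Methodology more concrete, here is a minimal Python sketch of a prepare, generate, and filter loop for LFQA-style examples. It is an illustration under stated assumptions, not the authors' implementation: the function names, the prompt text, the `Question:`/`Answer:` output format, the thresholds, and the token-overlap faithfulness score are all hypothetical stand-ins. The language model is any callable from prompt to text, and a real pipeline would likely replace the faithfulness and quality scorers with learned models.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Example:
    content: str   # grounding passage
    question: str
    answer: str


def prepare_content(raw_document: str, min_words: int = 5) -> list[str]:
    """Content Preparation (assumed form): split on blank lines, drop short passages."""
    passages = [p.strip() for p in re.split(r"\n\s*\n", raw_document)]
    return [p for p in passages if len(p.split()) >= min_words]


def generate_example(passage: str, llm: Callable[[str], str]) -> Optional[Example]:
    """Generation plus format filter: returns None when the output is malformed."""
    prompt = f"Content: {passage}\nWrite a question and an answer grounded in the content."
    output = llm(prompt)
    match = re.search(r"Question:\s*(.+?)\s*Answer:\s*(.+)", output, re.DOTALL)
    if match is None:
        return None
    return Example(passage, match.group(1).strip(), match.group(2).strip())


def token_overlap_faithfulness(example: Example) -> float:
    """Crude stand-in score: fraction of answer tokens that occur in the content."""
    content_tokens = set(re.findall(r"\w+", example.content.lower()))
    answer_tokens = re.findall(r"\w+", example.answer.lower())
    if not answer_tokens:
        return 0.0
    return sum(t in content_tokens for t in answer_tokens) / len(answer_tokens)


def run_pipeline(
    raw_document: str,
    llm: Callable[[str], str],
    faithfulness: Callable[[Example], float] = token_overlap_faithfulness,
    quality: Callable[[Example], float] = lambda ex: 1.0,  # placeholder scorer
    min_faithfulness: float = 0.7,  # hypothetical threshold
    min_quality: float = 0.5,       # hypothetical threshold
) -> list[Example]:
    """Keep only examples that pass the format, faithfulness, and quality filters."""
    kept = []
    for passage in prepare_content(raw_document):
        example = generate_example(passage, llm)
        if example is None:
            continue
        if faithfulness(example) < min_faithfulness:
            continue
        if quality(example) < min_quality:
            continue
        kept.append(example)
    return kept


def stub_llm(prompt: str) -> str:
    """Deterministic fake model so the sketch runs without any external service."""
    if "Paris" in prompt:
        return "Question: What is the capital of France?\nAnswer: Paris is the capital of France."
    if "Mars" in prompt:
        return "Question: How many moons are there?\nAnswer: Jupiter has seventy nine moons"
    return "I cannot comply"


document = (
    "Paris is the capital of France and its largest city.\n\n"
    "Mars is the fourth planet from the Sun in our system.\n\n"
    "The river flows north through the old quiet valley."
)
kept = run_pipeline(document, stub_llm)
```

With the stub model, the first passage yields a grounded answer and is kept, the second yields an answer with no support in its passage and is dropped by the faithfulness filter, and the third yields malformed output and is dropped by the format filter. Passing the scorers as parameters keeps the filtering stage swappable without touching the rest of the loop.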