GENIE: ACHIEVING HUMAN PARITY IN CONTENT-GROUNDED DATASETS GENERATION

2024 | Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen
Genie is a method for automatically generating high-quality content-grounded data. It consists of three stages: content preparation, generation, and filtering. Content preparation extracts relevant information from raw data. Generation uses a large language model (LLM) to create task-specific examples, such as question-answer pairs or summaries. Filtering scores each example on format, faithfulness, and quality, so that only well-formed examples grounded in the content are kept.
Genie was used to generate three large-scale synthetic datasets for long-form question answering (LFQA), summarization, and information extraction. In a human evaluation, the generated data was found to be natural and of high quality. Models trained on Genie-generated data performed as well as or better than models trained on human-written data, particularly in terms of faithfulness.
The synthetic data was tested on several benchmarks, including ELI5, ASQA, and NQ, where models trained on it scored well on ROUGE, BERTScore, and reward model scores. The data was also found to be more faithful to the content and more lexically diverse.
Genie was also tested for domain adaptation by generating LFQA data for the medical domain. The resulting model outperformed models trained on other domains, and the synthetic data outperformed human-generated data in terms of faithfulness and performance on benchmark tasks. Genie is flexible, can generate synthetic data for various domains and content-grounded tasks, and is more cost- and time-efficient than traditional crowd-sourced data curation.
Overall, Genie provides a cost-effective and efficient method for generating high-quality content-grounded data, which can be used to train models that perform as well as or better than models trained on human-generated data. The method is generalizable and can be applied to various tasks and domains.
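The filtering stage described above (scoring each generated example on format, faithfulness, and quality) can be illustrated with a minimal Python sketch. This summary does not say how Genie computes these scores, so everything here is an assumption made for illustration: the `Example` fields, the token-overlap faithfulness score, the length-based quality score, and the 0.5 thresholds are hypothetical stand-ins, not the authors' implementation. In practice the two scorers would likely be replaced by learned models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    """One generated question-answer pair plus the content it is grounded in."""
    content: str
    question: str
    answer: str


def has_valid_format(ex: Example) -> bool:
    """Format check (assumed rule): non-empty fields and a question ending in '?'."""
    question = ex.question.strip()
    return bool(question) and bool(ex.answer.strip()) and question.endswith("?")


def token_overlap_faithfulness(ex: Example) -> float:
    """Toy stand-in for a faithfulness score: the fraction of answer
    tokens that also occur in the grounding content."""
    content_tokens = set(ex.content.lower().split())
    answer_tokens = ex.answer.lower().split()
    if not answer_tokens:
        return 0.0
    return sum(t in content_tokens for t in answer_tokens) / len(answer_tokens)


def length_quality(ex: Example) -> float:
    """Toy stand-in for a quality score: grows with answer length
    and saturates at 1.0 once the answer has five tokens."""
    return min(len(ex.answer.split()) / 5.0, 1.0)


def filter_examples(
    examples: Iterable[Example],
    faithfulness: Callable[[Example], float] = token_overlap_faithfulness,
    quality: Callable[[Example], float] = length_quality,
    min_faithfulness: float = 0.5,  # hypothetical threshold
    min_quality: float = 0.5,       # hypothetical threshold
) -> List[Example]:
    """Keep only examples that pass the format check and both score thresholds."""
    kept = []
    for ex in examples:
        if not has_valid_format(ex):
            continue
        if faithfulness(ex) < min_faithfulness:
            continue
        if quality(ex) < min_quality:
            continue
        kept.append(ex)
    return kept
```

Passing the scorers in as callables keeps the filtering logic independent of how faithfulness and quality are actually measured, so a stronger scorer can be swapped in without changing `filter_examples`.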