26 Mar 2024 | Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Shiwen Ni, Ge Zhang, Jie Fu, Min Yang
This paper introduces COIG-CQIA, a high-quality Chinese instruction-tuning dataset designed to provide the Chinese NLP community with human-interaction-aligned fine-tuning data. The dataset is constructed from diverse Chinese internet sources, including Q&A forums, encyclopedic sites, content-creation platforms, and examinations, and the data undergoes rigorous filtering and manual review to ensure quality, diversity, and relevance. It covers a wide range of tasks, including information extraction, question answering, and code generation. The paper also explores the impact of different data sources on model performance and evaluates the effectiveness of COIG-CQIA through benchmark tests and human evaluations. The results show that models fine-tuned on COIG-CQIA achieve competitive results in human assessment as well as on knowledge and safety benchmarks. The dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA. The paper additionally reviews related work on instruction-tuning datasets and data mixtures for SFT, highlighting the importance of data quality and diversity in instruction tuning. The findings suggest that COIG-CQIA is a valuable resource for the Chinese NLP community, offering a comprehensive, high-quality dataset for instruction fine-tuning.