26 Mar 2024 | Yuelin Bai*, Xinrun Du*, Yiming Liang*, Yonggang Jin*, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhua Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, Ge Zhang
The paper introduces COIG-CQIA, a high-quality Chinese instruction fine-tuning dataset designed to align with human interactions and improve the performance of large language models (LLMs) in handling Chinese instructions. The dataset is curated from various sources on the Chinese internet, including Q&A communities, wikis, examinations, and existing NLP datasets, ensuring diversity, quality, and relevance. The authors collect and process a large corpus of human-written content, which is then used to train models of different scales. The paper also explores the impact of different data sources on model performance and provides insights into selecting training data from the Chinese internet. Experimental results show that models fine-tuned on COIG-CQIA achieve competitive performance in human assessments and on benchmarks, making it a valuable resource for the Chinese NLP community. The dataset is available at <https://huggingface.co/datasets/m-a-p/COIG-CQIA>.