14 Mar 2024 | Zixuan Li, Yutao Zeng, Yuxin Zuo, Weicheng Ren, Wenxuan Liu, Miao Su, Yucan Guo, Yantao Liu, Xiang Li, Zhilei Hu, Long Bai, Wei Li, Yidan Liu, Pan Yang, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
KnowCoder is a Large Language Model (LLM) that performs Universal Information Extraction (UIE) through code generation. It introduces a code-style schema representation that uniformly transforms different schemas into Python classes, so that complex schema information, such as constraints among tasks, is expressed in a form that LLMs can understand and follow when extracting structured knowledge. The representation uses class inheritance, class comments, type hints, and class methods to model concept taxonomies, constraints, and extraction requirements. Using this representation, a code-style schema library is built from Wikidata. It covers over 30,000 types of knowledge (29,177 entity types, 876 relation types, and 519 event types) and is reported as the largest for UIE.
KnowCoder is trained with a two-phase learning framework: a schema understanding phase based on code pretraining, followed by a schema following phase based on instruction tuning. After pretraining on 1.5B of automatically generated data, KnowCoder achieves a 49.8% relative improvement in F1 score over LLaMA2 under few-shot settings. After instruction tuning, it achieves improvements of up to 12.5% under zero-shot settings and up to 21.9% under low-resource settings. The model also benefits from human-annotated datasets, with improvements of up to 7.5% under supervised settings. Across these settings and a range of IE tasks, KnowCoder generalizes well to unseen schemas. The model's code, training data, and schema library are released for future research.
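The code-style schema representation described above can be illustrated with a minimal sketch. This is not taken from the released schema library: the class names (`Entity`, `Person`, `Location`, `Relation`, `PlaceOfBirth`), the `normalize` method, and the `example_extraction` helper are hypothetical, chosen only to show how inheritance, class comments, type hints, and methods could carry taxonomy, constraint, and extraction-requirement information.

```python
class Entity:
    """Base class for all entity types (hypothetical root of the taxonomy)."""

    def __init__(self, name: str):
        self.name = self.normalize(name)

    @staticmethod
    def normalize(name: str) -> str:
        """An extraction requirement written as a method: collapse stray whitespace."""
        return " ".join(name.split())


class Person(Entity):
    """A human being. The class comment carries the type description."""


class Location(Entity):
    """A geographic place."""


class Relation:
    """Base class for all relation types (hypothetical)."""

    def __init__(self, head: Entity, tail: Entity):
        self.head = head
        self.tail = tail


class PlaceOfBirth(Relation):
    """Links a person to the location where they were born."""

    def __init__(self, head: Person, tail: Location):
        # The type hints state the constraint between the relation and entity
        # types; this sketch also enforces it at runtime (an added assumption).
        if not isinstance(head, Person) or not isinstance(tail, Location):
            raise TypeError("PlaceOfBirth expects (Person, Location)")
        super().__init__(head, tail)


def example_extraction():
    """A possible code-form answer for 'Ada Lovelace was born in London.'"""
    ada = Person("Ada Lovelace")
    london = Location("London")
    return [ada, london, PlaceOfBirth(ada, london)]
```

In this sketch, inheritance (`Person(Entity)`) encodes the concept taxonomy, the `__init__` type hints encode which entity types a relation accepts, and the extraction result is a list of instantiated objects rather than free text.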