KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

14 Mar 2024 | Zixuan Li*, Yutao Zeng*, Yuxin Zuo*, Weicheng Ren*, Wenxuan Liu, Miao Su, Yucan Guo, Yantao Liu, Xiang Li, Zhilei Hu, Long Bai, Wei Li, Yidan Liu, Pan Yang, Xiaolong Jin*, Jiafeng Guo*, Xueqi Cheng
This paper introduces KnowCoder, a Large Language Model (LLM) designed for Universal Information Extraction (UIE) through code generation. KnowCoder pursues two goals: a unified schema representation that LLMs can easily understand, and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately.

To achieve this, KnowCoder introduces a code-style schema representation method that transforms different schemas into Python classes, capturing complex schema information in an LLM-friendly manner. The authors construct a code-style schema library covering over 30,000 types of knowledge, which they describe as the largest such library for UIE.

The learning framework consists of two phases: schema understanding and schema following. After code pretraining on around 1.5 billion automatically constructed data, KnowCoder achieves a 49.8% relative F1 score improvement over LLaMA2 under the few-shot setting. Further instruction tuning on 1.5 billion automatically annotated data enhances KnowCoder's generalization ability, achieving improvements of up to 12.5% and 21.9% under the zero-shot and low-resource settings, respectively. Additionally, KnowCoder can be refined using various human-annotated datasets, achieving an improvement of up to 7.5% under the supervised setting.

**Contributions:**

1. A code-style schema representation method for uniform schema representation.
2. An effective two-phase learning framework for schema understanding and following.
3. Superior performance on different IE tasks under various evaluation settings.

**Experiments:**

- **Schema Representation:** KnowCoder uses Python classes to represent concepts, instances, and constraints, which helps LLMs understand and follow schemas.
- **Learning Framework:** The two phases, schema understanding and schema following, improve LLMs' ability to understand and follow schemas.
- **Performance:** KnowCoder outperforms baselines in NER, RE, and ED tasks under zero-shot, low-resource, and supervised settings.

**Limitations:**

- The schema library is primarily constructed from Wikidata, which may lack definitions or relevant information for some schemas.

**Conclusion:** KnowCoder combines a code-style schema representation with a two-phase learning framework to enhance LLMs' ability to perform UIE tasks. It shows improvements across the few-shot, zero-shot, low-resource, and supervised settings, making it a promising approach for structured knowledge extraction.
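To make the idea of a code-style schema concrete, the following is a minimal sketch of how concepts, constraints, and instances could be written as Python classes. It is an illustration under stated assumptions, not the paper's actual schema library: the class names (`Entity`, `Relation`, `Person`, `Location`, `PlaceOfBirth`), the `head_type`/`tail_type` attributes, the runtime type check, and the example sentence are all hypothetical.

```python
class Entity:
    """Base concept for all entity types (hypothetical name)."""

    def __init__(self, name: str):
        self.name = name

    def __repr__(self) -> str:
        return f'{type(self).__name__}("{self.name}")'


class Person(Entity):
    """Description: a human being."""


class Location(Entity):
    """Description: a geographic place."""


class Relation:
    """Base concept for all relation types (hypothetical name)."""

    # Constraints: the entity types allowed as head and tail arguments.
    head_type = Entity
    tail_type = Entity

    def __init__(self, head: Entity, tail: Entity):
        # Enforce the schema constraint when an instance is created.
        if not isinstance(head, self.head_type):
            raise TypeError(f"head must be a {self.head_type.__name__}")
        if not isinstance(tail, self.tail_type):
            raise TypeError(f"tail must be a {self.tail_type.__name__}")
        self.head = head
        self.tail = tail

    def __repr__(self) -> str:
        return f"{type(self).__name__}({self.head!r}, {self.tail!r})"


class PlaceOfBirth(Relation):
    """Description: the location where a person was born."""

    head_type = Person
    tail_type = Location


# Extraction output for the illustrative sentence
# "Ada Lovelace was born in London.", written as instantiation code.
results = [
    Person("Ada Lovelace"),
    Location("London"),
    PlaceOfBirth(Person("Ada Lovelace"), Location("London")),
]
print(results)
```

In this sketch, class inheritance carries the type hierarchy, docstrings carry type descriptions, and the argument types carry the constraints between types, so an extraction result that violates the schema (for example, a `Location` as the head of `PlaceOfBirth`) fails at construction time.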