Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

5 Apr 2024 | Bowen Zhang and Harold Soh
This paper introduces a three-phase framework called Extract-Define-Canonicalize (EDC) for knowledge graph construction (KGC) using large language models (LLMs). The framework addresses the challenge of scaling KGC to real-world text by decomposing the process into three phases: open information extraction, schema definition, and schema canonicalization. EDC is flexible: it can be applied whether or not a pre-defined schema is available, and when none is provided it automatically constructs a schema and performs self-canonicalization. To improve performance, the authors introduce a trained Schema Retriever, which retrieves schema elements relevant to the input text and enhances the LLM's extraction capabilities in a retrieval-augmented-generation-like manner.

The framework is evaluated on three KGC benchmarks: WebNLG, REBEL, and Wiki-NRE. Results show that EDC outperforms state-of-the-art methods in both the Target Alignment and Self Canonicalization settings, and that the Schema Retriever significantly improves EDC's performance by providing relevant schema elements during the extraction phase. EDC extracts high-quality triplets without parameter tuning and with significantly larger schemas than prior works. It also reduces redundancy and ambiguity in the resulting knowledge graph, making it more useful for downstream tasks.

The paper makes three main contributions: a flexible and performant LLM-based framework for KGC, a trained Schema Retriever for extracting relevant schema components, and empirical evidence demonstrating the effectiveness of EDC and the Schema Retriever.
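To make the three phases concrete, here is a minimal sketch of an EDC-style pipeline. It is not the authors' implementation: `call_llm` is a placeholder for any chat-completion API, the prompts and output parsing are illustrative assumptions, and an off-the-shelf sentence encoder stands in for the paper's trained Schema Retriever.

```python
# Illustrative sketch of an Extract-Define-Canonicalize (EDC) pipeline (not the paper's code).
from sentence_transformers import SentenceTransformer, util


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM of your choice and return its text reply."""
    raise NotImplementedError


# Phase 1 -- Extract: open information extraction into (subject, relation, object) triplets.
def extract(text: str) -> list[tuple[str, str, str]]:
    reply = call_llm(
        "List (subject | relation | object) triplets, one per line, for the text:\n" + text
    )
    triplets = []
    for line in reply.splitlines():
        parts = [p.strip(" ()") for p in line.split("|")]
        if len(parts) == 3:
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets


# Phase 2 -- Define: have the LLM write a natural-language definition for each extracted relation.
def define(triplets: list[tuple[str, str, str]]) -> dict[str, str]:
    relations = {rel for _, rel, _ in triplets}
    return {rel: call_llm(f"Define the relation '{rel}' in one sentence.") for rel in relations}


# Phase 3 -- Canonicalize: map each relation to the closest schema element by comparing
# definition embeddings, then let the LLM veto mappings that would change the meaning.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the trained Schema Retriever


def canonicalize(definitions: dict[str, str], schema: dict[str, str]) -> dict[str, str]:
    names = list(schema)
    schema_emb = encoder.encode([schema[n] for n in names], convert_to_tensor=True)
    mapping = {}
    for rel, definition in definitions.items():
        scores = util.cos_sim(encoder.encode(definition, convert_to_tensor=True), schema_emb)[0]
        candidate = names[int(scores.argmax())]
        verdict = call_llm(
            f"Does replacing '{rel}' ({definition}) with '{candidate}' preserve the meaning? yes/no"
        )
        mapping[rel] = candidate if verdict.strip().lower().startswith("yes") else rel
    return mapping
```

In the self-canonicalization setting described above, `schema` would start empty and grow as newly defined relations are added, so later extractions are canonicalized against the schema the pipeline has built so far.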