Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

5 Apr 2024 | Bowen Zhang and Harold Soh
This paper introduces a three-phase framework called Extract-Define-Canonicalize (EDC) for knowledge graph construction (KGC) using large language models (LLMs). The framework addresses the challenge of scaling KGC to real-world text by decomposing the process into three phases: open information extraction, schema definition, and schema canonicalization. EDC is flexible: it can be applied whether or not a pre-defined schema is available, and when none is provided it automatically constructs a schema and performs self-canonicalization. To improve performance, the authors introduce a trained Schema Retriever, which retrieves schema elements relevant to the input text and enhances the LLM's extraction capabilities in a retrieval-augmented-generation-like manner.

The framework is evaluated on three KGC benchmarks: WebNLG, REBEL, and Wiki-NRE. Results show that EDC outperforms state-of-the-art methods in both the Target Alignment and Self Canonicalization settings, and that the Schema Retriever significantly improves EDC's performance by providing relevant schema elements during the extraction phase. EDC extracts high-quality triplets without parameter tuning and with significantly larger schemas than prior works. It also reduces redundancy and ambiguity in the resulting knowledge graph, making it more useful for downstream tasks.

The paper makes three main contributions: a flexible and performant LLM-based framework for KGC, a trained Schema Retriever for extracting relevant schema components, and empirical evidence demonstrating the effectiveness of EDC and the Schema Retriever.
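To make the three phases concrete, here is a minimal sketch of an EDC-style pipeline. It is not the authors' implementation: `call_llm` is a placeholder for any chat-completion API, the prompts and output parsing are illustrative assumptions, and an off-the-shelf sentence encoder stands in for the paper's trained Schema Retriever.

```python
# Illustrative sketch of an Extract-Define-Canonicalize (EDC) pipeline (not the paper's code).
from sentence_transformers import SentenceTransformer, util


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM of your choice and return its text reply."""
    raise NotImplementedError


# Phase 1 -- Extract: open information extraction into (subject, relation, object) triplets.
def extract(text: str) -> list[tuple[str, str, str]]:
    reply = call_llm(
        "List (subject | relation | object) triplets, one per line, for the text:\n" + text
    )
    triplets = []
    for line in reply.splitlines():
        parts = [p.strip(" ()") for p in line.split("|")]
        if len(parts) == 3:
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets


# Phase 2 -- Define: have the LLM write a natural-language definition for each extracted relation.
def define(triplets: list[tuple[str, str, str]]) -> dict[str, str]:
    relations = {rel for _, rel, _ in triplets}
    return {rel: call_llm(f"Define the relation '{rel}' in one sentence.") for rel in relations}


# Phase 3 -- Canonicalize: map each relation to the closest schema element by comparing
# definition embeddings, then let the LLM veto mappings that would change the meaning.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the trained Schema Retriever


def canonicalize(definitions: dict[str, str], schema: dict[str, str]) -> dict[str, str]:
    names = list(schema)
    schema_emb = encoder.encode([schema[n] for n in names], convert_to_tensor=True)
    mapping = {}
    for rel, definition in definitions.items():
        scores = util.cos_sim(encoder.encode(definition, convert_to_tensor=True), schema_emb)[0]
        candidate = names[int(scores.argmax())]
        verdict = call_llm(
            f"Does replacing '{rel}' ({definition}) with '{candidate}' preserve the meaning? yes/no"
        )
        mapping[rel] = candidate if verdict.strip().lower().startswith("yes") else rel
    return mapping
```

In the self-canonicalization setting described above, `schema` would start empty and grow as newly defined relations are added, so later extractions are canonicalized against the schema the pipeline has built so far.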