**ReMatch: Retrieval Enhanced Schema Matching with LLMs**
**Authors:** Eitam Sheetrit
**Abstract:** Schema matching is a critical task in data integration, aiming to align source and target schemas to establish element correspondence. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. While machine-learning-based solutions have been explored, they often suffer from low accuracy, require manual mapping for model training, or need access to source schema data, which may be unavailable due to privacy concerns. This paper introduces ReMatch, a novel method that uses retrieval-enhanced Large Language Models (LLMs) to match schemas without predefined mapping, model training, or access to source database data. Experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher, making it a viable solution for real-world scenarios.
**Key Contributions:**
1. Introduces a new method for schema matching that allows for scalable and accurate results without model training or access to labeled data.
2. Proposes a mechanism to reduce the search space of the target schema for efficient candidate generation.
3. Exploits the generative abilities of LLMs to perform semantic ranking between two schemas, aligning with human matchers.
4. Provides a complete mapping between two healthcare schemas, addressing the lack of real-world evaluation datasets in the field.
**Background and Related Work:**
- **Large Language Models (LLMs):** LLMs have shown significant advancements in various tasks requiring deep semantic understanding, including schema matching.
- **Schema Matching:** Traditional manual methods are time-consuming and error-prone, which has led to automated methods based on ML algorithms. Recent work leverages NLP to improve semantic mappings, but still depends on extensive data tagging.
**Method Description:**
- **Problem Statement:** Given a source schema and a target schema, the task is to find a mapping between their elements.
- **Method:** ReMatch represents target schema tables and source schema attributes as structured documents, uses a text embedding model to retrieve the target tables most semantically similar to each source attribute, and then employs an LLM to select the most similar target attributes from the retrieved tables.
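The retrieval stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real text embedding model, the document templates in `table_document` and `attribute_document` are hypothetical, and `build_prompt` only assembles the text an LLM would be asked to rank (no model is called). The sample schemas are illustrative.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy bag-of-words vector; a stand-in for a real text embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def table_document(name, description, columns):
    # Hypothetical template: one structured document per target table.
    cols = "; ".join(f"{c}: {d}" for c, d in columns.items())
    return f"Table: {name}. Description: {description}. Columns: {cols}"


def attribute_document(table, column, description):
    # Hypothetical template: one structured document per source attribute.
    return f"Attribute: {column}. Table: {table}. Description: {description}"


def retrieve_tables(attr_doc, table_docs, top_j):
    # Keep only the top_j target tables most similar to the source attribute,
    # which shrinks the search space handed to the LLM.
    query = embed(attr_doc)
    ranked = sorted(
        table_docs,
        key=lambda name: cosine(query, embed(table_docs[name])),
        reverse=True,
    )
    return ranked[:top_j]


def build_prompt(attr_doc, table_docs, candidates, k):
    # Assemble the ranking request; sending it to an LLM is left out here.
    context = "\n".join(table_docs[name] for name in candidates)
    return (
        f"Source attribute:\n{attr_doc}\n\n"
        f"Candidate target tables:\n{context}\n\n"
        f"List the {k} target attributes most similar to the source attribute."
    )


# Illustrative data only.
target_docs = {
    "person": table_document(
        "person",
        "demographic information about each patient",
        {"person_id": "unique patient identifier",
         "birth_datetime": "date and time of birth"},
    ),
    "drug_exposure": table_document(
        "drug_exposure",
        "records of drugs given to a person",
        {"drug_concept_id": "drug identifier", "quantity": "amount of drug"},
    ),
}
source_attr = attribute_document("patients", "dob", "date of birth of the patient")

candidates = retrieve_tables(source_attr, target_docs, top_j=1)
prompt = build_prompt(source_attr, target_docs, candidates, k=3)
```

Splitting the work this way means the LLM only sees the retrieved tables rather than the whole target schema, which is what keeps the prompt small when the target schema is large.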
**Evaluation:**
- **Dataset Creation:** Two primary datasets, MIMIC-III to OMOP and OMOP Benchmark Synthea Dataset, were used for evaluation.
- **Experiments:** ReMatch and SMAT (a previous state-of-the-art method) were evaluated on both datasets using accuracy@K metrics.
- **Results:** ReMatch outperformed SMAT, especially on the more challenging MIMIC dataset, demonstrating its robustness and effectiveness.
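For readers unfamiliar with accuracy@K, the sketch below assumes the common definition: the fraction of source attributes whose ground-truth target attribute appears among the top K ranked candidates. The summary does not spell out the exact formula used in the paper, so treat this as an assumption. The function name `accuracy_at_k` and the sample rankings are hypothetical and are not results from the evaluation.

```python
def accuracy_at_k(ranked, gold, k):
    # ranked: source attribute -> list of predicted targets, best first.
    # gold:   source attribute -> the single correct target attribute.
    if not gold:
        return 0.0
    hits = sum(1 for src, tgt in gold.items() if tgt in ranked.get(src, [])[:k])
    return hits / len(gold)


# Illustrative predictions and ground truth, not taken from the paper.
ranked = {
    "patients.dob": ["person.birth_datetime", "person.person_id"],
    "patients.gender": ["person.person_id", "person.gender_concept_id"],
    "admissions.admittime": ["visit_occurrence.visit_start_datetime"],
    "labevents.value": ["observation.value_as_string", "note.note_text"],
}
gold = {
    "patients.dob": "person.birth_datetime",
    "patients.gender": "person.gender_concept_id",
    "admissions.admittime": "visit_occurrence.visit_start_datetime",
    "labevents.value": "measurement.value_as_number",
}

acc_at_1 = accuracy_at_k(ranked, gold, k=1)  # 2 of 4 correct at rank 1
acc_at_2 = accuracy_at_k(ranked, gold, k=2)  # 3 of 4 correct within top 2
```

Reporting several values of K reflects the assistive setting: a human matcher reviewing a short ranked list benefits even when the correct attribute is not ranked first.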
**Conclusions:**
ReMatch is a scalable and effective method for schema matching using retrieval-enhanced LLMs, designed to complement and aid human matchers. Future work will focus on incorporating type constraints, foreign keys, and primary keys, as well as enhancing guidance mechanisms and enriching table and column descriptions.