ReMatch: Retrieval Enhanced Schema Matching with LLMs

2024-05-30 | Eitam Sheetrit, Menachem Brief, Moshik Mishaeli, Oren Elisha
ReMatch is a novel method for schema matching based on retrieval-enhanced large language models (LLMs). It addresses the main challenges of schema matching: textual and semantic heterogeneity between schemas, and differences in schema size. Unlike previous methods, ReMatch requires no predefined mapping, no model training, and no access to the data in the source database. Instead, it uses the generative capabilities of LLMs to perform semantic ranking between the two schemas, mirroring the way a human approaches the matching task. The method has three steps: it transforms schema tables and attributes into structured documents, retrieves the relevant documents, and uses an LLM to select the most similar target attributes.
The method was evaluated on two datasets, MIMIC and Synthea, and achieves high accuracy@K. For MIMIC, the best balance points were (J=1, K=1) and (J=2, K=5); for Synthea, the optimal results were achieved without retrieval. The ablation study showed that skipping the retrieval step led to inferior results, while adding guidance improved performance significantly. Compared with SMAT, a previous state-of-the-art non-LLM model, ReMatch performed better on both datasets, especially in realistic splits where only a small portion of the mappings was available. Its results were also more robust, and it achieved higher accuracy@5 on the Synthea dataset. The work also contributes a new large dataset for schema matching to support further research. As future work, the authors plan to incorporate type constraints, foreign keys, and primary keys, along with enhanced guidance mechanisms and enriched table and column descriptions.
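The three steps and the accuracy@K metric described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary does not say which retriever ReMatch uses, so a plain bag-of-words cosine similarity stands in for it here. The sketch also reads J as the number of retrieved target-table documents and K as the number of candidates requested from the LLM, which is an interpretation of the summary rather than something it states. All table names, column names, descriptions, and function names are hypothetical, and the LLM call itself is left out: the sketch only builds the prompt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def table_to_document(table, columns):
    """Step 1: render a table and its attributes as one structured text document."""
    lines = [f"Table: {table}"]
    lines += [f"- Column: {name}. Description: {desc}" for name, desc in columns.items()]
    return "\n".join(lines)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_j(query, documents, j):
    """Step 2: return the names of the J documents most similar to the query.

    Assumption: bag-of-words cosine is a stand-in for the actual retriever.
    """
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda name: cosine(q, Counter(tokenize(documents[name]))),
        reverse=True,
    )
    return ranked[:j]


def build_prompt(source_attribute, candidate_documents, k):
    """Step 3: assemble a prompt asking an LLM for the K most similar target attributes."""
    context = "\n\n".join(candidate_documents)
    return (
        f"Source attribute:\n{source_attribute}\n\n"
        f"Candidate target tables:\n{context}\n\n"
        f"List the top {k} target attributes most similar to the source attribute, "
        "most similar first."
    )


def accuracy_at_k(ranked, gold, k):
    """Fraction of source attributes whose true match is among the first K candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for attr, truth in gold.items() if truth in ranked.get(attr, [])[:k])
    return hits / len(gold)


# Hypothetical target schema, for illustration only.
target_schema = {
    "person": {
        "person_id": "unique identifier of the patient",
        "birth_datetime": "date and time of birth of the patient",
        "gender_concept_id": "gender of the patient",
    },
    "visit_occurrence": {
        "visit_start_date": "date the hospital visit started",
        "visit_end_date": "date the hospital visit ended",
    },
    "drug_exposure": {
        "drug_concept_id": "identifier of the drug",
        "quantity": "amount of drug dispensed",
    },
}
documents = {name: table_to_document(name, cols) for name, cols in target_schema.items()}

# A hypothetical source attribute, rendered in the same document format.
source_attribute = "Table: patients\n- Column: dob. Description: date of birth of the patient"
retrieved = retrieve_top_j(source_attribute, documents, j=1)
prompt = build_prompt(source_attribute, [documents[name] for name in retrieved], k=5)
```

With J=1, the prompt contains only the single best-matching target table, which keeps the LLM context small at the risk of excluding the correct table. A larger J trades a longer prompt for a better chance that the true match is among the candidates.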