Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation


12 Feb 2024 | Federico Ranaldi, Elena Sofia Ruzzetti, Dario Onorati, Leonardo Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli, Fabio Massimo Zanzotto
This paper investigates the impact of data contamination on the performance of GPT-3.5 in Text-to-SQL tasks. Data contamination refers to a model having been exposed, during training, to parts of a dataset that is later used for evaluation. The study introduces a novel method to detect data contamination in GPT models and compares GPT-3.5's performance on Spider, a widely used Text-to-SQL benchmark, with its performance on Termite, a newly created dataset designed to be unseen and absent from pre-training data.

GPT-3.5 performs significantly better on Spider than on Termite, suggesting that data contamination inflates its benchmark results. The model is also more resistant to adversarial table disconnection (ATD) perturbation on leaked data than on unseen data.

To measure contamination directly, the paper proposes a new metric, DC-accuracy, which evaluates the model's ability to reconstruct masked database dumps. GPT-3.5 achieves higher DC-accuracy on Spider than on Termite, indicating contamination. The study further finds that Text-to-SQL performance decreases as query difficulty increases, with the largest drop observed on EXTRA-HARD queries.

The study concludes that data contamination plays a significant role in GPT-3.5's Text-to-SQL performance: the model's results are strongly influenced by prior knowledge of the test data, and contamination can lead to an overestimation of its capabilities.
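The summary does not spell out how DC-accuracy is computed. As an illustration only, here is a minimal sketch of the masked-dump idea: mask a fraction of tokens in a database dump, ask the model to fill them in, and score the fraction reconstructed exactly. The tokenizer, mask rate, and function names (`mask_dump`, `dc_accuracy`) are assumptions, not the paper's implementation.

```python
import random
import re

def mask_dump(dump: str, mask_rate: float = 0.3, seed: int = 0):
    """Replace a fraction of word tokens in a DB dump with [MASK].

    Returns the masked dump and a list of (token_index, original_token)
    pairs to score the model's reconstruction against.
    """
    tokens = re.findall(r"\w+|\W", dump)  # crude word/non-word tokenization
    candidates = [i for i, t in enumerate(tokens) if re.fullmatch(r"\w+", t)]
    rng = random.Random(seed)
    k = max(1, int(mask_rate * len(candidates)))
    masked_idx = sorted(rng.sample(candidates, k))
    gold = [(i, tokens[i]) for i in masked_idx]
    for i in masked_idx:
        tokens[i] = "[MASK]"
    return "".join(tokens), gold

def dc_accuracy(gold, predictions):
    """Fraction of masked tokens the model reconstructed exactly."""
    correct = sum(1 for (_, g), p in zip(gold, predictions) if g == p)
    return correct / len(gold)
```

A contaminated model that has memorized the dump would recover most masked tokens, yielding a high DC-accuracy; on a genuinely unseen dump its guesses should be much less accurate.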
The study also finds that adversarial table disconnection has a more pronounced effect on Termite than on Spider, indicating that the model's performance is more vulnerable to perturbation on unseen data. This highlights the importance of developing new datasets that are not used in pre-training to ensure fair evaluation of LLMs in Text-to-SQL tasks.
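The summary does not reproduce the ATD procedure itself. As an illustration only, here is a minimal sketch under the assumption that ATD breaks name-based joins between tables by renaming a foreign-key column in the referencing table; the function name `disconnect_tables` and the example schema are hypothetical.

```python
def disconnect_tables(schema: dict, fk: tuple, new_name: str) -> dict:
    """Rename a foreign-key column in the referencing table so the
    name-based link to the referenced table's key disappears.

    schema: {table_name: [column_names]}
    fk:     (referencing_table, foreign_key_column)
    """
    table, column = fk
    out = {t: list(cols) for t, cols in schema.items()}  # copy, don't mutate
    out[table] = [new_name if c == column else c for c in out[table]]
    return out
```

Under this perturbation, a model relying on memorized schemas (e.g. Spider's) could still produce the original join, while a model genuinely reading the schema would have to notice that the matching column name is gone.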