12 Feb 2024 | Federico Ranaldi, Elena Sofia Ruzzetti, Dario Onorati, Leonardo Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli, Fabio Massimo Zanzotto
This paper investigates the impact of data contamination on the performance of GPT-3.5 in text-to-SQL translation tasks. Data contamination refers to the scenario in which a model has been exposed to, or trained on, parts of a dataset that are later used for evaluation. The study introduces a novel method to detect data contamination and examines GPT-3.5's performance on the well-known Spider dataset and on a new, unfamiliar dataset called Termite. The model is also evaluated on databases with modified information via an adversarial table disconnection (ATD) approach, which complicates the task by removing structural information from the database.
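To make the ATD idea concrete, below is a minimal sketch of how such a disconnection could be implemented, assuming it amounts to stripping FOREIGN KEY constraints from a schema before the schema is shown to the model. The function name, regular expressions, and example schema are illustrative assumptions, not the authors' code.

```python
import re

def adversarial_table_disconnection(schema_sql: str) -> str:
    """Drop FOREIGN KEY constraints so tables appear structurally
    disconnected to the model. (Illustrative sketch; the paper's actual
    ATD procedure may differ in detail.)"""
    # Remove table-level constraints, e.g.:
    #   FOREIGN KEY (col) REFERENCES other_table(col)
    schema_sql = re.sub(
        r",?\s*FOREIGN KEY\s*\([^)]*\)\s*REFERENCES\s+\w+\s*\([^)]*\)",
        "",
        schema_sql,
        flags=re.IGNORECASE,
    )
    # Remove inline column-level references, e.g. "col INT REFERENCES t(id)"
    schema_sql = re.sub(
        r"\s+REFERENCES\s+\w+\s*\([^)]*\)",
        "",
        schema_sql,
        flags=re.IGNORECASE,
    )
    return schema_sql

# Hypothetical Spider-style schema used only to demonstrate the transform:
schema = """
CREATE TABLE singer (
    singer_id INT PRIMARY KEY,
    name TEXT
);
CREATE TABLE concert (
    concert_id INT PRIMARY KEY,
    singer_id INT,
    FOREIGN KEY (singer_id) REFERENCES singer(singer_id)
);
"""

print(adversarial_table_disconnection(schema))
```

After the transform, the concert table no longer declares its link to singer, so a model that relies on memorized schemas rather than the structural cues in the prompt loses that shortcut.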
The results show a significant performance drop for GPT-3.5 on the Termite dataset, even with ATD modifications, highlighting the effect of data contamination on LLMs in text-to-SQL translation tasks. The study concludes that data contamination leads to an overestimation of GPT-3.5's performance and argues for a thorough reexamination of current LLM benchmarks, as well as the development of public datasets that remain outside LLMs' pretraining data.
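The contamination-detection side can be illustrated with a simple completion probe: if the model reproduces the held-out half of a benchmark example verbatim, that example was plausibly seen during pretraining. The sketch below shows one common family of memorization checks, not necessarily the detection method proposed in the paper; the prompt wording and the example split are assumptions.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def completion_probe(question_prefix: str, expected_suffix: str) -> bool:
    """Ask the model to continue a benchmark example word for word.
    Exact reproduction of the held-out suffix is evidence that the
    example appeared in the pretraining data. (Illustrative probe,
    not the paper's exact method.)"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    "Complete the following sentence from a public "
                    "dataset, word for word:\n" + question_prefix
                ),
            }
        ],
        temperature=0,  # deterministic output makes matching meaningful
    )
    completion = response.choices[0].message.content.strip()
    return expected_suffix.lower() in completion.lower()

# Hypothetical usage with a Spider-style question split in half:
# completion_probe("How many singers do we", "have")
```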