Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation


6 Mar 2024 | Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, Hangyu Mao
This paper presents a comprehensive evaluation of the Text-to-SQL capability of Large Language Models (LLMs), benchmarking their performance across the sub-tasks of an LLM-based Text-to-SQL pipeline. The study addresses the lack of a standardized benchmark for evaluating LLMs on Text-to-SQL, which has hindered assessment of their cognitive abilities and the optimization of LLM-based solutions. To this end, the authors construct a new dataset, "BigTable-0.2k", to mitigate the risk that models have overfit to existing benchmarks, and evaluate five distinct tasks: Text-to-SQL, SQL Debugging, SQL Optimization, Schema Linking, and SQL-to-Text. Both general-purpose and coding-specific LLMs are evaluated on each task, their performance disparities are analyzed, and optimal in-context learning solutions are proposed for each task.

For the core Text-to-SQL task, coding-specific models such as SQLCoder and CodeLlama outperform general-purpose models, although general-purpose models like InternLM and InternLM2 reach comparable performance without any fine-tuning for coding. Among the prompt templates compared, the "SimpleDDL-MD-Chat" template consistently outperforms the others.

For SQL Debugging, supplying detailed error information and corresponding annotations significantly enhances the models' ability to correct faulty queries. Multi-round self-debugging improves performance in the initial rounds but yields only marginal gains afterwards, suggesting that 1-2 rounds of debugging are optimal.
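The paper's exact prompt wording and debugging procedure are not reproduced here, but the following minimal sketch shows what a SimpleDDL-MD-Chat-style prompt combined with a bounded self-debugging loop could look like. The template text, the generic `llm` callable, and the use of SQLite for the execution check are illustrative assumptions, not the authors' implementation.

```python
import sqlite3


def simpleddl_md_chat_prompt(ddl: str, question: str) -> str:
    """Assemble a prompt in the spirit of the SimpleDDL-MD-Chat template:
    simplified DDL as schema context, Markdown-style section headers, and a
    chat-style instruction. The exact wording is an illustrative assumption."""
    return (
        "### Database schema\n"
        f"{ddl}\n\n"
        "### Question\n"
        f"{question}\n\n"
        "### Instruction\n"
        "Write one SQLite query that answers the question. Return only the SQL."
    )


def generate_with_self_debug(llm, db_path: str, ddl: str, question: str,
                             max_rounds: int = 2) -> str:
    """Generate SQL, then let the model repair it from the raised error message.
    max_rounds defaults to 2, reflecting the paper's finding that gains beyond
    1-2 debugging rounds are marginal."""
    sql = llm(simpleddl_md_chat_prompt(ddl, question))
    for _ in range(max_rounds):
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql)   # executability check only; results are unused
            return sql          # the query runs, so stop debugging
        except sqlite3.Error as err:
            repair_prompt = (
                "The following SQLite query failed.\n\n"
                f"Query:\n{sql}\n\n"
                f"Error message: {err}\n\n"
                f"Schema:\n{ddl}\n\n"
                "Correct the query. Return only the fixed SQL."
            )
            sql = llm(repair_prompt)
        finally:
            conn.close()
    return sql
```

The loop stops as soon as a query executes, which mirrors the observation that most of the benefit comes from the first one or two repair rounds.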
For SQL Optimization, in-context learning methods struggle to achieve effective optimization with LLMs. Using a general-purpose model to produce a semantic description of a SQL statement proves more effective than using coding-specific models. Similarly, in SQL-to-Text, general-purpose models such as ChatGPT and InternLM2 outperform coding-specific models.

For Schema Linking, the study introduces a new metric, RES, to evaluate schema linking methods. Coding-specific models excel with the PreSQL approach, while general-purpose models benefit from combining Few-Shot examples with PreSQL. Including foreign key information in the prompt further improves schema linking, since it helps models retrieve more ground-truth tables by indicating table pairs that may be involved in JOIN operations.

Overall, the study provides valuable insights into the capabilities and limitations of LLMs in Text-to-SQL tasks, highlighting the importance of careful model selection and prompt engineering in achieving optimal outcomes. The findings contribute to the development of more reliable Text-to-SQL systems.
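As a rough illustration of the PreSQL idea, the sketch below asks a model for a preliminary query over the full schema (foreign keys included) and then treats the tables that query references as the linked schema. The prompt text, the `llm` callable, and the regex-based table extraction are assumptions made for illustration, not the paper's implementation.

```python
import re


def presql_schema_linking(llm, ddl_with_foreign_keys: str, question: str) -> set[str]:
    """PreSQL-style schema linking sketch: request a preliminary SQL query over
    the full schema (foreign keys hint at which table pairs may be JOINed),
    then keep only the tables that the preliminary query references."""
    prompt = (
        "### Database schema (with foreign keys)\n"
        f"{ddl_with_foreign_keys}\n\n"
        "### Question\n"
        f"{question}\n\n"
        "Write a preliminary SQL query for this question. Return only the SQL."
    )
    preliminary_sql = llm(prompt)
    # Naive extraction: table names that follow FROM or JOIN keywords.
    tables = re.findall(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)",
                        preliminary_sql, flags=re.IGNORECASE)
    return set(tables)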