26 Feb 2024 | Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen
The paper introduces CodeS, a series of open-source pre-trained language models designed for the text-to-SQL task. The models range from 1B to 15B parameters, are optimized specifically for SQL generation, and aim to address the limitations of closed-source large language models (LLMs) such as ChatGPT and GPT-4. CodeS is built on StarCoder, an LLM designed for code generation, and is incrementally pre-trained on a curated SQL-centric corpus. The paper outlines the challenges in developing CodeS, including schema linking and rapid domain adaptation, and presents solutions such as a schema filtering strategy and a bi-directional data augmentation technique. Extensive evaluations on multiple benchmarks, including Spider, BIRD, and real-world datasets, demonstrate that CodeS achieves state-of-the-art accuracy and robustness in text-to-SQL tasks.
The paper also discusses related work, provides a detailed overview of the CodeS framework, and presents experimental results to validate the effectiveness of CodeS.
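
To make the idea of schema filtering concrete, the following is a minimal sketch of how a database schema might be pruned before it is placed in a text-to-SQL prompt. The summary above does not say how the paper scores relevance, so this sketch swaps in a simple token-overlap score between the question and the table and column names. The function names (`tokenize`, `filter_schema`, `build_prompt`), the `top_tables` and `top_columns` parameters, the prompt layout, and the example schema are all hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (this also splits snake_case names)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_schema(question, schema, top_tables=2, top_columns=3):
    """Keep the tables and columns whose names overlap most with the question.

    Token overlap is an illustrative stand-in for a relevance score;
    it is an assumption, not the scoring method used by CodeS.
    """
    question_tokens = tokenize(question)

    def overlap(name):
        return len(tokenize(name) & question_tokens)

    # Score each table by its own name plus the names of its columns.
    table_scores = {
        table: overlap(table) + sum(overlap(col) for col in columns)
        for table, columns in schema.items()
    }
    kept_tables = sorted(schema, key=lambda t: table_scores[t], reverse=True)[:top_tables]

    # Within each kept table, keep only the highest-scoring columns.
    return {
        table: sorted(schema[table], key=overlap, reverse=True)[:top_columns]
        for table in kept_tables
    }


def build_prompt(question, filtered_schema):
    """Serialize the filtered schema and the question into a plain-text prompt (hypothetical layout)."""
    lines = [f"table {t} ( {', '.join(cols)} )" for t, cols in filtered_schema.items()]
    return "\n".join(lines + [f"question: {question}", "SQL:"])


if __name__ == "__main__":
    example_schema = {
        "singer": ["singer_id", "name", "country", "age"],
        "concert": ["concert_id", "concert_name", "year", "stadium_id"],
        "stadium": ["stadium_id", "location", "capacity"],
    }
    example_question = "What is the name and country of each singer older than 30?"
    print(build_prompt(example_question, filter_schema(example_question, example_schema)))
```

Pruning the schema this way keeps the prompt short for databases with many tables, which matters most for the smaller models in the series. A lexical score like this one misses synonyms (for example, "older" does not match the `age` column), which is one reason a learned relevance scorer would be preferable in practice.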