Understanding CodeS: Towards Building Open-source Language Models for Text-to-SQL

This paper introduces CodeS, a series of open-source language models designed for text-to-SQL, with parameter sizes ranging from 1B to 15B. CodeS is built on StarCoder, a code generation model, and is incrementally pre-trained on a curated SQL-centric corpus to strengthen both SQL generation and natural language comprehension. To improve schema linking and domain adaptation, the paper proposes a bi-directional data augmentation technique. The paper also discusses the main challenges in developing CodeS: giving small models complex reasoning ability, constructing high-quality prompts for schema linking, and adapting to new domains.

The models are evaluated under both supervised fine-tuning and few-shot in-context learning on multiple benchmarks, including Spider and BIRD, as well as the robustness-diagnostic benchmarks Spider-DK, Spider-Syn, Spider-Realistic, and Dr.Spider. The results show that CodeS achieves new state-of-the-art accuracy and robustness on nearly all challenging text-to-SQL benchmarks. The paper concludes that CodeS is a promising open-source solution for text-to-SQL, with the potential to enhance the performance of existing models and enable broader adoption in real-world applications.
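To make the ideas of schema linking and few-shot in-context learning more concrete, here is a minimal sketch of how a text-to-SQL prompt might be assembled. It is not the paper's method: the token-overlap heuristic in `filter_schema`, the prompt layout in `build_prompt`, and all function and parameter names are assumptions chosen for illustration. The summary above does not specify how the paper filters schemas or formats prompts.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_schema(schema, question, top_k=2):
    """Keep the top_k tables whose names or columns overlap most with the question.

    Hypothetical heuristic: plain token overlap stands in for a real
    schema-linking component.
    """
    q_tokens = tokenize(question)

    def score(item):
        table, columns = item
        words = tokenize(table)
        for col in columns:
            words |= tokenize(col)
        return len(words & q_tokens)

    # sorted() is stable, so tables with equal scores keep their original order.
    ranked = sorted(schema.items(), key=score, reverse=True)
    return dict(ranked[:top_k])


def build_prompt(schema, question, demos):
    """Assemble a few-shot prompt: demonstrations, then schema, then the question.

    The layout (SQL comments followed by a trailing SELECT) is an assumed
    format, not one taken from the paper.
    """
    lines = []
    for demo_question, demo_sql in demos:
        lines.append(f"-- Question: {demo_question}")
        lines.append(demo_sql)
        lines.append("")
    lines.append("-- Schema:")
    for table, columns in schema.items():
        lines.append(f"-- {table}({', '.join(columns)})")
    lines.append(f"-- Question: {question}")
    lines.append("SELECT")
    return "\n".join(lines)
```

In this sketch, filtering the schema before building the prompt keeps the input short, which matters most for the smaller models in a 1B to 15B range. The resulting string would be passed to a language model, which is expected to continue the query after the trailing `SELECT`.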

CodeS: Towards Building Open-source Language Models for Text-to-SQL

2024 | Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen