PURPLE: Making a Large Language Model a Better SQL Writer

29 Mar 2024 | Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, X. Sean Wang
The paper "PURPLE: Making a Large Language Model a Better SQL Writer" by Tonghui Ren et al. addresses the challenge of improving the accuracy and reliability of Natural Language to SQL (NL2SQL) translations using Large Language Models (LLMs). The authors propose PURPLE, a novel approach that enhances LLMs by retrieving demonstrations containing the necessary logical operator compositions for the NL2SQL task. This method guides LLMs to produce more accurate and semantically correct SQL queries. Key contributions of PURPLE include: 1. **Schema Pruning**: Reduces the input length by excluding irrelevant schema items, simplifying the inference task for LLMs. 2. **Skeleton Prediction**: Uses a fine-tuned PLM to predict SQL skeletons, which help identify the requisite logical operator compositions. 3. **Demonstration Selection**: Selects demonstrations based on a four-level abstraction hierarchy of SQL composition knowledge, enhancing generalization and fuzzification. 4. **Database Adaption**: Adapts the generated SQL to specific database schemas and SQL dialects, addressing hallucination issues. The evaluation on four popular benchmarks (Spider, Spider-DK, SpiderSYN, and Spider-Realistic) shows that PURPLE achieves state-of-the-art performance, with an exact-set match accuracy of 80.5% and execution match accuracy of 87.8% on the Spider validation set. PURPLE also demonstrates robustness and cost-effectiveness, outperforming existing LLMs-based and PLMs-based approaches in various scenarios.The paper "PURPLE: Making a Large Language Model a Better SQL Writer" by Tonghui Ren et al. addresses the challenge of improving the accuracy and reliability of Natural Language to SQL (NL2SQL) translations using Large Language Models (LLMs). The authors propose PURPLE, a novel approach that enhances LLMs by retrieving demonstrations containing the necessary logical operator compositions for the NL2SQL task. This method guides LLMs to produce more accurate and semantically correct SQL queries. Key contributions of PURPLE include: 1. **Schema Pruning**: Reduces the input length by excluding irrelevant schema items, simplifying the inference task for LLMs. 2. **Skeleton Prediction**: Uses a fine-tuned PLM to predict SQL skeletons, which help identify the requisite logical operator compositions. 3. **Demonstration Selection**: Selects demonstrations based on a four-level abstraction hierarchy of SQL composition knowledge, enhancing generalization and fuzzification. 4. **Database Adaption**: Adapts the generated SQL to specific database schemas and SQL dialects, addressing hallucination issues. The evaluation on four popular benchmarks (Spider, Spider-DK, SpiderSYN, and Spider-Realistic) shows that PURPLE achieves state-of-the-art performance, with an exact-set match accuracy of 80.5% and execution match accuracy of 87.8% on the Spider validation set. PURPLE also demonstrates robustness and cost-effectiveness, outperforming existing LLMs-based and PLMs-based approaches in various scenarios.