20 Jan 2024 | Guozheng Li, Xinyu Wang, Gerile Aodeng, Shunyuan Zheng, Yu Zhang, Chuangxin Ou, Song Wang, and Chi Harold Liu
This paper evaluates the capability of large language models (LLMs) to generate visualization specifications from natural language queries, i.e., the natural language to visualization (NL2VIS) task. The evaluation uses GPT-3.5 as the model, Vega-Lite as the target visualization grammar, and nvBench as the benchmark. The study employs both zero-shot and few-shot prompting strategies to assess how well GPT-3.5 generates Vega-Lite specifications. The results show that GPT-3.5 outperforms previous NL2VIS approaches, and that few-shot prompts perform significantly better than zero-shot prompts. However, the study also identifies limitations of GPT-3.5, such as misunderstanding data attributes and producing specifications that violate the Vega-Lite grammar. The paper discusses these limitations and suggests directions for improving the NL2VIS benchmark, including correcting ground-truth visualizations and reducing ambiguity in the natural language queries. The paper's contributions are an evaluation of LLM capability on the NL2VIS task and insights and directions for making LLM-based visualization generation more effective.
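To make the zero-shot versus few-shot setup concrete, below is a minimal sketch of how such prompts might be constructed and sent to GPT-3.5 via the OpenAI Python client. The table schema, queries, prompt wording, and the few-shot example are illustrative assumptions, not the paper's exact prompts or nvBench data.

```python
# Hedged sketch: prompting GPT-3.5 to emit a Vega-Lite spec for an NL query.
# Assumes the OpenAI v1 Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

table_schema = "cars(name, horsepower, mpg, origin)"  # hypothetical nvBench-style table
nl_query = "Show the average mpg for each origin as a bar chart."

instruction = (
    "You translate natural language queries into Vega-Lite specifications.\n"
    f"Table schema: {table_schema}\n"
)

# Zero-shot: only the instruction, schema, and target query.
zero_shot_prompt = (
    instruction
    + f"Query: {nl_query}\n"
    + "Return only a valid Vega-Lite JSON specification."
)

# Few-shot: prepend one worked (query, specification) pair before the target query.
few_shot_example = (
    "Query: Show the number of cars per origin as a bar chart.\n"
    'Specification: {"mark": "bar", "encoding": {'
    '"x": {"field": "origin", "type": "nominal"}, '
    '"y": {"aggregate": "count", "type": "quantitative"}}}\n'
)
few_shot_prompt = (
    instruction
    + few_shot_example
    + f"Query: {nl_query}\n"
    + "Return only a valid Vega-Lite JSON specification."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)  # candidate Vega-Lite specification
```

In a benchmark run along these lines, the returned text would then be parsed as JSON and validated against the Vega-Lite schema before being compared with the nvBench ground truth, which is where grammar-violating outputs like those the paper reports would surface.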