Visualization Generation with Large Language Models: An Evaluation


20 Jan 2024 | Guozheng Li, Xinyu Wang, Gerile Aodeng, Shunyuan Zheng, Yu Zhang, Chuangxin Ou, Song Wang, and Chi Harold Liu
This paper evaluates the capability of large language models (LLMs) in generating visualization specifications from natural language queries (NL2VIS). The study uses GPT-3.5 as the LLM and Vega-Lite as the visualization grammar. The evaluation is conducted on the nvBench dataset, using both zero-shot and few-shot prompt strategies. The results show that GPT-3.5 outperforms previous NL2VIS approaches, with few-shot prompts performing better than zero-shot prompts. However, GPT-3.5 has limitations, such as misunderstanding data attributes and producing grammar errors in the generated specifications.

The paper also identifies issues in the existing NL2VIS benchmark, such as ambiguous queries and incorrect visualization results, and suggests that correcting ground truths and reducing ambiguity in queries would strengthen the benchmark. The evaluation highlights the potential of LLMs for NL2VIS and the importance of choosing appropriate prompt strategies. While GPT-3.5 is effective, further work is needed on understanding complex queries and generating accurate specifications. Overall, the study contributes insights into the capabilities and limitations of LLMs for NL2VIS and proposes directions for improving the benchmark.
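To make the zero-shot versus few-shot distinction concrete, the sketch below builds both kinds of prompt for an NL2VIS query. The template wording, table schema, and example Vega-Lite specification are illustrative assumptions, not the paper's actual prompts; a few-shot prompt simply prepends worked (query, specification) pairs before the target query.

```python
import json

# Hypothetical zero-shot template; the paper's exact wording is an assumption.
ZERO_SHOT_TEMPLATE = (
    "Given the table '{table}' with columns {columns}, "
    "write a Vega-Lite specification for: {query}"
)

# Illustrative Vega-Lite spec of the kind the LLM is expected to emit.
EXAMPLE_SPEC = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "sales", "type": "quantitative"},
    },
}


def build_few_shot_prompt(examples, table, columns, query):
    """Prepend worked (query, spec) examples before the target query."""
    parts = []
    for ex_query, ex_spec in examples:
        parts.append(f"Query: {ex_query}\nSpec: {json.dumps(ex_spec)}")
    parts.append(ZERO_SHOT_TEMPLATE.format(table=table, columns=columns, query=query))
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    examples=[("Show total sales per category as a bar chart.", EXAMPLE_SPEC)],
    table="sales",
    columns=["category", "sales"],
    query="Plot average sales per category.",
)
```

A zero-shot prompt corresponds to calling the template directly, with the `examples` list empty; the few-shot variant adds demonstrations, which the paper found improves accuracy on nvBench.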