Generating Automatic Feedback on UI Mockups with Large Language Models

19 Mar 2024 | Peitong Duan, Jeremy Warner, Yang Li, and Bjoern Hartmann
The paper explores the use of large language models (LLMs), particularly GPT-4, to automate heuristic evaluation of user interface (UI) mockups. The authors developed a Figma plugin that takes a UI design and a set of written heuristics and generates automatic feedback in the form of constructive suggestions, allowing designers to iteratively revise their mockups based on the LLM's feedback.

The study comprises three components: a performance study in which three designers rated the accuracy and helpfulness of GPT-4's suggestions for 51 UIs; a heuristic evaluation study in which 12 expert designers manually identified guideline violations in 12 UIs; and an iterative usage study in which another group of 12 designers refined UIs based on GPT-4's feedback.

The results show that GPT-4 generally provides accurate and helpful feedback, but its performance declines over iterations as the design improves. Despite this, participants found the tool useful for catching subtle errors, improving text, and considering UI semantics. The study also highlights GPT-4's limitations, such as difficulty handling complex UIs and a tendency to hallucinate. Overall, the paper suggests that while LLMs may not replace human heuristic evaluation, they can be useful in design practice, especially for catching subtle errors and improving text.
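The core mechanism described above, serializing a UI design and pairing it with written heuristics in a prompt for an LLM, can be sketched as follows. This is a minimal illustration, not the paper's actual plugin code: the function name `build_feedback_prompt`, the JSON element-tree shape, and the wording of the instructions are all assumptions for the sake of example.

```python
import json

def build_feedback_prompt(ui_json: dict, heuristics: list[str]) -> str:
    """Assemble a prompt asking an LLM to check a UI mockup against heuristics.

    Hypothetical sketch: the real plugin's serialization of Figma frames and
    its prompt wording are not specified here.
    """
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(heuristics, 1))
    return (
        "You are a UI design reviewer. Evaluate the following UI mockup, "
        "given as a JSON element tree, against each heuristic below. "
        "List any violations as constructive suggestions, one per line.\n\n"
        f"Heuristics:\n{numbered}\n\n"
        f"UI mockup:\n{json.dumps(ui_json, indent=2)}"
    )

# Toy mockup with a deliberate typo a reviewer might flag.
ui = {
    "type": "FRAME",
    "name": "Login",
    "children": [{"type": "TEXT", "characters": "Entr password"}],
}
heuristics = [
    "Use consistent, standard wording free of typos.",
    "Error messages should suggest a fix.",
]
prompt = build_feedback_prompt(ui, heuristics)
```

The resulting string would then be sent to the LLM (e.g., via an API call), and the returned suggestions surfaced to the designer inside the plugin for iterative revision.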