April 2024 | JIALU ZHANG*, University of Waterloo, Canada; JOSÉ PABLO CAMBRONERO†, Microsoft, USA; SUMIT GULWANI†, Microsoft, USA; VU LE†, Microsoft, USA; RUZICA PISKAC†, Yale University, USA; GUSTAVO SOARES†, Microsoft, USA; GUST VERBRUGGEN†, Microsoft, Belgium
PyDex is an automated program repair (APR) system designed to fix bugs in introductory Python assignments. It leverages a large language model (LLM) trained on code, such as Codex (a version of GPT), to address both syntactic and semantic mistakes. The system combines multimodal prompts, iterative querying, test-case-based few-shot learning, and program chunking to generate and refine repair candidates. PyDex is evaluated on 286 real student programs from an introductory Python course at a major university in India.
Compared to three baselines, namely BIFI (a syntax repair tool), Refactory (a semantic repair tool), and GenProg (a semantic repair tool based on genetic programming), PyDex achieves a higher repair rate (86.71%) and produces smaller patches (29.68 tokens on average) even without few-shot learning. With few-shot learning, PyDex's repair rate climbs to 96.5%, outperforming the baselines in both repair rate and patch size. The evaluation also highlights the importance of design choices, such as program chunking, iterative querying, and multimodal prompts, in achieving these results. PyDex's approach demonstrates the potential of LLMs in educational settings to provide effective and efficient feedback for students.
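
To make the two reported metrics concrete, the following is a minimal sketch of how a repair candidate might be checked against test cases and how patch size might be measured as a token-level edit distance. It is an illustration under stated assumptions, not PyDex's implementation: the regex tokenizer, the helper names (`tokenize_source`, `token_edit_distance`, `passes_tests`, `smallest_passing_repair`), and the choice to run candidates with `exec` are all hypothetical, and a real system would sandbox execution and guard against non-terminating programs.

```python
import re


def tokenize_source(source: str) -> list[str]:
    # Crude lexer (assumption): runs of word characters form one token,
    # every other non-whitespace character is a token on its own.
    # It also works on programs that do not parse.
    return re.findall(r"\w+|[^\w\s]", source)


def token_edit_distance(a: list[str], b: list[str]) -> int:
    # Levenshtein distance over token lists (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        curr = [i]
        for j, tok_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                      # delete tok_a
                            curr[j - 1] + 1,                  # insert tok_b
                            prev[j - 1] + (tok_a != tok_b)))  # substitute
        prev = curr
    return prev[-1]


def passes_tests(source: str, func_name: str, tests) -> bool:
    # Run the candidate and compare outputs with expected values.
    # Any error (including a SyntaxError) counts as a failed candidate.
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def smallest_passing_repair(buggy: str, candidates, func_name: str, tests):
    # Among candidates that pass all tests, return the one closest
    # to the student's program; return None if no candidate passes.
    buggy_tokens = tokenize_source(buggy)
    passing = [c for c in candidates if passes_tests(c, func_name, tests)]
    if not passing:
        return None
    return min(passing,
               key=lambda c: token_edit_distance(buggy_tokens, tokenize_source(c)))


# Hypothetical student submission and hypothetical LLM-generated candidates.
buggy = "def double(x):\n    return x + 2\n"
candidates = [
    "def double(x):\n    return x * 2\n",             # one-token change
    "def double(x):\n    y = x\n    return y + y\n",  # correct but larger patch
    "def double(x)\n    return x * 2\n",              # still has a syntax error
]
tests = [((3,), 6), ((0,), 0)]
best = smallest_passing_repair(buggy, candidates, "double", tests)
```

In this toy example the first candidate is selected: it passes both tests and differs from the student's program by a single token, whereas the second candidate is also correct but rewrites more of the original code. Preferring the smaller patch reflects the pedagogical motivation of keeping feedback close to what the student wrote.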