Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks


2024 | Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung
The paper introduces Syntax-Aware Fill-in-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on code Fill-in-the-Middle (FIM) tasks. SAFIM focuses on syntax-aware completion of program structures such as code blocks and conditional expressions, and comprises 17,720 examples across multiple programming languages, sourced from code submissions created after April 2022 to minimize data contamination. The benchmark provides a robust evaluation framework with several prompt designs and syntax-aware post-processing techniques, enabling accurate and fair comparisons across LLMs. An evaluation of 15 LLMs on SAFIM shows that FIM pretraining not only improves FIM proficiency but also improves Left-to-Right (L2R) inference. The findings challenge conventional beliefs and suggest that pretraining methods and data quality matter more than model size. SAFIM is intended as a foundational platform for future research on effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at <https://github.com/gonglinyuan/safim>, and the leaderboard is available at <https://safimbenchmark.com>.

Recent advances in LLMs have transformed coding tasks, but existing benchmarks such as HumanEval and MBPP focus on generating standalone functions or single-file code from natural language descriptions and do not reflect the common practice of modifying and expanding existing code. SAFIM addresses this gap by emphasizing syntax-aware completion within a program's Abstract Syntax Tree (AST), targeting algorithmic blocks, control-flow expressions, and API function calls.

The benchmark is constructed from Codeforces and GitHub corpora restricted to code created after April 2022, avoiding overlap with major pretraining datasets and thereby reducing data contamination. The corpora are processed into structured FIM tasks across three splits: algorithmic block completion, control-flow completion, and API function call completion. Evaluation combines execution-based testing and syntactical match, with an emphasis on robust and fair comparisons.

The paper also introduces a suite of prompt designs and a syntax-aware truncation algorithm to refine model outputs. The prompt designs include Left-to-Right (L2R), Prefix-Suffix-Middle (PSM), Suffix-Prefix-Middle (SPM), and Instructed Prefix Feeding (IPF).
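As a rough illustration of how PSM- and SPM-style infilling prompts are assembled, the sketch below builds both layouts from a prefix and a suffix. The `<PRE>`, `<SUF>`, and `<MID>` sentinel strings are placeholders: real code LLMs define their own model-specific sentinel tokens, so the exact strings and ordering here are assumptions for illustration, not SAFIM's fixed choice.

```python
# Minimal sketch of PSM and SPM infilling prompt layouts.
# The sentinel strings <PRE>/<SUF>/<MID> are placeholders; actual models
# define their own special tokens, so these are illustrative assumptions.

def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Prefix-Suffix-Middle: the model sees the prefix, then the suffix,
    and generates the missing middle after the <MID> sentinel."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

def build_spm_prompt(prefix: str, suffix: str) -> str:
    """Suffix-Prefix-Middle: the suffix is shown first, then the prefix;
    the model's continuation of the prefix is taken as the middle."""
    return f"<SUF>{suffix}<PRE>{prefix}"

if __name__ == "__main__":
    prefix = "def absolute(x):\n    "
    suffix = "\n\nprint(absolute(-3))\n"
    print(build_psm_prompt(prefix, suffix))
    print(build_spm_prompt(prefix, suffix))
```

L2R, by contrast, prompts the model with only the prefix, while IPF additionally frames the infilling task as an instruction before feeding the prefix.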
The syntax-aware truncation algorithm ensures precise extraction of the targeted code structure from raw model output, enhancing the quality of FIM completions and enabling fair comparisons across different models.

The experimental results highlight the impact of prompt design and syntax-aware truncation on measured performance. FIM pretraining improves both FIM and L2R performance, and smaller models with more sophisticated pretraining paradigms can match or outperform larger counterparts. The study therefore emphasizes the importance of pretraining methods and data quality over sheer model size, challenging the common belief that larger models always perform better.
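To make the syntax-aware truncation idea concrete, the sketch below is a simplified, Python-only stand-in for the paper's algorithm (which operates on ASTs across several languages): it keeps the shortest line-prefix of the raw model output that, when placed between the surrounding code, yields a file that parses.

```python
# Simplified stand-in for syntax-aware truncation (not the paper's exact
# algorithm): cut the raw completion at the earliest line boundary where the
# reassembled file is syntactically valid Python.
import ast

def truncate_completion(prefix: str, raw_output: str, suffix: str) -> str:
    """Return the shortest line-prefix of raw_output that parses in context."""
    lines = raw_output.splitlines()
    for end in range(1, len(lines) + 1):
        candidate = "\n".join(lines[:end])
        try:
            ast.parse(prefix + candidate + suffix)  # whole-file syntax check
            return candidate
        except SyntaxError:
            continue
    return raw_output  # nothing parsed; fall back to the untruncated output

if __name__ == "__main__":
    prefix = "def absolute(x):\n    "
    suffix = "\n\nprint(absolute(-3))\n"
    raw = "return x if x >= 0 else -x\n\ndef extra_unwanted():\n    pass"
    print(truncate_completion(prefix, raw, suffix))
    # -> "return x if x >= 0 else -x"
```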