The paper introduces the Syntax-Aware Fill-in-the-Middle (SAFIM) benchmark, a new evaluation framework for assessing Large Language Models (LLMs) on code Fill-in-the-Middle (FIM) tasks. SAFIM focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.
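To make the FIM setup concrete, each problem can be viewed as a (prefix, suffix, ground-truth middle) triple, where the model must generate the missing middle given the surrounding context. The minimal Python sketch below is purely illustrative; the field names are hypothetical and do not reflect SAFIM's actual data schema.

```python
# Illustrative sketch of a single FIM problem as a (prefix, suffix, middle) triple.
# Field names are hypothetical, not SAFIM's actual format.
fim_example = {
    "prefix": (
        "def count_even(nums):\n"
        "    total = 0\n"
        "    for x in nums:\n"
    ),
    "suffix": (
        "    return total\n"
    ),
    # The model is asked to produce the span between prefix and suffix.
    "ground_truth_middle": (
        "        if x % 2 == 0:\n"
        "            total += 1\n"
    ),
}
```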
SAFIM emphasizes syntax-aware completion within code's Abstract Syntax Tree (AST), targeting algorithmic blocks, control-flow expressions, and API function calls. It is sourced from code on Codeforces and GitHub created after April 2022, deliberately aiming to avoid overlap with mainstream open-source pretraining corpora. SAFIM, with its 17,720 examples from 8,590 code files, not only surpasses the scale of HumanEval-Infilling but also expands the scope to include multiple programming languages. SAFIM primarily relies on execution-based evaluation, and uses syntactical match evaluation only when execution is not feasible due to external API calls.
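The three completion categories can be illustrated on a toy function. The sketch below is a hypothetical example of what each task type might mask, not an instance drawn from the dataset.

```python
# Hypothetical illustration of SAFIM's three task categories on a toy function.
def moving_average(xs, k):
    result = []
    # Control-flow expression task: mask an expression that governs control flow,
    # e.g. "range(len(xs) - k + 1)" below.
    for i in range(len(xs) - k + 1):
        window = xs[i:i + k]
        # Algorithmic block task: mask a whole block of logic,
        # e.g. the statement computing and appending the window mean.
        result.append(sum(window) / k)
    return result

# API function call task: mask a call to an external library,
# e.g. "np.mean(window)" in a variant of this function that uses numpy.
```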
A comprehensive evaluation of 15 LLMs on SAFIM demonstrates its effectiveness in providing a fair comparison across models. The paper implements five distinct prompt designs to accommodate different model types and introduces a syntax-aware truncation algorithm for post-processing model outputs. It also analyzes the impact of prompt design, the efficacy of syntax-aware truncation, and the relative performance of the evaluated LLMs across task types. The findings challenge conventional beliefs: pretraining methods and data quality matter more than sheer model size, and FIM pretraining can enhance, rather than harm, Left-to-Right (L2R) inference capabilities. The paper concludes that SAFIM provides a comprehensive assessment of LLMs' coding capabilities across multiple dimensions and establishes a foundational framework for future research into pretraining paradigms and the development of better LLMs for coding tasks.
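To convey the idea behind syntax-aware truncation, the following is a simplified, Python-only sketch under the assumption that the raw generation is cut at the earliest point where the assembled program parses. SAFIM's actual algorithm is AST-based and works across languages; this is only meant to illustrate the principle.

```python
import ast

def truncate_completion(prefix: str, suffix: str, generation: str) -> str:
    """Simplified sketch of syntax-aware truncation (Python only): keep the
    shortest line-prefix of the raw generation such that the assembled
    program still parses."""
    lines = generation.splitlines(keepends=True)
    for i in range(1, len(lines) + 1):
        candidate = "".join(lines[:i])
        try:
            ast.parse(prefix + candidate + suffix)
            return candidate  # earliest syntactically valid cut point
        except SyntaxError:
            continue
    return generation  # fall back to the untruncated output
```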