July 2024 | MD RAKIB HOSSAIN MISU, CRISTINA V. LOPES, IRIS MA, JAMES NOBLE
This paper explores how large language models (LLMs) can be used to synthesize verified methods in Dafny, a verification-aware programming language. The study evaluates three prompt types for generating correct Dafny code with formal specifications: Contextless, Signature, and Chain of Thought (CoT) with retrieval augmentation. Using 178 problems from the MBPP dataset, the researchers tested GPT-4 and PaLM-2 and found that GPT-4 outperformed PaLM-2 in generating verified Dafny methods. The best results were achieved with the CoT prompt, which includes retrieval augmentation and step-by-step problem decomposition: GPT-4 generated verified Dafny methods for 58% of the problems, while the Contextless and Signature prompts yielded lower success rates. The study contributes 153 verified Dafny solutions, 50 written manually and 103 synthesized by GPT-4.
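To make the task concrete, the sketch below shows what a verified Dafny method with a formal specification looks like. It is a minimal, hypothetical example in the spirit of an MBPP-style problem; the method name and its contracts are illustrative and not drawn from the paper's dataset. The ensures clauses are the postconditions that the Dafny verifier proves hold for every input, which is what distinguishes a verified method from merely passing tests.

```dafny
// Hypothetical MBPP-style problem: return the larger of two integers.
// The ensures clauses are the formal specification; Dafny proves the
// body satisfies them for all inputs, with no test cases required.
method Max(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b
  ensures m == a || m == b
{
  if a >= b {
    m := a;
  } else {
    m := b;
  }
}
```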
The results show that formal program verification can benefit code-generating LLMs, and that LLMs can assist in program verification by synthesizing code, generating specifications, or acting as a "programmer's verification apprentice." The approach of generating candidate solutions that are subsequently checked formally for correctness can also be applied to other domains where correctness must be demonstrated. The paper introduces a new dataset, MBPP-DFY-153, containing 153 programming problems with specifications, solutions, and tests in Dafny. The findings suggest that LLMs can generate verified code effectively, especially when prompted with CoT and retrieval augmentation. The study also identifies challenges in generating formal specifications and invariants, and highlights the importance of correct postconditions for verification. Overall, the research demonstrates the potential of LLMs in formal software verification and suggests that further exploration of this area could lead to significant improvements in code generation and verification.
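The sketch below illustrates why invariants and postconditions are the hard part: without a loop invariant tying the running total to the specification function, Dafny cannot prove the ensures clause even when the code is correct. It is a hypothetical example under assumed names (Sum and SumTo), not one of the MBPP-DFY-153 solutions.

```dafny
// Hypothetical sketch: sum the elements of a sequence.
// SumTo is the specification function; the loop invariant relates the
// running total to SumTo so the postcondition can be discharged.
function SumTo(s: seq<int>, n: nat): int
  requires n <= |s|
{
  if n == 0 then 0 else SumTo(s, n - 1) + s[n - 1]
}

method Sum(s: seq<int>) returns (total: int)
  ensures total == SumTo(s, |s|)
{
  total := 0;
  var i := 0;
  while i < |s|
    invariant 0 <= i <= |s|
    invariant total == SumTo(s, i)   // without this, verification fails
  {
    total := total + s[i];
    i := i + 1;
  }
}
```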