2024-09-04 | Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy
This paper presents FSTARDATASET, a large corpus of 600K lines of open-source F* programs and proofs, including software used in production systems such as Windows, Linux, Python, and Firefox. The dataset contains around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem. The authors also provide a program-fragment checker that queries F* to check the correctness of candidate solutions. They report on an extended version of the dataset containing 940K lines of programs and proofs, with 54K top-level F* definitions.
The authors investigate the use of AI to synthesize programs and their proofs in F*, with promising results. They find that fine-tuned smaller language models (such as Phi-2 or StarCoder) perform comparably to large language models (such as GPT-4) at a much lower computational cost. They also identify various type-based retrieval augmentation techniques that significantly boost performance.
The authors design and evaluate neural synthesis techniques, applying a standard regime of prompting large language models (LLMs) to generate solutions, backed by retrieval augmentation and fine-tuning techniques specific to their setting. Leveraging the surrounding file context together with retrieval augmentation significantly boosts the quality of results.
The authors also describe a benchmark problem based on the dataset: given a type serving as a formal specification, synthesize a program that has that type. They classify benchmark instances into three categories: simply typed, dependently typed, and fully specified proofs.
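To make the three categories concrete, here is a sketch of how a synthesis harness might represent benchmark instances. The F* goal types below are hypothetical illustrations of each category, not entries drawn from FSTARDATASET:

```python
from dataclasses import dataclass

@dataclass
class SynthesisProblem:
    name: str        # name of the definition to synthesize
    goal_type: str   # the F* type serving as the formal specification
    category: str    # "simply_typed" | "dependently_typed" | "proof"

# Illustrative instances (hypothetical, not from the dataset):
problems = [
    # Simply typed: an ordinary functional-programming type.
    SynthesisProblem("rev", "list int -> list int", "simply_typed"),
    # Dependently typed: a refinement constrains the valid inputs.
    SynthesisProblem("head", "l:list int{Cons? l} -> int",
                     "dependently_typed"),
    # Fully specified proof: the type is a lemma statement, and the
    # synthesized program is its proof.
    SynthesisProblem("rev_involutive",
                     "l:list int -> Lemma (rev (rev l) == l)", "proof"),
]
```

The categories grow progressively harder: a simply typed goal says little about behavior, a dependent type constrains it, and a lemma type pins it down completely.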
The authors evaluate the performance of various retrieval augmentation strategies and LLMs on this task. They find that the performance of prompted smaller models is significantly worse than that of larger models, but that fine-tuning significantly improves performance. They also find that fine-tuned models perform better on intra-project examples than on cross-project examples.
The authors also analyze the types of errors the models produce, finding that syntax errors are most common for GPT-3.5, while identifier-not-found errors are most common for GPT-4. Fine-tuned models produce fewer identifier-not-found errors.
The authors also evaluate the impact of different components of the prompt on model performance, finding that file context and related examples are very important for synthesizing definitions in F*. They also find that an ideal premise selection model could provide significant improvements, although their current premise selection model does not significantly impact performance.
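The prompt components the authors ablate (file context, retrieved related examples, selected premises, and the goal type) can be pictured with a small assembly sketch. The layout and comment markers below are illustrative assumptions, not the paper's actual prompt template:

```python
def build_prompt(goal_name: str, goal_type: str,
                 file_context: str,
                 related: list[tuple[str, str]],  # (type, definition) pairs
                 premises: list[str]) -> str:
    """Assemble a synthesis prompt from the components ablated in the
    paper: truncated file context, retrieved related examples, selected
    premises, and finally the goal type the model must inhabit."""
    parts = ["(* File context *)", file_context]
    if premises:
        parts += ["(* Possibly relevant premises *)"] + premises
    for ty, defn in related:
        parts += [f"(* Related example of type {ty} *)", defn]
    parts += [f"(* Synthesize a definition of the following type *)",
              f"val {goal_name} : {goal_type}",
              f"let {goal_name} ="]
    return "\n".join(parts)
```

Ablating a component then amounts to passing an empty context, an empty `related` list, or an empty `premises` list, which is how one would reproduce the finding that file context and related examples matter most.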
The authors also evaluate the impact of the maximum token budget, finding that allowing more tokens improves performance for both GPT-3.5 and GPT-4. Comparing different premise selection models, they find that fine-tuning improves performance across all models.
Finally, the authors evaluate alternative prompt preparation strategies, finding that natural language prompts perform slightly better than structured prompts, while completion-style prompts perform substantially worse.