4 Sep 2024 | Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy
The paper "Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming" by Saikat Chakraborty et al. explores the use of AI to automate the construction of proof-oriented programs in the F* programming language. The authors curate a large dataset of 600K lines of open-source F* programs and proofs, including software used in production systems. They provide a program-fragment checker to verify the correctness of candidate solutions and report on an extended dataset of 940K lines of programs and proofs, with 54K top-level F* definitions. The main contributions of the paper are:
1. **FSTARDATASET**: A dataset of F* programs and proofs, extracted from 2060 source files across 8 open-source projects, totaling 600K lines of code. This dataset is the largest corpus of SMT-assisted proof-oriented programs and includes a reproducible program-fragment checker.
2. **Benchmark Problem**: Given a type as a formal specification, the task is to synthesize a program that has that type. Each of the 32K top-level definitions in the dataset yields a synthesis problem: the definition's type serves as the goal type, and a candidate solution is accepted if it type-checks against that goal (a minimal sketch appears after this list).
3. **Neural Synthesis Techniques**: The authors apply large language models (LLMs) to generate solutions, backed by retrieval augmentation and fine-tuning techniques specific to the F* setting. They evaluate the performance of various LLMs, including GPT-3.5, GPT-4, and fine-tuned smaller models such as Phi-2, Orca-2, and StarCoder. Key findings include:
- Fine-tuned smaller models can match or outperform large language models at a lower computational cost.
- Different problem classes are solved with varying degrees of success, with common error modes differing between pre-trained and fine-tuned models.
- Leveraging contexts and retrieval augmentation significantly boosts the quality of results.
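To make the benchmark concrete, here is a minimal, hypothetical F* sketch (not drawn from FSTARDATASET; the module and definition names are invented for illustration). The `val` declaration plays the role of the goal type, and a candidate `let` definition is accepted only if it checks against that type; the first instance specifies a program via a refinement type, the second an SMT-assisted proof via a `Lemma`.

```fstar
module Sketch  (* hypothetical module name, assumes a file Sketch.fst *)

(* Program synthesis instance: the goal type is a refinement type that
   specifies the behaviour; any definition checking against it is accepted. *)
val max : x:int -> y:int -> z:int{z >= x /\ z >= y /\ (z = x \/ z = y)}
let max x y = if x >= y then x else y

(* Helper definitions so the lemma below is self-contained. *)
let rec append (#a:Type) (l1 l2:list a) : list a =
  match l1 with
  | [] -> l2
  | hd :: tl -> hd :: append tl l2

let rec length (#a:Type) (l:list a) : nat =
  match l with
  | [] -> 0
  | _ :: tl -> 1 + length tl

(* Proof synthesis instance: the goal type is a Lemma, and the definition
   is an SMT-assisted proof by induction on the first list. *)
val append_length (#a:Type) (l1 l2:list a)
  : Lemma (length (append l1 l2) = length l1 + length l2)
let rec append_length #a l1 l2 =
  match l1 with
  | [] -> ()
  | _ :: tl -> append_length tl l2
```

In the actual benchmark, goal types and any premises they mention are drawn from the dataset's own modules, and candidate definitions are validated with the paper's program-fragment checker.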
The paper also discusses the impact of different components in the prompt (file context, related examples, and selected premises) on model performance and provides a detailed analysis of the models' successes and failures. The authors conclude that while fine-tuned smaller models can effectively synthesize F* definitions and proofs, further improvements are needed in handling complex specifications and reducing errors.