Automated Unit Test Improvement using Large Language Models at Meta

November 15–19, 2024 | Nadia Alshahwan*, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang
This paper introduces TestGen-LLM, a tool developed by Meta that uses large language models (LLMs) to automatically improve existing human-written unit tests. TestGen-LLM verifies that each generated test case meets specific criteria: it must build, pass reliably, and increase coverage over the existing test class, thereby eliminating problems due to LLM hallucination. The tool has been deployed at Meta's test-a-thons for Instagram and Facebook, where it improved 11.5% of all classes to which it was applied, and 73% of its recommendations were accepted for production deployment. The paper details the development, deployment, and evaluation of TestGen-LLM, highlighting its measurable improvements and verifiable guarantees of non-regression. It also discusses the ensemble approach used to combine different LLMs and prompts, and reports qualitative and quantitative results from deployment. The primary contributions are the first industrial-scale deployment of LLM-generated code with guaranteed improvements, an evaluation of the LLMs and prompts used, and the insights gained from deploying TestGen-LLM in a large-scale production environment.
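The filtration criteria described above can be read as a sequence of progressively stricter checks on each candidate test. The sketch below is a hypothetical illustration of that pipeline, not Meta's implementation; the helper callables (`build_test`, `run_test`, `measure_coverage`) and the rerun count are assumptions introduced for this example.

```python
from typing import Callable, Optional

def filter_candidate(
    candidate: str,                            # LLM-generated test case (source text)
    build_test: Callable[[str], bool],         # assumed helper: True if the test compiles
    run_test: Callable[[str], bool],           # assumed helper: True if the test passes once
    measure_coverage: Callable[[str], float],  # assumed helper: coverage with candidate added
    baseline_coverage: float,                  # coverage of the existing test class
    repeats: int = 5,                          # assumed rerun count to screen out flaky tests
) -> Optional[str]:
    """Return the candidate if it survives all three filters, else None."""
    # Filter 1: the generated test must build.
    if not build_test(candidate):
        return None
    # Filter 2: the test must pass on every run, not just once,
    # so flaky tests are rejected along with failing ones.
    if not all(run_test(candidate) for _ in range(repeats)):
        return None
    # Filter 3: the test must measurably increase coverage;
    # this is what guarantees each recommendation is an improvement.
    if measure_coverage(candidate) <= baseline_coverage:
        return None
    return candidate
```

In the workflow the paper describes, candidates produced by each LLM-and-prompt configuration in the ensemble would pass through the same filters, and only surviving tests are recommended to engineers for review.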