14 May 2024 | Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele
This paper explores the capabilities of large language models (LLMs) in generating parallel code, a task that is increasingly important for modern software development given the prevalence of multi-core processors, GPUs, and distributed systems. The authors introduce ParEval, a benchmark of 420 prompts covering 12 computational problem types and 7 parallel programming models. They evaluate several state-of-the-art open- and closed-source LLMs on this benchmark and introduce novel metrics for assessing both the correctness and the performance of the generated code. The results show that all tested LLMs struggle with parallel code generation; GPT-3.5 performs best, with a pass@1 score of 76.0 for serial code but only 39.6 for parallel code.
The study also finds that LLMs perform especially poorly at MPI code generation and have difficulty with sparse, unstructured problems. Additionally, the paper examines the scalability and performance of the generated code, noting that the models with the highest correctness scores do not necessarily generate the most efficient code. The findings highlight areas where current LLMs can be improved and provide insights into both the limitations and the potential of LLMs for parallel code generation.
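For context on the reported scores: pass@k is the standard correctness metric for code-generation benchmarks, estimating the probability that at least one of k sampled generations passes the tests. A minimal sketch of the commonly used unbiased estimator (from Chen et al.'s HumanEval work, not specific to this paper) is below; the paper's own performance-oriented metrics extend this idea but are not reproduced here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n total generations for a
    problem, of which c pass the tests, estimate the probability
    that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must be correct.
        return 1.0
    # 1 minus the probability that all k draws are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations per prompt, 8 of which are correct.
print(pass_at_k(20, 8, 1))  # pass@1 equals the fraction correct, 0.4
```

A benchmark-level score such as the 76.0 and 39.6 quoted above is then (scaled to 100) the average of this per-problem estimate across all prompts.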