June 3–7, 2024 | Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele
This paper investigates the ability of large language models (LLMs) to generate parallel code. The authors introduce ParEval, a benchmark of 420 prompts covering 12 computational problem types and seven parallel programming models. Using ParEval, they evaluate several state-of-the-art open- and closed-source LLMs, including GPT-3.5, GPT-4, CodeLlama, StarCoderBase, and Phind-CodeLlama-V2, on their ability to generate correct and efficient parallel code.
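Each ParEval prompt pairs a natural-language description with a function signature that the model must complete in a given execution model. The sketch below is illustrative only (the function name, comment wording, and kernel are hypothetical, not taken from the benchmark); it shows what an OpenMP completion of such a prompt might look like:

```cpp
#include <vector>

/* Hypothetical ParEval-style prompt: "Sum the elements of x in parallel
   using OpenMP." The model receives the comment and signature and must
   generate the function body. */
double sumParallel(std::vector<double> const& x) {
    double total = 0.0;
    // A correct completion uses a reduction so threads do not race on
    // `total`. Compile with -fopenmp; without it the pragma is ignored
    // and the loop runs serially, producing the same result.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < (long)x.size(); ++i)
        total += x[i];
    return total;
}
```

Generated completions are then validated against reference outputs and timed, so a racy version without the reduction clause could compile yet still fail validation.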
The study finds that LLMs struggle with generating parallel code, particularly for MPI and sparse, unstructured problems. GPT-3.5 performs best in serial code generation, achieving a pass@1 score of 76.0, while it scores 39.6 for parallel code generation. Phind-CodeLlama-V2 performs best among open-source models, achieving a pass@1 score of 32 for parallel code generation. However, it still lags behind closed-source models by nearly 8 percentage points.
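pass@k scores like these report the probability that at least one of k sampled generations is correct. They are conventionally computed with the unbiased estimator of Chen et al. (2021): given n samples per problem, of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k), evaluated in a numerically stable product form. A small sketch (the helper name is mine):

```cpp
// Unbiased pass@k estimator: probability that at least one of k samples,
// drawn without replacement from n generations of which c are correct,
// passes the tests. Equivalent to 1 - C(n-c, k)/C(n, k), computed as a
// stable running product instead of large binomial coefficients.
double passAtK(int n, int c, int k) {
    if (n - c < k) return 1.0;  // every size-k subset contains a correct sample
    double failAll = 1.0;       // probability that all k picks are incorrect
    for (int i = n - c + 1; i <= n; ++i)
        failAll *= 1.0 - (double)k / i;
    return 1.0 - failAll;
}
```

With k = 1 this reduces to the fraction of correct samples, c/n, which is how single-number scores such as 76.0 or 39.6 arise (reported here as percentages).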
The authors introduce novel metrics for evaluating the runtime performance of generated code, speedup_n@k and efficiency_n@k, which measure the expected speedup and parallel efficiency of the best of k generated samples when run on n processors. These metrics show that generated code often achieves poor parallel speedup and efficiency, even when it is correct.
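The paper's exact formulas are not reproduced in this summary, but an "expected best speedup over k samples" can be estimated with order statistics: sort the per-sample speedups and weight each by the probability that it is the maximum of a randomly chosen k-subset. The sketch below is one plausible formulation under that reading (all names are mine; failed samples are counted as speedup 0):

```cpp
#include <vector>
#include <algorithm>
#include <cmath>

// Binomial coefficient C(n, k) as a double (small n, avoids overflow).
static double choose(int n, int k) {
    if (k < 0 || k > n) return 0.0;
    double r = 1.0;
    for (int i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Expected maximum speedup over a random k-subset of the n generated
// samples. `runtimes` holds each sample's wall-clock time on the parallel
// resource; tSerial is the sequential baseline. A failed sample can be
// encoded with an infinite runtime, which maps to speedup 0.
double expectedBestSpeedup(std::vector<double> const& runtimes,
                           double tSerial, int k) {
    std::vector<double> s;
    for (double t : runtimes)
        s.push_back(std::isfinite(t) && t > 0.0 ? tSerial / t : 0.0);
    std::sort(s.begin(), s.end());
    int n = (int)s.size();
    double total = 0.0;
    // s[i] (0-indexed, ascending) is the subset maximum exactly when the
    // other k-1 picks come from the i smaller values: C(i, k-1) subsets.
    for (int i = k - 1; i < n; ++i)
        total += s[i] * choose(i, k - 1);
    return total / choose(n, k);
}
```

An efficiency counterpart would divide each speedup by the processor count n before taking the expectation.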
The study also explores the ability of LLMs to translate code between execution models. The results show that providing an LLM with a correct implementation in one execution model improves its ability to generate a correct implementation in another, and this improvement is most pronounced for smaller open-source models.
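A translation prompt supplies a known-correct implementation in a source execution model and asks for the same routine in a target model. The pair below is a hypothetical illustration in that spirit, translating an AXPY kernel from serial to OpenMP (the function names are mine, not the benchmark's):

```cpp
#include <vector>
#include <cstddef>

// Given in the prompt as the known-correct source implementation
// (serial execution model): y = a*x + y.
void axpySerial(double a, std::vector<double> const& x,
                std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}

// What the model is asked to produce: the same routine in the target
// execution model (OpenMP). The loop carries no cross-iteration
// dependences, so a plain parallel-for is a correct translation.
// Compile with -fopenmp; without it the pragma is ignored.
void axpyOpenMP(double a, std::vector<double> const& x,
                std::vector<double>& y) {
    #pragma omp parallel for
    for (long i = 0; i < (long)x.size(); ++i)
        y[i] += a * x[i];
}
```

Having the serial body in context pins down the intended semantics, which plausibly explains why smaller models benefit most from translation prompts.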
The authors conclude that while LLMs can be useful for parallel code generation, they still struggle with complex tasks such as reasoning and planning. They suggest that further research is needed to improve the capabilities of LLMs in generating parallel code.