Large Language Model Evaluation Via Multi AI Agents: Preliminary Results

1 Apr 2024 | Zeeshan Rasheed, Muhammad Waseem, Kari Systä & Pekka Abrahamsson
This paper introduces a novel multi-agent AI model for evaluating the code-generation performance of various Large Language Models (LLMs). The model consists of eight AI agents, each of which sends a common task description to a different LLM and retrieves the generated code; the models include GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Google Bard, LLaMA, and models hosted on Hugging Face. A verification agent then evaluates the generated code against the HumanEval benchmark, assessing attributes such as syntactic correctness, adherence to the prompt, and computational efficiency, and scores accuracy with the pass@k metric.

Initial results show that GPT-3.5 Turbo performs best, achieving a 70% accuracy rate; GPT-4 Turbo follows closely, producing correct code for 6 of the 10 task descriptions. The study also stresses that LLMs should be evaluated not only for their technical capabilities but also for their societal impact and potential risks.

Future work includes integrating the Mostly Basic Programming Problems (MBPP) benchmark to refine the evaluation, expanding the set of input descriptions to 50 for a more comprehensive analysis, and sharing the model with twenty practitioners to gather feedback for further improvement. The study contributes to the ongoing discourse on the practical applications of LLMs, aims to guide stakeholders in making informed decisions when integrating these models into development workflows, and underscores both the need for robust evaluation frameworks and the potential of LLMs to automate code generation and software development.
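To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the multi-agent setup: several retrieval agents send the same task description to different LLMs, and a verification agent checks each returned snippet against HumanEval-style unit tests. The ModelClient callable, the agent names, and the verify helper are illustrative assumptions; real agents would call the respective model APIs and run the tests in a sandbox.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model client" is any callable mapping a task description to generated
# source code. In the paper each agent wraps a specific LLM (GPT-3.5, GPT-4,
# Google Bard, LLaMA, ...); here the clients are stand-in lambdas.
ModelClient = Callable[[str], str]


@dataclass
class CodeSample:
    model_name: str
    code: str


def collect_code(description: str, clients: Dict[str, ModelClient]) -> List[CodeSample]:
    """Retrieval agents: each one sends the same description to its LLM."""
    return [CodeSample(name, client(description)) for name, client in clients.items()]


def verify(sample: CodeSample, test_code: str) -> bool:
    """Verification agent: execute the candidate code, then the unit tests.

    HumanEval-style checking; a production setup would sandbox this exec call.
    """
    namespace: dict = {}
    try:
        exec(sample.code, namespace)   # define the candidate function(s)
        exec(test_code, namespace)     # asserts raise on failure
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # Illustrative stand-ins for real LLM calls.
    clients: Dict[str, ModelClient] = {
        "gpt-3.5-turbo": lambda d: "def add(a, b):\n    return a + b",
        "gpt-4-turbo": lambda d: "def add(a, b):\n    return a - b",  # deliberately wrong
    }
    tests = "assert add(2, 3) == 5"
    for sample in collect_code("Write add(a, b) returning the sum.", clients):
        print(sample.model_name, "passed" if verify(sample, tests) else "failed")
```

For reference, the pass@k metric used to score accuracy is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); the sketch below assumes that convention, where n code samples are drawn per task and c of them pass all unit tests.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: number of samples that pass all unit tests
    k: sample budget considered
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n = 10 samples per task, c = 7 passing, and k = 1, the estimator returns 0.7, matching the kind of per-model accuracy figures reported in the paper.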