Large Language Model Evaluation Via Multi AI Agents: Preliminary Results

1 Apr 2024 | Zeeshan Rasheed, Muhammad Waseem, Kari Systä & Pekka Abrahamsson
This paper introduces a novel multi-agent AI model for evaluating the code-generation performance of various Large Language Models (LLMs). The model consists of eight AI agents, each of which sends a common task description to a different LLM and retrieves the generated code; the models include GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Google Bard, LLaMA, and models hosted on Hugging Face. A verification agent then evaluates the generated code against the HumanEval benchmark, assessing attributes such as syntactic correctness, adherence to the prompt, and computational efficiency, and scores accuracy with the pass@k metric.

Initial results show that GPT-3.5 Turbo performs best, achieving a 70% accuracy rate; GPT-4 Turbo follows closely, producing correct code for 6 of the 10 task descriptions. The study also stresses that LLMs should be evaluated not only for their technical capabilities but also for their societal impact and potential risks.

Future work includes integrating the Mostly Basic Programming Problems (MBPP) benchmark to refine the evaluation, expanding the set of input descriptions to 50 for a more comprehensive analysis, and sharing the model with twenty practitioners to gather feedback for further improvement. The study contributes to the ongoing discourse on the practical applications of LLMs, aims to guide stakeholders in making informed decisions when integrating these models into development workflows, and underscores both the need for robust evaluation frameworks and the potential of LLMs to automate code generation and software development.
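To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the multi-agent setup: several retrieval agents send the same task description to different LLMs, and a verification agent checks each returned snippet against HumanEval-style unit tests. The ModelClient callable, the agent names, and the verify helper are illustrative assumptions; real agents would call the respective model APIs and run the tests in a sandbox.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model client" is any callable mapping a task description to generated
# source code. In the paper each agent wraps a specific LLM (GPT-3.5, GPT-4,
# Google Bard, LLaMA, ...); here the clients are stand-in lambdas.
ModelClient = Callable[[str], str]


@dataclass
class CodeSample:
    model_name: str
    code: str


def collect_code(description: str, clients: Dict[str, ModelClient]) -> List[CodeSample]:
    """Retrieval agents: each one sends the same description to its LLM."""
    return [CodeSample(name, client(description)) for name, client in clients.items()]


def verify(sample: CodeSample, test_code: str) -> bool:
    """Verification agent: execute the candidate code, then the unit tests.

    HumanEval-style checking; a production setup would sandbox this exec call.
    """
    namespace: dict = {}
    try:
        exec(sample.code, namespace)   # define the candidate function(s)
        exec(test_code, namespace)     # asserts raise on failure
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # Illustrative stand-ins for real LLM calls.
    clients: Dict[str, ModelClient] = {
        "gpt-3.5-turbo": lambda d: "def add(a, b):\n    return a + b",
        "gpt-4-turbo": lambda d: "def add(a, b):\n    return a - b",  # deliberately wrong
    }
    tests = "assert add(2, 3) == 5"
    for sample in collect_code("Write add(a, b) returning the sum.", clients):
        print(sample.model_name, "passed" if verify(sample, tests) else "failed")
```

For reference, the pass@k metric used to score accuracy is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); the sketch below assumes that convention, where n code samples are drawn per task and c of them pass all unit tests.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: number of samples that pass all unit tests
    k: sample budget considered
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n = 10 samples per task, c = 7 passing, and k = 1, the estimator returns 0.7, matching the kind of per-model accuracy figures reported in the paper.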