The paper introduces the Copilot Evaluation Harness, a comprehensive framework for evaluating the performance of Large Language Models (LLMs) in Integrated Development Environments (IDEs). The harness is designed to assess LLM-guided interactions across a range of programming scenarios and languages, covering code generation, documentation generation, bug fixing, test case generation, and workspace understanding. The authors propose new metrics that are more robust and information-dense than those of previous evaluation systems, and use them to evaluate three prominent LLMs (OpenAI's GPT-3.5 and GPT-4, and Meta's Code Llama), providing insights into each model's adaptability and effectiveness on real-world development tasks. The framework also allows IDE parameters to be tuned to optimize LLM integration and performance across programming contexts. Finally, the paper discusses the limitations of existing evaluation methods and highlights the need for a more comprehensive approach to assessing LLMs in software engineering tasks.
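To make the idea of such a harness concrete, the sketch below shows a minimal evaluation loop for one scenario (code generation), scoring a model completion on syntax validity and unit-test pass rate. This is only an illustration under assumptions, not the paper's actual implementation: the names (`evaluate_completion`, `EvalResult`) are hypothetical, and a pytest-based test run stands in for the paper's scenario-specific metrics.

```python
import ast
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalResult:
    """Outcome of scoring one model completion (hypothetical schema)."""
    syntax_ok: bool
    tests_passed: bool


def check_syntax(code: str) -> bool:
    """Return True if the generated Python code parses without errors."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def run_unit_tests(code: str, test_code: str) -> bool:
    """Write the candidate code and its tests to a temp dir and run pytest.

    Assumes pytest is installed; returns True when all tests pass.
    """
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "candidate.py").write_text(code)
        Path(tmp, "test_candidate.py").write_text(test_code)
        proc = subprocess.run(["pytest", "-q", tmp], capture_output=True, text=True)
        return proc.returncode == 0


def evaluate_completion(code: str, test_code: str) -> EvalResult:
    """Score a single completion: syntax check first, then functional correctness."""
    syntax_ok = check_syntax(code)
    tests_passed = syntax_ok and run_unit_tests(code, test_code)
    return EvalResult(syntax_ok=syntax_ok, tests_passed=tests_passed)


if __name__ == "__main__":
    # Toy example: a model-generated function and a test exercising it.
    candidate = "def add(a, b):\n    return a + b\n"
    tests = "from candidate import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
    print(evaluate_completion(candidate, tests))
```

In practice, a harness of this kind would aggregate such per-sample results over many repositories and languages to produce the scenario-level metrics the paper reports; the exact metric definitions for documentation generation, bug fixing, and workspace understanding differ per scenario.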