16 Feb 2024 | Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Michael Terry, Lucas Dixon, Minsuk Chang
**LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models**
**Abstract:**
Automatic side-by-side evaluation (AutoSxS) has emerged as a promising approach to assess the quality of responses from large language models (LLMs). However, analyzing the results of these evaluations poses scalability and interpretability challenges. This paper introduces *LLM Comparator*, an interactive visual analytics tool designed to address these challenges. The tool enables users to compare the performance of two LLMs, identifying when and why one model outperforms the other and how their responses differ qualitatively. It pairs an interactive table for detailed inspection of individual examples with a visualization summary that provides overviews and filtering options, and it supports slice-level performance analysis, rationale summaries, and n-gram and custom function analyses. *LLM Comparator* has been integrated into evaluation pipelines at Google, attracting over 400 users and facilitating the analysis of more than 1,000 experiments.
**Keywords:**
Visual analytics, generative AI, large language models, machine learning evaluation, side-by-side evaluation
**Introduction:**
Evaluating LLMs is crucial for model developers and researchers. Traditional methods, such as collecting human ratings, are costly and impractical at scale. AutoSxS, in which another LLM acts as a judge and compares two models' responses to the same prompt, offers a scalable alternative; however, interpreting its results remains challenging. *LLM Comparator* addresses this by providing interactive workflows for detailed analysis and interpretation of these evaluation results.
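To make the AutoSxS setup concrete, here is a minimal sketch of an LLM-as-judge comparison. The `call_judge_llm` function and the JSON verdict format are illustrative assumptions, not the paper's actual rater prompt or API.

```python
import json
from typing import Callable

# Prompt template for the judge LLM; the wording and output format are
# assumptions for illustration, not the prompt used in the paper.
JUDGE_TEMPLATE = """You are comparing two responses to the same prompt.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Reply in JSON with fields "winner" ("A", "B", or "tie") and "rationale"."""


def auto_sxs(prompt: str, response_a: str, response_b: str,
             call_judge_llm: Callable[[str], str]) -> dict:
    """Asks a judge LLM which response is better and why.

    `call_judge_llm` is a placeholder for whatever LLM API is available;
    it takes a prompt string and returns the model's raw text output.
    """
    raw = call_judge_llm(JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b))
    return json.loads(raw)  # e.g. {"winner": "A", "rationale": "..."}
```

Running this per example yields a table of winners and rationales, which is exactly the kind of raw output that is hard to interpret without further tooling.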
**Current Workflows & Design Goals:**
The paper describes current practices in LLM evaluation and the design goals derived from them for *LLM Comparator*: facilitate interactions between aggregated information and individual examples, provide workflows that answer practitioners' analytical questions, and enable analysis at scale.
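For example, the move from aggregated information to individual examples can be grounded in per-example judgments grouped by a slice such as prompt category. The sketch below assumes judgment records with a `winner` field and a user-chosen slice field; the field names are illustrative, not the tool's schema.

```python
from collections import defaultdict


def slice_win_rates(judgments: list[dict], slice_key: str) -> dict[str, float]:
    """Computes how often model A beats model B within each slice.

    Each judgment is assumed to look like
    {"category": "coding", "winner": "A" | "B" | "tie", ...}.
    """
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for j in judgments:
        s = j[slice_key]
        totals[s] += 1
        if j["winner"] == "A":
            wins[s] += 1
    return {s: wins[s] / totals[s] for s in totals}


# Example: win rates per prompt category.
rates = slice_win_rates(
    [{"category": "coding", "winner": "A"},
     {"category": "coding", "winner": "B"},
     {"category": "chat", "winner": "A"}],
    slice_key="category")
print(rates)  # {'coding': 0.5, 'chat': 1.0}
```

Aggregates like these become useful once a user can click through from a suspicious slice to the individual examples behind it, which is the interaction the design goals emphasize.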
**Implementation & Deployment:**
*LLM Comparator* is a web-based application implemented in Python and TypeScript. It has been deployed in Google's evaluation pipelines, where it has attracted over 400 users and supported the analysis of more than 1,000 experiments.
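As an illustration of the n-gram analysis mentioned in the abstract, frequent phrases in each model's responses can be surfaced with a simple counting pass. This is a minimal sketch under that assumption, not the tool's actual implementation.

```python
from collections import Counter


def top_ngrams(texts: list[str], n: int = 3, k: int = 10) -> list[tuple[str, int]]:
    """Returns the k most frequent word n-grams across a set of responses,
    e.g. to contrast phrases that model A uses more often than model B."""
    counts: Counter[str] = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(" ".join(words[i:i + n])
                      for i in range(len(words) - n + 1))
    return counts.most_common(k)


# Example: common trigrams in one model's responses.
print(top_ngrams(["as an ai language model i cannot",
                  "as an ai language model i will not"], n=3, k=3))
```

Comparing the resulting lists for the two models (or for winning versus losing responses) gives a quick qualitative signal about systematic phrasing differences.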
**Observational Study:**
An observational study with six participants from Google revealed usage patterns such as example-first deep dive, prior experience-based testing, and rationale-centric top-down exploration. Participants used the tool to form hypotheses, verify known model behaviors, and analyze qualitative differences between responses.
**Conclusion:**
*LLM Comparator* enables users to analyze automatic side-by-side evaluations of LLMs at scale, helping them understand when and why one model outperforms another and how the responses differ qualitatively. Its effectiveness is supported by user feedback from its deployment at Google and by the observational study.