Feb 2024 | Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarakcal, Minsuk Chang, Michael Terry, Lucas Dixon
LLM Comparator is a novel interactive visual analytics tool designed to help researchers and engineers analyze the results of automatic side-by-side evaluations of large language models (LLMs). The tool enables users to compare the performance of two models, understand when and why one performs better than the other, and explore how their responses differ. It provides an interactive table for detailed inspection of individual examples and a visualization summary that supports analytical workflows. Its features include overlapping word highlights, rationale summaries, color coding, and visualizations of score distributions, win rates, and rationale clusters, along with n-gram analysis and custom functions for deeper insights into model responses.

The tool was developed through iterative collaboration with researchers and engineers at Google and has been successfully integrated into evaluation pipelines for large teams, attracting over 400 users within three months and facilitating the analysis of over 1,000 distinct side-by-side experiments. An observational study with six participants showed that the tool helps users form hypotheses about automatic ratings, verify known model behaviors, and analyze qualitative differences between model responses. The study also highlighted directions for further improvement, such as LLM-based custom metrics, pre-configured undesirable patterns, and enhanced rationale clustering. The tool is part of a broader effort to improve the interpretability and scalability of LLM evaluations.
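The summary names these aggregate views but does not spell out how they are computed. As a minimal sketch, assuming a simple list of per-example judge scores (a convention where positive scores favor model A, negative scores favor model B, and zero is a tie; this is an illustrative assumption, not the tool's actual data schema), the following Python shows the kind of win-rate and n-gram comparisons such a dashboard aggregates:

    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class SxSExample:
        # One side-by-side example: a prompt, two model responses, and a judge score.
        # Convention assumed here (not the tool's schema): score > 0 favors model A,
        # score < 0 favors model B, and 0 is a tie.
        prompt: str
        response_a: str
        response_b: str
        score: float

    def win_rate(examples):
        # Fraction of examples where model A is preferred; ties count as half a win.
        if not examples:
            return 0.0
        wins = sum(1.0 if ex.score > 0 else 0.5 if ex.score == 0 else 0.0
                   for ex in examples)
        return wins / len(examples)

    def ngram_counts(text, n=2):
        # Lowercased word n-gram counts for a single response.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def top_ngram_differences(examples, n=2, k=5):
        # N-grams that occur far more often in model A's responses than in model B's
        # (and vice versa); a crude stand-in for the tool's n-gram analysis feature.
        counts_a, counts_b = Counter(), Counter()
        for ex in examples:
            counts_a.update(ngram_counts(ex.response_a, n))
            counts_b.update(ngram_counts(ex.response_b, n))
        diff = {g: counts_a[g] - counts_b[g] for g in set(counts_a) | set(counts_b)}
        favors_a = sorted(diff.items(), key=lambda kv: -kv[1])[:k]
        favors_b = sorted(diff.items(), key=lambda kv: kv[1])[:k]
        return favors_a, favors_b

    if __name__ == "__main__":
        data = [
            SxSExample("Summarize the report.",
                       "Sure, here is a brief summary of the report ...",
                       "The report finds that ...", score=1.5),
            SxSExample("Translate this to French.",
                       "Sure, here is the translation ...",
                       "Voici la traduction ...", score=-0.5),
        ]
        print(f"Model A win rate: {win_rate(data):.2f}")
        print(top_ngram_differences(data, n=2, k=3))

In the actual tool, these aggregates are computed over the outputs and rationales of an automatic side-by-side rater and explored interactively; the sketch above only mirrors the underlying arithmetic, not the interface.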