7 Jun 2024 | Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
**WildBench** is an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. The framework consists of 1,024 tasks selected from over one million human-chatbot conversation logs. To evaluate these tasks, WildBench introduces two metrics: WB-Reward and WB-Score. WB-Reward uses pairwise comparisons between model responses, with five potential outcomes, and employs three baseline models to ensure a comprehensive evaluation. WB-Score individually evaluates the quality of each model's output, providing a fast and cost-efficient metric. The evaluation process includes task-specific checklists to systematically assess model outputs and structured explanations to justify scores and comparisons. WildBench demonstrates strong correlations with human-voted Elo ratings from Chatbot Arena, achieving a Pearson correlation of 0.98 for WB-Reward and 0.95 for WB-Score. The benchmark is dynamic and regularly updated to reflect new types of user interactions, ensuring it remains relevant and reflective of the evolving capabilities of LLMs.
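
To make the two metrics concrete, here is a minimal sketch of how the aggregate numbers might be computed from per-task judgments. The mapping of the five pairwise outcomes to reward values (+1, +0.5, 0, -0.5, -1) and the 1-10 per-task score scale are assumptions used for illustration, not the official WildBench scoring code; the baseline names and example data are hypothetical.

```python
from statistics import mean

# Assumed mapping of the five pairwise outcomes to reward values (not official).
OUTCOME_REWARD = {
    "much_better":     1.0,
    "slightly_better": 0.5,
    "tie":             0.0,
    "slightly_worse": -0.5,
    "much_worse":     -1.0,
}

def wb_reward(pairwise_outcomes: dict[str, list[str]]) -> float:
    """Average reward of a model over all tasks and all baseline models.

    `pairwise_outcomes` maps each baseline model name to a list of
    per-task outcomes (one of the five keys in OUTCOME_REWARD).
    """
    rewards = [
        OUTCOME_REWARD[outcome]
        for outcomes in pairwise_outcomes.values()
        for outcome in outcomes
    ]
    return mean(rewards)

def wb_score(task_scores: list[float]) -> float:
    """Average per-task quality score assigned by the LLM judge
    (assumed here to be on a 1-10 scale)."""
    return mean(task_scores)

# Hypothetical example: comparisons against three baselines on three tasks.
outcomes = {
    "baseline_a": ["much_better", "tie", "slightly_worse"],
    "baseline_b": ["slightly_better", "slightly_better", "tie"],
    "baseline_c": ["tie", "much_worse", "slightly_better"],
}
print(f"WB-Reward: {wb_reward(outcomes):+.3f}")
print(f"WB-Score:  {wb_score([8, 6.5, 9]):.2f}")
```

Under these assumptions, WB-Reward rewards a model for how often (and how decisively) it beats the baselines, while WB-Score needs only one judge call per response, which is what makes it the faster and cheaper of the two metrics.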