WILDBENCH: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild


7 Jun 2024 | Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Ronan Le Bras, Yejin Choi
WildBench is an automated evaluation framework for benchmarking large language models (LLMs) on challenging, real-world user queries. It consists of 1,024 tasks selected from over one million human-chatbot conversation logs in the WildChat dataset, ensuring diversity and real-world relevance. The framework uses two metrics: WB-Reward, which evaluates responses through pairwise comparisons against baseline models with five possible outcomes (much better, slightly better, tie, slightly worse, much worse), and WB-Score, which assesses the quality of individual responses. Evaluation is guided by task-specific checklists, and length bias is mitigated through a simple penalty method.

WB-Reward achieves a strong correlation (0.98) with human-voted Elo ratings from Chatbot Arena, surpassing other benchmarks, and WB-Score also correlates highly (0.95) with human evaluations. WildBench provides a dynamic, in-the-wild benchmark that reflects real user interactions, is updated regularly, and is designed to be realistic and contamination-resilient. The benchmark includes a detailed leaderboard and evaluation methods for assessing model performance across diverse tasks.
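To make the pairwise metric concrete, here is a minimal sketch of how a WB-Reward-style score with a length penalty could be computed. The reward values, the character-difference threshold k, and the helper names are illustrative assumptions for this sketch, not the exact settings used by the WildBench authors.

```python
# Sketch of a WB-Reward-style pairwise metric with a simple length penalty.
# Outcome-to-reward mapping and the threshold k are assumptions for illustration.

# Map each pairwise judgment to a reward for the test model vs. a baseline.
OUTCOME_REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def length_penalized_reward(outcome: str, test_len: int,
                            baseline_len: int, k: int = 500) -> float:
    """Reward for one task, demoting wins earned with a much longer response."""
    reward = OUTCOME_REWARD[outcome]
    # Hypothetical length-bias rule: if the winning side's response is longer
    # than the other side's by more than k characters, count the task as a tie.
    if reward > 0 and test_len - baseline_len > k:
        return 0.0
    if reward < 0 and baseline_len - test_len > k:
        return 0.0
    return reward


def wb_reward(judgments: list[dict]) -> float:
    """Average the per-task rewards into a single aggregate score."""
    rewards = [
        length_penalized_reward(j["outcome"], j["test_len"], j["baseline_len"])
        for j in judgments
    ]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    demo = [
        {"outcome": "much_better", "test_len": 820, "baseline_len": 760},
        {"outcome": "slightly_better", "test_len": 2400, "baseline_len": 900},  # demoted to tie
        {"outcome": "tie", "test_len": 500, "baseline_len": 520},
    ]
    print(f"WB-Reward (sketch): {wb_reward(demo):+.3f}")
```

In this sketch, the second judgment's win is converted to a tie because the test response exceeds the baseline by more than k characters, which is one simple way to discourage verbosity from inflating pairwise wins.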