17 Jun 2024 | Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. González, Ion Stoica
The paper introduces BenchBuilder, a data curation pipeline designed to automatically construct high-quality benchmarks from crowdsourced data, specifically from the Chatbot Arena platform. The goal is to address the limitations of static benchmarks, which often struggle to distinguish between models and to align with real-world user preferences. BenchBuilder identifies seven key indicators of high-quality prompts, such as specificity, domain knowledge, and complexity, and uses an LLM annotator to select a high-quality subset of prompts from various topic clusters.

The resulting benchmark, Arena-Hard-Auto v0.1, employs an LLM judge to keep the pipeline fully automated, high-quality, and continuously updatable. Arena-Hard-Auto v0.1 offers 5x tighter confidence intervals than MT-Bench and achieves 89.1% agreement with human preference rankings, all at a cost of only $25 per evaluation.

The paper also introduces metrics for assessing benchmark quality, including separability with confidence, agreement with confidence intervals, and the pair rank Brier score. The evaluation results show that Arena-Hard-Auto v0.1 significantly outperforms existing benchmarks in separability and alignment with human preferences. The authors conclude that BenchBuilder is a valuable tool for developers seeking to extract high-quality benchmarks from extensive data with minimal effort.
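To make the curation step concrete, here is a minimal sketch of the kind of prompt-scoring loop the paper describes: an LLM annotator checks each prompt against the seven quality indicators, and the number of satisfied criteria serves as a quality score used to filter each topic cluster. The criteria wording, the `llm_judge` callable, and the threshold below are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a BenchBuilder-style prompt filter (not the paper's code).

QUALITY_CRITERIA = [
    "specificity",
    "domain knowledge",
    "complexity",
    "problem-solving",
    "creativity",
    "technical accuracy",
    "real-world application",
]


def annotate_prompt(prompt: str, llm_judge) -> int:
    """Ask an LLM annotator whether the prompt satisfies each criterion;
    return the number of satisfied criteria (0-7) as a quality score."""
    score = 0
    for criterion in QUALITY_CRITERIA:
        question = (
            f"Does the following user prompt require {criterion}? "
            f"Answer YES or NO.\n\nPrompt: {prompt}"
        )
        answer = llm_judge(question)  # assumed: a callable that returns the model's text reply
        score += answer.strip().upper().startswith("YES")
    return score


def filter_cluster(prompts: list[str], llm_judge, threshold: int = 6) -> list[str]:
    """Keep only the prompts from a topic cluster that meet the quality threshold."""
    return [p for p in prompts if annotate_prompt(p, llm_judge) >= threshold]
```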
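The benchmark-quality metrics can be sketched in the same spirit. Separability with confidence counts the fraction of model pairs whose bootstrapped score confidence intervals do not overlap, and the pair rank Brier score measures how well the benchmark's predicted pairwise orderings match a human-preference reference ranking such as Chatbot Arena. The helpers below are an illustrative reconstruction under those assumptions, not the authors' released code.

```python
import numpy as np


def separability_with_confidence(boot_scores: dict[str, np.ndarray], alpha: float = 0.05) -> float:
    """Fraction of model pairs whose bootstrap confidence intervals do not overlap.
    boot_scores maps model name -> array of bootstrapped benchmark scores."""
    models = list(boot_scores)
    lo = {m: np.percentile(boot_scores[m], 100 * alpha / 2) for m in models}
    hi = {m: np.percentile(boot_scores[m], 100 * (1 - alpha / 2)) for m in models}
    separated, total = 0, 0
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            total += 1
            if lo[a] > hi[b] or lo[b] > hi[a]:
                separated += 1
    return separated / total


def pair_rank_brier_score(pred_prob: dict[tuple[str, str], float],
                          human_rank: dict[str, int]) -> float:
    """Mean squared error between the benchmark's predicted probability that
    model A outranks model B and the human-preference outcome
    (1 if A is ranked higher than B, else 0)."""
    errors = []
    for (a, b), p in pred_prob.items():
        outcome = 1.0 if human_rank[a] < human_rank[b] else 0.0
        errors.append((p - outcome) ** 2)
    return float(np.mean(errors))
```

A lower Brier score means the benchmark's pairwise predictions are better calibrated against the human ranking, which is how the paper compares Arena-Hard-Auto v0.1 to prior benchmarks.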