17 Jun 2024 | Tianle Li*, Wei-Lin Chiang*, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. González, Ion Stoica
The paper introduces BenchBuilder, a data pipeline that automatically constructs high-quality benchmarks from live crowdsourced data such as Chatbot Arena. BenchBuilder identifies seven key qualities of a high-quality prompt: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. It uses an LLM annotator to select prompts that meet these criteria, yielding a benchmark called Arena-Hard-Auto v0.1, which contains 500 challenging prompts spanning a wide range of tasks. The benchmark achieves 89.1% agreement with human preference rankings and offers significantly better separability than existing benchmarks such as MT-Bench, with 3x tighter confidence intervals. BenchBuilder is fully automated, costing only $25 and requiring no human labelers, and because it draws from a continuously refreshed data source, the resulting benchmarks stay up-to-date and relevant while helping avoid overfitting and test leakage. The paper also proposes three novel metrics for evaluating benchmark quality: Separability with Confidence, Agreement with Confidence Interval, and Pair Rank Brier Score. Together, these metrics provide a comprehensive assessment of benchmark performance, balancing the need for clear differentiation between models with alignment to human preferences. The results show that Arena-Hard-Auto v0.1 is a cost-effective and informative evaluation benchmark, and that BenchBuilder is a valuable tool for developers to automatically generate high-quality benchmarks from vast data sources.
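To make the filtering step concrete, here is a minimal sketch (not the authors' released code) of how an LLM annotator could score a crowdsourced prompt against the seven key qualities and keep only prompts that satisfy enough of them. The model name, JSON-mode request, helper names, and the `min_qualities` threshold are all assumptions for illustration; the paper's exact annotation prompt and cutoff differ.

```python
# Hypothetical sketch of a BenchBuilder-style prompt filter.
import json
from openai import OpenAI  # pip install openai

QUALITIES = [
    "specificity", "domain knowledge", "complexity", "problem-solving",
    "creativity", "technical accuracy", "real-world application",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def annotate_prompt(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM judge which of the seven qualities the prompt exhibits."""
    instruction = (
        "For the user prompt below, return a JSON object mapping each of these "
        f"qualities to true or false: {', '.join(QUALITIES)}.\n\nPrompt:\n{prompt}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instruction}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def is_high_quality(prompt: str, min_qualities: int = 6) -> bool:
    """Keep a prompt only if the annotator marks at least `min_qualities` criteria."""
    labels = annotate_prompt(prompt)
    return sum(bool(labels.get(q, False)) for q in QUALITIES) >= min_qualities

# Example: filter a batch of crowdsourced prompts.
# hard_prompts = [p for p in candidate_prompts if is_high_quality(p)]
```

Similarly, the Pair Rank Brier Score can be illustrated with a short, self-contained sketch: assume the benchmark produces, for each model pair, an estimated probability (e.g., via bootstrapping benchmark scores) that one model outranks the other, and the ground truth is the human-preference ranking from Chatbot Arena. The input format and variable names below are assumptions; the idea is simply the mean squared error between those forecasts and the actual pairwise outcomes, so lower is better.

```python
# Minimal sketch of a Pair Rank Brier Score computation.
def pair_rank_brier_score(forecasts: dict, human_ranking: list) -> float:
    """
    forecasts: {(model_a, model_b): p} where p is the benchmark's estimated
               probability that model_a ranks above model_b.
    human_ranking: models ordered best-to-worst by human preference.
    """
    rank = {m: i for i, m in enumerate(human_ranking)}
    errors = []
    for (a, b), p in forecasts.items():
        outcome = 1.0 if rank[a] < rank[b] else 0.0  # 1 if a truly outranks b
        errors.append((p - outcome) ** 2)
    return sum(errors) / len(errors)

# Toy usage: the confident, correct first forecast contributes almost no error,
# while the mis-ranked second pair is penalized.
forecasts = {("model-x", "model-y"): 0.97, ("model-z", "model-y"): 0.65}
print(pair_rank_brier_score(forecasts, ["model-x", "model-y", "model-z"]))
```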