Vidur is a large-scale, high-fidelity simulation framework for LLM inference. It models individual LLM operators through a combination of experimental profiling and predictive modeling, and estimates end-to-end inference metrics such as latency and throughput for different workloads. Validated across a range of models, hardware, and cluster configurations, Vidur predicts request-level LLM inference performance with less than 9% error.

The paper makes three contributions. First, Vidur itself, an LLM inference simulator that predicts key performance metrics of interest with high fidelity. Second, Vidur-Bench, a benchmark suite comprising various workload patterns, schedulers, and serving frameworks, along with profiling information for popular hardware such as A100 and H100 GPUs. Third, Vidur-Search, a configuration search tool that helps LLM inference providers optimize their deployments by automatically identifying the highest throughput-per-dollar configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in about one hour on a CPU machine, whereas an equivalent deployment-based exploration would require roughly 42K GPU hours, costing around 218K dollars.
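To make the operator-modeling idea concrete, below is a minimal Python sketch of the general technique the summary describes: profile an operator at a handful of input shapes, fit a regression model to the measured runtimes, and let the simulator query the model for unprofiled configurations instead of running on a GPU. The feature set, sample values, and the choice of a random-forest regressor are illustrative assumptions, not Vidur's actual implementation.

```python
# Hypothetical sketch of profiling-based operator runtime prediction.
# All numbers below are made-up placeholders, not measured data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Suppose we profiled one LLM operator at several (batch size, sequence
# length) points, recording runtimes in milliseconds.
profiled_features = np.array([
    [1, 128], [1, 512], [4, 128], [4, 512], [16, 128], [16, 512],
])
profiled_runtime_ms = np.array([0.8, 2.9, 2.7, 10.5, 10.1, 41.0])

# Fit a predictive model to the profiled samples.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(profiled_features, profiled_runtime_ms)

# The simulator can now estimate the runtime of an unprofiled configuration
# without touching a GPU; the prediction interpolates the profiled points.
print(model.predict([[8, 256]]))
```

Summing such per-operator estimates over a simulated request schedule is what lets a CPU-only run stand in for expensive GPU experiments.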
The paper also discusses the challenges of simulating LLM inference: the need for high-fidelity predictions, iteration times that vary with batch composition, and the risk of small per-iteration errors cascading into large end-to-end errors. The evaluation, which spans a wide range of models, hardware configurations, and workloads, shows that Vidur maintains high fidelity in almost all scenarios even with the request rate set to 85% of system capacity.

A what-if analysis built on Vidur-Search examines how a configuration's performance changes with the workload and how Service Level Objective (SLO) requirements affect the cost of serving. The results show that the optimal configuration can vary significantly depending on the workload, and that even models of similar size can have very different performance characteristics due to variation in architectural details. The paper concludes that Vidur is a valuable tool for LLM inference optimization, providing high-fidelity performance predictions and enabling efficient deployment strategies.
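To illustrate the kind of search Vidur-Search performs, here is a minimal, self-contained Python sketch: enumerate candidate deployment configurations, discard those whose simulated tail latency violates the SLO, and rank the rest by throughput per dollar. Every configuration name, capacity, latency, and price below is a hypothetical placeholder, not a result from the paper.

```python
# Hypothetical sketch of a throughput-per-dollar configuration search
# under an SLO constraint. All values are illustrative.
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    num_gpus: int
    price_per_gpu_hour: float  # USD per GPU-hour
    capacity_qps: float        # max request rate sustained in simulation
    p99_latency_s: float       # simulated tail latency at that rate

candidates = [
    Config("TP8 A100",     8, 3.0, 12.0, 4.8),
    Config("TP4xPP2 A100", 8, 3.0, 14.5, 6.1),
    Config("TP4 H100",     4, 5.0, 11.0, 3.9),
]

SLO_P99_S = 5.0  # application latency constraint

def qps_per_dollar(c: Config) -> float:
    # Throughput normalized by hourly cluster cost.
    return c.capacity_qps / (c.num_gpus * c.price_per_gpu_hour)

# Keep only configurations that meet the SLO, then pick the cheapest
# per unit of throughput.
feasible = [c for c in candidates if c.p99_latency_s <= SLO_P99_S]
best = max(feasible, key=qps_per_dollar)
print(best.name, round(qps_per_dollar(best), 3))  # -> TP4 H100 0.55
```

In the actual tool, the capacity and latency values would come from Vidur simulations rather than hand-entered constants, which is what makes exploring thousands of configurations feasible on a single CPU machine.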