Vidur is a large-scale, high-fidelity simulation framework for LLM inference. It models individual LLM operators through a combination of experimental profiling and predictive modeling, and estimates end-to-end inference metrics such as latency and throughput for different workloads. Validated across a range of models, hardware, and cluster configurations, Vidur predicts request-level LLM inference performance with less than 9% error.

The paper makes three contributions. First, Vidur itself, an LLM inference simulator that predicts key performance metrics of interest with high fidelity. Second, Vidur-Bench, a benchmark suite comprising various workload patterns, schedulers, and serving frameworks, along with profiling information for popular hardware such as A100 and H100 GPUs. Third, Vidur-Search, a configuration search tool that helps LLM inference providers optimize their deployments by automatically identifying the highest throughput-per-dollar configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in about one hour on a CPU machine, whereas an equivalent deployment-based exploration would require roughly 42K GPU hours, costing around 218K dollars.
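To make the operator-modeling idea concrete, below is a minimal Python sketch of the general technique the summary describes: profile an operator at a handful of input shapes, fit a regression model to the measured runtimes, and let the simulator query the model for unprofiled configurations instead of running on a GPU. The feature set, sample values, and the choice of a random-forest regressor are illustrative assumptions, not Vidur's actual implementation.

```python
# Hypothetical sketch of profiling-based operator runtime prediction.
# All numbers below are made-up placeholders, not measured data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Suppose we profiled one LLM operator at several (batch size, sequence
# length) points, recording runtimes in milliseconds.
profiled_features = np.array([
    [1, 128], [1, 512], [4, 128], [4, 512], [16, 128], [16, 512],
])
profiled_runtime_ms = np.array([0.8, 2.9, 2.7, 10.5, 10.1, 41.0])

# Fit a predictive model to the profiled samples.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(profiled_features, profiled_runtime_ms)

# The simulator can now estimate the runtime of an unprofiled configuration
# without touching a GPU; the prediction interpolates the profiled points.
print(model.predict([[8, 256]]))
```

Summing such per-operator estimates over a simulated request schedule is what lets a CPU-only run stand in for expensive GPU experiments.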
The paper also discusses the challenges of simulating LLM inference: the need for high-fidelity predictions, iteration times that vary with batch composition, and the risk of small per-iteration errors cascading into large end-to-end errors. The evaluation, which spans a wide range of models, hardware configurations, and workloads, shows that Vidur maintains high fidelity in almost all scenarios even with the request rate set to 85% of system capacity.

A what-if analysis built on Vidur-Search examines how a configuration's performance changes with the workload and how Service Level Objective (SLO) requirements affect the cost of serving. The results show that the optimal configuration can vary significantly depending on the workload, and that even models of similar size can have very different performance characteristics due to variation in architectural details. The paper concludes that Vidur is a valuable tool for LLM inference optimization, providing high-fidelity performance predictions and enabling efficient deployment strategies.
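To illustrate the kind of search Vidur-Search performs, here is a minimal, self-contained Python sketch: enumerate candidate deployment configurations, discard those whose simulated tail latency violates the SLO, and rank the rest by throughput per dollar. Every configuration name, capacity, latency, and price below is a hypothetical placeholder, not a result from the paper.

```python
# Hypothetical sketch of a throughput-per-dollar configuration search
# under an SLO constraint. All values are illustrative.
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    num_gpus: int
    price_per_gpu_hour: float  # USD per GPU-hour
    capacity_qps: float        # max request rate sustained in simulation
    p99_latency_s: float       # simulated tail latency at that rate

candidates = [
    Config("TP8 A100",     8, 3.0, 12.0, 4.8),
    Config("TP4xPP2 A100", 8, 3.0, 14.5, 6.1),
    Config("TP4 H100",     4, 5.0, 11.0, 3.9),
]

SLO_P99_S = 5.0  # application latency constraint

def qps_per_dollar(c: Config) -> float:
    # Throughput normalized by hourly cluster cost.
    return c.capacity_qps / (c.num_gpus * c.price_per_gpu_hour)

# Keep only configurations that meet the SLO, then pick the cheapest
# per unit of throughput.
feasible = [c for c in candidates if c.p99_latency_s <= SLO_P99_S]
best = max(feasible, key=qps_per_dollar)
print(best.name, round(qps_per_dollar(best), 3))  # -> TP4 H100 0.55
```

In the actual tool, the capacity and latency values would come from Vidur simulations rather than hand-entered constants, which is what makes exploring thousands of configurations feasible on a single CPU machine.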