Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

April 27-May 1, 2024, La Jolla, CA, USA | Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitaraman, Thomas Woo
Proteus is a high-throughput inference-serving system that handles varying query demand through accuracy scaling: it adapts the accuracy of the ML models being served instead of adding hardware resources. The goal is to maximize throughput while keeping accuracy high. To do so, Proteus addresses three key challenges: model selection, model placement, and query assignment. It uses a mixed integer linear programming (MILP) framework to optimize resource allocation for the target query demand. Proteus also introduces an adaptive batching algorithm that handles variations in query arrival times and minimizes service-level objective (SLO) violations. Empirical evaluations on real-world and synthetic traces show that Proteus reduces accuracy drop by up to 3× and latency timeouts by 2-10× compared to baseline schemes, while meeting throughput requirements. Compared to baselines that do not scale accuracy, it improves throughput by 60% and reduces SLO violations by 10×.
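The trade-off behind accuracy scaling can be illustrated with a toy allocation problem. The sketch below is not Proteus's MILP formulation: it swaps in a brute-force search over placements, which is only practical for a handful of devices, and every name and number in it (`VARIANTS`, `assign_queries`, `plan`, the accuracy and throughput figures) is a hypothetical chosen for illustration. It assumes identical devices, one model variant per device, and a single query type.

```python
from itertools import combinations_with_replacement

# Hypothetical model variants: name -> (accuracy in %, peak queries/sec per device).
# Illustrative numbers only; they are not taken from the paper.
VARIANTS = {
    "small": (70.0, 400.0),
    "medium": (76.0, 200.0),
    "large": (80.0, 100.0),
}


def assign_queries(placement, demand_qps):
    """Query assignment for a fixed placement.

    Fills the most accurate hosted variants first, which maximizes the
    average accuracy over all served queries. Returns
    (assignment, effective_accuracy), or None if the placement cannot
    sustain the demand.
    """
    remaining = demand_qps
    weighted_accuracy = 0.0
    assignment = []
    for name in sorted(placement, key=lambda n: VARIANTS[n][0], reverse=True):
        accuracy, capacity = VARIANTS[name]
        served = min(capacity, remaining)
        assignment.append((name, served))
        weighted_accuracy += served * accuracy
        remaining -= served
    if remaining > 1e-9:
        return None  # not enough total throughput for this demand
    return assignment, weighted_accuracy / demand_qps


def plan(num_devices, demand_qps):
    """Model selection and placement by exhaustive search.

    Tries every multiset of variants across identical devices and keeps
    the feasible one with the highest effective accuracy.
    """
    if demand_qps <= 0:
        raise ValueError("demand_qps must be positive")
    best = None
    for placement in combinations_with_replacement(VARIANTS, num_devices):
        result = assign_queries(placement, demand_qps)
        if result is not None and (best is None or result[1] > best[1]):
            best = result
    return best


if __name__ == "__main__":
    for demand in (150, 300, 800, 900):
        print(demand, plan(2, demand))
```

With these made-up numbers and two devices, a demand of 150 queries/sec is served entirely by the large variant, 300 queries/sec needs a mix of large and medium, 800 queries/sec forces both devices onto the small variant, and 900 queries/sec is infeasible. Accuracy degrades gradually as demand rises while the hardware stays fixed, which is the behavior the summary attributes to accuracy scaling. A real system would hand the same decision variables to a MILP solver rather than enumerate them.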