April 27-May 1, 2024 | Sohaib Ahmad, Hui Guan, Brian D. Friedman, Thomas Williams, Ramesh K. Sitaraman
Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
This paper presents Proteus, a high-throughput inference-serving system that leverages accuracy scaling to meet varying query demands while maximizing system accuracy. Unlike traditional systems that rely on hardware scaling, Proteus adapts model accuracy instead of hardware resources to handle query demands. The system jointly optimizes three sub-problems: model selection, model placement, and query assignment. It also proposes a new adaptive batching algorithm to handle variations in query arrival times and minimize SLO violations. Proteus is evaluated on real-world and synthetic traces, showing that it reduces accuracy drop by up to 3× and latency timeouts by 2-10× compared to baseline schemes while meeting throughput requirements.
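The joint optimization described above can be illustrated with a minimal greedy sketch: given a set of model variants (each trading accuracy for throughput), choose a variant for each device so that aggregate throughput covers the query demand while keeping accuracy as high as possible. This is an illustrative simplification, not Proteus's actual algorithm; the `Variant` fields and the greedy strategy are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    accuracy: float    # top-1 accuracy (fraction); illustrative values
    throughput: float  # queries/sec one device can sustain at this variant

def select_and_place(variants, num_devices, demand):
    """Greedy sketch: give each device the most accurate variant such
    that the remaining devices can still cover the residual demand."""
    variants = sorted(variants, key=lambda v: v.accuracy, reverse=True)
    max_tput = max(v.throughput for v in variants)
    placement, remaining = [], demand
    for i in range(num_devices):
        devices_left = num_devices - i
        for v in variants:
            # Feasibility check: even after picking v, the fastest
            # variant on all remaining devices must cover the demand.
            if v.throughput + (devices_left - 1) * max_tput >= remaining:
                placement.append(v.name)
                remaining = max(0.0, remaining - v.throughput)
                break
        else:
            return None  # demand infeasible even with the fastest variant
    return placement

# Hypothetical variants of one model family, 3 devices, 400 queries/sec demand:
plan = select_and_place(
    [Variant("resnet152", 0.78, 50),
     Variant("resnet50", 0.76, 120),
     Variant("resnet18", 0.70, 300)],
    num_devices=3, demand=400)
```

Under this sketch the planner keeps the most accurate variant on as many devices as possible and downgrades only the last device to a faster, less accurate variant to close the throughput gap; Proteus solves the full joint problem (selection, placement, and per-device query assignment) rather than this greedy approximation.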
The system is designed to handle varying query demands by selecting the appropriate model variants, placing them on heterogeneous devices, and assigning query workloads to each device. The adaptive batching algorithm dynamically determines the optimal batch size based on queue conditions to minimize SLO violations. Proteus decouples the control and data paths of inference-serving to perform optimal resource allocation asynchronously from query serving. This allows the system to handle micro-scale variations in query arrival times effectively.
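The non-work-conserving flavor of the batching policy can be sketched as follows: given the queue depth and the slack until the earliest deadline, pick the largest profiled batch size that still meets the SLO, but deliberately *wait* (dispatch nothing) when a larger batch could plausibly be filled and still finish in time. The batch sizes, the latency model, and the `wait_margin` heuristic are assumptions for illustration, not Proteus's profiled values or decision rule.

```python
BATCH_SIZES = (1, 2, 4, 8, 16, 32)  # illustrative profiled batch sizes

def choose_batch(queue_len, slack, latency_of, wait_margin=0.5):
    """Return a batch size to dispatch now, or None to keep waiting.

    queue_len  -- queries currently queued
    slack      -- seconds until the earliest queued query's deadline
    latency_of -- estimated execution latency for a given batch size
                  (a profiled function in a real system)
    """
    # Largest batch that both fits the queue and meets the deadline.
    feasible = [b for b in BATCH_SIZES
                if b <= queue_len and latency_of(b) <= slack]
    if not feasible:
        # Nothing queued -> wait; something queued but no batch can
        # meet the SLO -> signal 0 (caller handles the violation).
        return 0 if queue_len else None
    best = feasible[-1]
    # Non-work-conserving step: if the next-larger batch would still
    # finish well within the deadline, hold off and let the queue grow
    # instead of dispatching a small batch immediately.
    nxt = best * 2
    if nxt in BATCH_SIZES and queue_len < nxt \
            and latency_of(nxt) <= slack * wait_margin:
        return None
    return best
```

For example, with a hypothetical linear latency profile of 10 ms per query, a queue of 4 with 200 ms of slack yields `None` (wait for a batch of 8), while the same queue with only 50 ms of slack dispatches a batch of 4 right away.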
Proteus is evaluated against three state-of-the-art inference-serving systems, INFaaS, Sommelier, and Clipper, using query workloads derived from a real-world Twitter trace and synthetic traces. The experiments show that Proteus improves system throughput by 60% and reduces SLO violations by 10× due to accuracy scaling. Compared to baselines that also scale accuracy, Proteus reduces accuracy drop by up to 3.2× and SLO violations by up to 4.3× due to better resource allocation and batching algorithms.
The key contributions of this work include: (1) a theoretical framework for resource management of an inference-serving system that exploits accuracy scaling to ensure that the system throughput is sufficient to meet the query demand while maximizing system accuracy; (2) a proactive adaptive batching algorithm that can handle query load fluctuations effectively via a non-work-conserving approach; and (3) the design of Proteus, a high-throughput inference-serving system with accuracy scaling. Proteus is the first system to study accuracy scaling in a cluster setting. The system is evaluated on a production system used by actual users within a large enterprise and shows that it reduces accuracy drop by up to 3.2× and SLO violations by 2.8-10× compared to state-of-the-art baselines while meeting throughput requirements. The simulation results closely match the results from the production system.