30 May 2024 | Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
Parrot is an LLM service system that improves the end-to-end performance of LLM-based applications by exposing application-level knowledge to public LLM services. Its core abstraction, the Semantic Variable, annotates the input and output variables of a prompt, turning a sequence of otherwise opaque requests into a data pipeline whose inter-request correlations are visible to the service. With this visibility, a public LLM service can perform conventional dataflow analysis over the requests and optimize the application end to end.

Parrot's manager schedules LLM requests at the cluster level using this application-level knowledge. Its optimizations include serving dependent requests efficiently without client round-trips, deducing each application's performance objective, sharing common prompt prefixes to eliminate redundant computation, and application-centric scheduling that meets diverse performance objectives. In evaluations, Parrot achieves up to 11.7× speedup or 12× higher throughput compared to state-of-the-art solutions.

Parrot is implemented in Python as a front-end plus manager, with an LLM engine built on efficient kernels from vLLM, xFormers, and other projects. It provides APIs for submitting and retrieving Semantic Variables and includes a kernel optimization that accelerates attention decoding. The design opens further optimization opportunities for LLM-based applications, such as handling dynamic applications, job failures, fairness, and heterogeneous clusters, and it is compatible with existing LLM orchestration frameworks while remaining extensible to new optimizations.
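To make the Semantic Variable idea concrete, here is a minimal Python sketch (all names are hypothetical, not Parrot's actual API): each prompt declares named input/output placeholders, and linking one request's output variable to another's input exposes the dependency to the service without any client round-trip.

```python
# Hypothetical illustration of Semantic Variables (not Parrot's real API):
# prompts declare named placeholders, and shared variables form a
# dataflow DAG between requests that the service can analyze.

class SemanticVariable:
    """A named placeholder produced by one request, consumed by others."""
    def __init__(self, name, value=None):
        self.name = name
        self.value = value
        self.consumers = []  # requests waiting on this variable

class Request:
    def __init__(self, template, inputs, output):
        self.template = template  # prompt text with {placeholders}
        self.inputs = inputs      # SemanticVariables this request reads
        self.output = output      # SemanticVariable this request fills
        for v in inputs:
            v.consumers.append(self)

    def ready(self):
        # A request can run once all of its inputs have values.
        return all(v.value is not None for v in self.inputs)

    def render(self):
        return self.template.format(**{v.name: v.value for v in self.inputs})

# Two dependent requests: the summary produced by r1 feeds r2.
doc = SemanticVariable("doc", "LLM services hide application structure.")
summary = SemanticVariable("summary")
translation = SemanticVariable("translation")

r1 = Request("Summarize: {doc}", [doc], summary)
r2 = Request("Translate to French: {summary}", [summary], translation)

# The service sees that r2 depends on r1 before either request runs.
assert r1.ready() and not r2.ready()
```

Because the dependency is declared up front, the service can execute r1 and immediately feed its output into r2 on the server side, rather than returning the intermediate result to the client.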
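The prefix-sharing optimization rests on a simple observation: requests that begin with the same prompt text (for example, a common system prompt) can reuse one computation of the attention states for that shared prefix. A toy sketch of the grouping step, under the assumption of exact string prefixes:

```python
import os

def group_by_shared_prefix(prompts):
    """Toy illustration of prefix sharing: requests with a common prompt
    prefix can reuse a single KV-cache computation for that prefix.
    Returns (shared_prefix, per-request suffixes)."""
    shared = os.path.commonprefix(prompts)  # character-wise common prefix
    return shared, [p[len(shared):] for p in prompts]

system = "You are a helpful assistant. Answer concisely.\n"
prompts = [system + "Q: What is 2+2?", system + "Q: Capital of France?"]
prefix, suffixes = group_by_shared_prefix(prompts)
# The prefix is computed once; only the suffixes need fresh attention states.
```

In a real engine this grouping happens over tokenized prompts and cached KV blocks rather than raw strings, but the saving is the same: the shared prefix is prefilled once instead of once per request.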
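Application-centric scheduling can likewise be sketched in a few lines (a deliberately simplified policy, not Parrot's implementation): requests whose application is latency-sensitive are dispatched first, while throughput-oriented requests are grouped into one batch to raise utilization.

```python
# Toy application-centric scheduler (hypothetical policy, for illustration):
# latency-sensitive requests jump the queue; throughput-oriented requests
# are coalesced into a single batch.

def dispatch(requests):
    """requests: list of (objective, prompt) pairs, where objective is
    'latency' or 'throughput'. Returns the dispatch order: individual
    latency-sensitive prompts first, then one combined batch."""
    latency = [p for obj, p in requests if obj == "latency"]
    batch = [p for obj, p in requests if obj == "throughput"]
    return latency + ([batch] if batch else [])

order = dispatch([
    ("throughput", "summarize doc 1"),
    ("latency", "chat turn"),
    ("throughput", "summarize doc 2"),
])
# The chat turn is served first; the two summaries form one batch.
```

The point is that the service can only make this distinction because the application's objective is visible to it; a request-level API with no application context would have to treat all three requests identically.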