Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services

25 Apr 2024 | Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury
Andes is a QoE-aware serving system designed to improve user experience in LLM-based text streaming services. The paper defines quality of experience (QoE) for text streaming by comparing a request's actual token delivery timeline (TDT) with its expected TDT, which captures how the user interacts with the service. Andes employs a dynamic priority-based preemptive scheduler that operates at token granularity: it allocates system resources to more urgent requests and preempts requests that have already received sufficient service. Andes also co-designs a client-side token buffer that temporarily withholds excess tokens and displays them at the user's expected pace, ensuring smooth token delivery.

Andes improves average QoE by up to 3.2× under high request rates, or sustains up to 1.6× higher request rates while preserving high QoE. The system is evaluated on the OPT family of models, ranging from 13B to 175B parameters, and outperforms existing serving systems such as vLLM in both QoE and request rate.

The system is designed to handle dynamic and unpredictable resource demand under constrained resource supply, and to optimize user experience by balancing a prolonged time to first token (TTFT) against excessively fast token generation. Andes addresses the limitations of existing solutions with a QoE-aware scheduling policy that prioritizes requests based on their QoE requirements and resource demand. The policy prioritizes requests with stringent TTFT requirements while monitoring the resource demand of each request, so that small requests are not starved of the resources they need. The system also supports preemption mechanisms to manage resource allocation efficiently.

The paper evaluates Andes under different workloads and setups, showing that it significantly improves QoE with negligible system overhead. Andes maintains token generation throughput similar to the baseline, with a minor drop as the request rate increases.
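
To make the idea of comparing an actual TDT with an expected TDT concrete, here is a minimal Python sketch. It is one simplified reading of the definition, not the paper's exact formula. It assumes the expected timeline is described by an expected TTFT and an expected token delivery speed, and that QoE is the ratio of the area under the actual delivery curve (capped by the expected curve) to the area under the expected curve. The function names `expected_times` and `qoe` are hypothetical.

```python
from typing import Sequence


def expected_times(n_tokens: int, ttft: float, tds: float) -> list[float]:
    """Expected delivery time of each token: the first at `ttft` seconds,
    then one token every 1/tds seconds (tds = tokens per second)."""
    return [ttft + i / tds for i in range(n_tokens)]


def qoe(actual: Sequence[float], ttft: float, tds: float) -> float:
    """Simplified QoE in [0, 1] (an assumption, not the paper's formula).

    `actual` holds non-decreasing token delivery times in seconds.
    Tokens that arrive earlier than expected earn no extra credit,
    because the user cannot digest them any faster.
    """
    if not actual:
        raise ValueError("at least one token is required")
    expected = expected_times(len(actual), ttft, tds)
    horizon = max(actual[-1], expected[-1])

    # Area under a step curve up to `horizon` is the sum, over tokens,
    # of how long each token has been available to the user.
    expected_area = sum(horizon - e for e in expected)
    if expected_area == 0:
        return 1.0  # single token delivered exactly on time or early

    # Cap the actual curve by the expected curve: token i counts from
    # max(actual_i, expected_i) onward.
    capped_area = sum(horizon - max(a, e) for a, e in zip(actual, expected))
    return capped_area / expected_area


if __name__ == "__main__":
    on_time = expected_times(4, ttft=1.0, tds=2.0)  # [1.0, 1.5, 2.0, 2.5]
    late = [t + 1.0 for t in on_time]               # every token 1 s late
    print(qoe(on_time, ttft=1.0, tds=2.0))  # 1.0
    print(round(qoe(late, ttft=1.0, tds=2.0), 4))  # 0.4286
```

Under this reading, a request served faster than the expected timeline scores the same as one served exactly on time, which is why a scheduler can take resources away from it without hurting QoE.
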
The system also improves TTFT while keeping token delivery speed (TDS) above the user's expected speed. Andes outperforms baselines across different workloads and setups, demonstrating its effectiveness in enhancing user experience in LLM-based text streaming services.
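
The client-side token buffer mentioned in the summary can be illustrated with a short Python sketch. It assumes the buffer shows the first token as soon as it arrives and then releases at most one token every 1/TDS seconds. The function name `paced_display_times` and this exact pacing rule are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence


def paced_display_times(arrivals: Sequence[float], tds: float) -> list[float]:
    """Return the time at which each token is shown to the user.

    `arrivals` are non-decreasing token arrival times in seconds and
    `tds` is the user's expected token delivery speed in tokens/second.
    A token is shown no earlier than its arrival and no sooner than
    1/tds seconds after the previously shown token.
    """
    if tds <= 0:
        raise ValueError("tds must be positive")
    interval = 1.0 / tds
    display: list[float] = []
    for arrival in arrivals:
        if display:
            # Hold the token back if it would appear faster than expected.
            shown_at = max(arrival, display[-1] + interval)
        else:
            shown_at = arrival  # first token is shown immediately
        display.append(shown_at)
    return display


if __name__ == "__main__":
    # Three tokens arrive in a burst, then the server stalls until t = 2.0
    # (for example, because the request was preempted).
    print(paced_display_times([0.0, 0.0, 0.0, 2.0], tds=2.0))
    # [0.0, 0.5, 1.0, 2.0]
```

In this example the buffered burst is spread over the stall, so the user sees a steadier stream than the raw arrival times would give. This is the property that lets a token-level scheduler pause a request that is ahead of its expected timeline.
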