The paper "Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services" by Jiachen Liu et al. addresses how to optimize user experience in large language model (LLM)-based text streaming services. Traditional serving systems focus on server-side metrics such as token generation throughput and neglect the experience of individual users. The paper defines Quality-of-Experience (QoE) for text streaming services, a metric that accounts for the end-to-end token delivery process and for how users interact with the streamed text. The proposed system, Andes, is a QoE-aware serving system that allocates GPU resources among concurrent requests to enhance user experience. Andes uses a dynamic priority-based preemptive scheduler to optimize QoE, improving it by up to 3.2 times under high request rates, or alternatively sustaining up to 1.6 times higher request rates while maintaining high QoE.
The paper evaluates Andes using OPT models and demonstrates its effectiveness in various workloads and setups, showing significant improvements in QoE and system capacity without increasing resource costs.
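
To make the idea of a dynamic priority-based preemptive scheduler more concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the `Request` fields, the default expected time to first token and delivery speed, the `priority` formula, and the greedy selection are all assumptions chosen for illustration. The sketch only shows the general shape of the idea: at each scheduling step, requests that are about to fall behind the pace a user expects are served first, and requests that are already ahead can be paused to free GPU memory.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float              # time the request arrived, in seconds
    delivered: int              # tokens already delivered to the user
    context_len: int            # GPU memory the request needs, in tokens
    expected_ttft: float = 1.0  # assumed tolerated time to first token (s)
    expected_tds: float = 5.0   # assumed expected delivery speed (tokens/s)


def expected_tokens(req: Request, now: float) -> float:
    """Tokens the user would expect to have received by time `now`."""
    elapsed = now - req.arrival - req.expected_ttft
    return max(0.0, elapsed * req.expected_tds)


def priority(req: Request, now: float, horizon: float = 1.0) -> float:
    """Toy priority: tokens the request would lag behind the expected
    pace at `now + horizon` if it were not served, per token of memory."""
    deficit = max(0.0, expected_tokens(req, now + horizon) - req.delivered)
    return deficit / req.context_len


def schedule(requests: list[Request], now: float, capacity: int) -> list[Request]:
    """Greedily admit requests in priority order until the memory budget
    is exhausted. Requests left out are paused (preempted) for this step."""
    chosen, used = [], 0
    for req in sorted(requests, key=lambda r: priority(r, now), reverse=True):
        if used + req.context_len <= capacity:
            chosen.append(req)
            used += req.context_len
    return chosen


if __name__ == "__main__":
    reqs = [
        Request("a", arrival=0.0, delivered=50, context_len=100),  # well ahead
        Request("b", arrival=0.0, delivered=5, context_len=100),   # falling behind
        Request("c", arrival=4.0, delivered=0, context_len=50),    # just arrived
    ]
    running = schedule(reqs, now=5.0, capacity=150)
    print([r.name for r in running])
```

In this toy run, request `a` has already delivered more tokens than its user could have consumed, so it is the one paused when memory is tight, while `b` and `c` keep running. Because priorities are recomputed at every step, `a` regains priority once its delivered tokens no longer exceed the expected pace.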