INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference

2024 | Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang
INFERCEPT is an efficient LLM inference framework designed for augmented LLMs with interception support. It minimizes the GPU memory waste caused by LLM interceptions and dedicates the saved memory to serving more requests. Existing LLM inference systems treat interceptions as termination signals, which forces unnecessary recomputation and wastes GPU memory; INFERCEPT addresses these inefficiencies directly.

The paper makes three contributions: (1) equations that quantify GPU memory waste under different interception-handling strategies, (2) improved interception techniques, including swap pipelining and recomputation chunking, that reduce this waste, and (3) dynamic scheduling that minimizes overall GPU memory waste while ensuring fairness. These target the core challenges of handling LLM interceptions: interception times vary widely, context lengths differ across requests, and GPU memory must be managed efficiently throughout.

INFERCEPT is implemented on top of vLLM and evaluated on three LLMs and six interception types, across mixed and single-API workloads. It sustains a 1.6×–2× higher serving load than vLLM at similar per-token generation latency and completes over 2× more requests per second than state-of-the-art systems, demonstrating that it can handle diverse LLM augmentation scenarios efficiently. INFERCEPT is available at https://github.com/WukLab/InferCept.
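To make the waste-minimization idea concrete, below is a minimal Python sketch of how per-interception memory-time waste might be estimated for three natural handling strategies (discard and recompute, preserve in GPU memory, swap to host memory) and how a min-waste decision could pick among them per request. The formulas, class names, and parameter values are illustrative assumptions for this sketch, not INFERCEPT's exact equations or implementation.

```python
from dataclasses import dataclass

@dataclass
class InterceptedRequest:
    """State of a request paused at an interception (e.g., an external API call)."""
    context_tokens: int        # tokens whose KV cache currently lives on the GPU
    kv_bytes_per_token: int    # KV-cache footprint per token for this model
    expected_pause_s: float    # predicted duration of the interception

    @property
    def context_bytes(self) -> int:
        return self.context_tokens * self.kv_bytes_per_token


def estimate_waste(req, recompute_tokens_per_s, swap_bytes_per_s):
    """Illustrative memory-time waste (byte-seconds) for each handling strategy.

    Simplified assumptions in the spirit of the paper's discard/preserve/swap
    comparison; not the paper's actual equations.
    """
    # Discard: free the KV cache now, pay to recompute the context on resume.
    recompute_s = req.context_tokens / recompute_tokens_per_s
    waste_discard = req.context_bytes * recompute_s

    # Preserve: keep the KV cache resident but idle for the whole pause.
    waste_preserve = req.context_bytes * req.expected_pause_s

    # Swap: move the KV cache to host memory and back (round trip).
    swap_s = 2 * req.context_bytes / swap_bytes_per_s
    waste_swap = req.context_bytes * swap_s

    return {"discard": waste_discard, "preserve": waste_preserve, "swap": waste_swap}


def pick_strategy(req, recompute_tokens_per_s=8_000, swap_bytes_per_s=12e9):
    """Min-waste decision: choose whichever strategy wastes the least memory-time."""
    waste = estimate_waste(req, recompute_tokens_per_s, swap_bytes_per_s)
    return min(waste, key=waste.get), waste


if __name__ == "__main__":
    short_call = InterceptedRequest(context_tokens=2048, kv_bytes_per_token=160_000,
                                    expected_pause_s=0.2)
    long_call = InterceptedRequest(context_tokens=2048, kv_bytes_per_token=160_000,
                                   expected_pause_s=10.0)
    print(pick_strategy(short_call))   # a brief pause tends to favor preserving
    print(pick_strategy(long_call))    # a long pause tends to favor discard or swap
```

In this toy model, a short pause over a long context favors keeping the KV cache on the GPU, while a long pause favors freeing or swapping it, which is the intuition behind making the decision dynamically per request rather than applying one fixed strategy to all interceptions.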