INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference


2024 | Reyna Abhyankar*, Zijian He*, Vikranth Srivatsa, Hao Zhang, Yiying Zhang
The paper introduces INFERCEPT, an efficient inference framework for augmented large language models (LLMs) whose generation is intercepted whenever the model calls out to an external augmentation. Traditional LLM inference systems are optimized for standalone LLMs and treat each external interaction as the end of a generation, discarding and later recomputing previously computed context; this recomputation accounts for 37-40% of total model forwarding time. INFERCEPT instead aims to minimize the GPU resource waste caused by these interceptions and thereby improve overall serving throughput.

Key contributions of INFERCEPT include:

1. **Min-waste interception**: the core idea of minimizing GPU memory waste during interceptions (a decision-policy sketch appears below).
2. **Waste calculation equations**: quantifying the memory waste incurred by the different interception-handling strategies.
3. **Optimized interception techniques**: improving each individual handling strategy to reduce or eliminate its memory waste.
4. **Swap pipelining**: overlapping swapping with foreground computation layer by layer (sketched below).
5. **Chunking for recomputation**: splitting a resumed sequence's recomputation across multiple model-forwarding iterations to manage GPU-CPU link and compute resources (sketched below).

The evaluation shows that INFERCEPT sustains 1.6x-2x higher serving load and completes 2x more requests per second than state-of-the-art LLM inference systems. The framework is implemented on top of vLLM and evaluated on three LLMs (GPT-J-6B, Vicuna-13B, and Llama3-70B) with six types of augmentations: arithmetic, question answering, virtual environments, chatbots, image generation, and text-to-speech.
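To make the min-waste idea concrete, here is a minimal Python sketch of the per-request decision, assuming waste is measured as a GPU-memory x time product over the three common handling strategies (discard, preserve, swap) and that per-request estimates of recomputation time, swap time, and interception duration are available. `estimate_waste` and `pick_min_waste_strategy` are hypothetical names for illustration, not the paper's API.

```python
def estimate_waste(ctx_bytes, recompute_s, swap_out_s, swap_in_s, intercept_s):
    """Estimate GPU waste (byte-seconds) for each interception strategy.

    Assumption (not from the paper's code): waste = GPU memory occupied or
    re-derived x the time during which it is wasted.
    """
    return {
        # Discard: the KV cache must be rebuilt, so the context memory is
        # effectively re-spent for the duration of the recomputation.
        "discard": ctx_bytes * recompute_s,
        # Preserve: the KV cache sits idle on the GPU while the external
        # augmentation call is in flight.
        "preserve": ctx_bytes * intercept_s,
        # Swap: GPU memory is tied up only while data moves over the
        # GPU-CPU link, in both directions.
        "swap": ctx_bytes * (swap_out_s + swap_in_s),
    }


def pick_min_waste_strategy(ctx_bytes, recompute_s, swap_out_s, swap_in_s, intercept_s):
    """Pick the interception-handling strategy with the smallest estimated waste."""
    waste = estimate_waste(ctx_bytes, recompute_s, swap_out_s, swap_in_s, intercept_s)
    return min(waste, key=waste.get)
```

For example, a large context behind a short expected interception would favor preserving, while a small context behind a slow external call would favor discarding or swapping; the policy simply picks whichever memory-time product is smallest.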
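Swap pipelining can be sketched with a dedicated CUDA copy stream that moves one layer's KV cache off the GPU while later layers' foreground computation proceeds. This is a hedged illustration using PyTorch streams, not vLLM's internals; `layers`, `kv_cache_gpu`, and `kv_cache_cpu` are placeholder structures.

```python
import torch

copy_stream = torch.cuda.Stream()

def forward_with_layerwise_swap_out(layers, hidden, kv_cache_gpu, kv_cache_cpu):
    """Run the model layer by layer while swapping out each layer's KV cache.

    Sketch only: `layers` is a list of decoder layers, `kv_cache_gpu[i]` is
    layer i's KV tensor on the GPU, and `kv_cache_cpu[i]` is a pinned CPU
    buffer of the same shape. A real system would also track when each
    layer's GPU blocks can be reused.
    """
    for i, layer in enumerate(layers):
        # Foreground computation for layer i on the default stream.
        hidden = layer(hidden)
        # Make layer i's KV writes visible to the copy stream, then issue
        # the copy there; it overlaps with layer i+1's computation.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            kv_cache_cpu[i].copy_(kv_cache_gpu[i], non_blocking=True)
    # Wait for all in-flight copies before the GPU blocks are freed.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return hidden
```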
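Chunked recomputation can likewise be sketched as bounding how much of a resumed sequence is re-prefilled in any single iteration. `model.prefill` below is a hypothetical stand-in for one model-forwarding pass that extends the KV cache; a real scheduler would interleave these chunks with other requests' work.

```python
def recompute_in_chunks(model, token_ids, kv_cache, chunk_size=512):
    """Rebuild a resumed request's KV cache over several forward iterations.

    Sketch only: `chunk_size` caps the tokens recomputed per iteration so
    that one long recomputation does not monopolize a scheduling step.
    """
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Each chunk runs in its own iteration; other requests in the batch
        # are not stalled behind the full recomputation.
        kv_cache = model.prefill(chunk, kv_cache)
    return kv_cache
```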