Optimizing LLM Queries in Relational Data Analytics Workloads

2025 | Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
This paper presents techniques to optimize Large Language Model (LLM) invocations in relational data analytics workloads. The key contribution is a set of efficient algorithms for reordering rows and fields in input tables to maximize key-value (KV) cache reuse during LLM serving. The approach can be applied to existing analytics systems and serving platforms with minimal changes.

LLM inference is computationally expensive and slow, making it costly to process large datasets. The key insight is that with oracular knowledge of all requests, both the requests and the fields within each request can be reordered to increase the number of cache hits. The paper introduces Optimal Prefix Hit Recursion (OPHR), an algorithm that divides the table into smaller subtables and reorders each subtable to maximize prefix hits. Because OPHR has exponential complexity and is impractical for large datasets, the paper also proposes Greedy Group Recursion (GGR), an approximate algorithm that leverages functional dependencies and table statistics to reduce the search space. The techniques are implemented in Apache Spark with vLLM as the model serving backend.

Evaluation uses a benchmark suite of 16 LLM queries of different types, spanning selection, projection, multi-LLM invocations, and retrieval-augmented generation (RAG) queries. The techniques yield a 1.5–3.4× speed-up in end-to-end query latency (job completion time) using Llama 3 models and reduce costs by up to 32% under OpenAI and Anthropic pricing, while preserving query semantics. Accuracy is robust to field reordering: larger models such as Llama-3-70B and GPT-4o show minimal accuracy differences compared to the original ordering. The algorithm's overhead is also minimal, with GGR running in under 15 seconds on datasets with up to 30K rows and 57 fields.
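The core reordering idea can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's GGR implementation: it uses per-field distinct-value counts (a crude stand-in for the table statistics GGR exploits) to place low-cardinality fields first, then sorts rows lexicographically so that consecutive serialized prompts share the longest possible prefix and hit the KV cache.

```python
def reorder_for_prefix_reuse(rows, fields):
    """Greedy sketch: order fields from fewest to most distinct values,
    then sort rows on that field order, so consecutive prompts share
    long common prefixes that a prefix-caching server can reuse."""
    # Distinct-value count per field (simple stand-in for table statistics).
    cardinality = {f: len({row[f] for row in rows}) for f in fields}
    # Low-cardinality fields go first: their values stay constant across
    # many consecutive rows, extending the shared prompt prefix.
    field_order = sorted(fields, key=lambda f: cardinality[f])
    ordered_rows = sorted(rows, key=lambda r: tuple(str(r[f]) for f in field_order))
    return field_order, ordered_rows

def serialize(row, field_order):
    # One prompt per row; shared leading fields form a cacheable prefix.
    return "\n".join(f"{f}: {row[f]}" for f in field_order)

rows = [
    {"city": "SF", "dept": "eng", "note": "a"},
    {"city": "NY", "dept": "eng", "note": "b"},
    {"city": "SF", "dept": "eng", "note": "c"},
    {"city": "NY", "dept": "ops", "note": "d"},
]
order, ordered = reorder_for_prefix_reuse(rows, ["city", "dept", "note"])
# After reordering, the two "SF"/"eng" rows are adjacent and their
# prompts share the prefix "city: SF\ndept: eng".
```

In the actual system, the choice of field order must also respect query semantics (e.g., fields referenced by the LLM prompt), and GGR recursively partitions the table rather than applying a single global sort.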