Optimizing LLM Queries in Relational Data Analytics Workloads

2025 | Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
This paper presents techniques to optimize Large Language Model (LLM) invocations in relational data analytics workloads. The key contribution is a set of efficient algorithms for reordering rows and fields in input tables to maximize key-value (KV) cache reuse during LLM serving. The approach can be applied to existing analytics systems and serving platforms with minimal changes.

LLM inference is computationally expensive and slow, making it costly to process large datasets. The key insight is that with oracular knowledge of all requests, both the requests and the fields within each request can be reordered to increase the number of cache hits. The paper introduces Optimal Prefix Hit Recursion (OPHR), an algorithm that divides the table into smaller subtables and reorders each subtable to maximize prefix hits. Because OPHR has exponential complexity and is impractical for large datasets, the paper also proposes Greedy Group Recursion (GGR), an approximate algorithm that leverages functional dependencies and table statistics to reduce the search space. The techniques are implemented in Apache Spark with vLLM as the model serving backend.

Evaluation uses a benchmark suite of 16 LLM queries of different types, spanning selection, projection, multi-LLM invocations, and retrieval-augmented generation (RAG) queries. The techniques yield a 1.5–3.4× speed-up in end-to-end query latency (job completion time) using Llama 3 models and reduce costs by up to 32% under OpenAI and Anthropic pricing, while preserving query semantics. Accuracy is robust to field reordering: larger models such as Llama-3-70B and GPT-4o show minimal accuracy differences compared to the original ordering. The algorithm's overhead is also minimal, with GGR running in under 15 seconds on datasets with up to 30K rows and 57 fields.
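The core reordering idea can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's GGR implementation: it uses per-field distinct-value counts (a crude stand-in for the table statistics GGR exploits) to place low-cardinality fields first, then sorts rows lexicographically so that consecutive serialized prompts share the longest possible prefix and hit the KV cache.

```python
def reorder_for_prefix_reuse(rows, fields):
    """Greedy sketch: order fields from fewest to most distinct values,
    then sort rows on that field order, so consecutive prompts share
    long common prefixes that a prefix-caching server can reuse."""
    # Distinct-value count per field (simple stand-in for table statistics).
    cardinality = {f: len({row[f] for row in rows}) for f in fields}
    # Low-cardinality fields go first: their values stay constant across
    # many consecutive rows, extending the shared prompt prefix.
    field_order = sorted(fields, key=lambda f: cardinality[f])
    ordered_rows = sorted(rows, key=lambda r: tuple(str(r[f]) for f in field_order))
    return field_order, ordered_rows

def serialize(row, field_order):
    # One prompt per row; shared leading fields form a cacheable prefix.
    return "\n".join(f"{f}: {row[f]}" for f in field_order)

rows = [
    {"city": "SF", "dept": "eng", "note": "a"},
    {"city": "NY", "dept": "eng", "note": "b"},
    {"city": "SF", "dept": "eng", "note": "c"},
    {"city": "NY", "dept": "ops", "note": "d"},
]
order, ordered = reorder_for_prefix_reuse(rows, ["city", "dept", "note"])
# After reordering, the two "SF"/"eng" rows are adjacent and their
# prompts share the prefix "city: SF\ndept: eng".
```

In the actual system, the choice of field order must also respect query semantics (e.g., fields referenced by the LLM prompt), and GGR recursively partitions the table rather than applying a single global sort.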