Optimizing LLM Queries in Relational Data Analytics Workloads

📅 2024-03-09
📈 Citations: 19
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) incur high inference latency and low throughput when processing batched relational data, largely because key-value (KV) cache entries are not reused across prompts derived from heterogeneous input tables. Method: This paper proposes a relational-structure-aware query reordering technique that jointly optimizes the row ordering and the column (field) ordering of input tables to maximize KV cache reuse, without modifying the model, API, or serving infrastructure, so it remains fully compatible with existing LLM serving systems. Contribution/Results: By combining an analysis of the KV cache mechanism, greedy locality optimization, and relational table structure-aware modeling, the approach is the first to co-optimize row-level and column-level locality. On a benchmark of diverse LLM queries run with Llama 3 models, it achieves up to a 3.4× end-to-end job completion speedup; a cost analysis under OpenAI and Anthropic pricing models shows a 32% reduction in LLM invocation costs.

๐Ÿ“ Abstract
Batch data analytics is a growing application for Large Language Models (LLMs). LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. However, LLM inference is highly costly and slow: for example, an NVIDIA L4 GPU running Llama3-8B can only process 6 KB of text per second, taking about a day to handle 15 GB of data; processing a similar amount of data costs around $10K on OpenAI's GPT-4o. In this paper, we propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads. Our key contribution is developing efficient algorithms for reordering the rows and the fields within each row of an input table to maximize key-value (KV) cache reuse when performing LLM serving. As such, our approach can be easily applied to existing analytics systems and serving platforms. Our evaluation shows that our solution can yield up to 3.4x improvement in job completion time on a benchmark of diverse LLM-based queries using Llama 3 models. Our solution also achieves a 32% cost savings under OpenAI and Anthropic pricing models.
Problem

Research questions and friction points this paper is trying to address.

Reducing high cost of LLM inference in analytics
Optimizing KV cache reuse for LLM serving
Improving efficiency of relational data LLM queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes row and field reordering for KV cache reuse
Reduces LLM inference cost in relational analytics
Improves job completion time by up to 3.4x
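To make the core idea concrete, here is a minimal sketch of how row and field reordering can increase KV-cache prefix reuse. This is an illustrative greedy heuristic, not the paper's actual algorithm (whose details are not given in this summary): columns with fewer distinct values are placed first so their repeated values form long shared prompt prefixes, and rows are then sorted so that rows with identical leading field values become adjacent. All function names below are hypothetical.

```python
# Hypothetical sketch of row/column reordering for KV-cache prefix reuse.
# Assumption: prompts are serialized field-by-field, and an LLM serving
# system can reuse cached KV entries for a shared prompt prefix across
# consecutive requests (prefix caching).

def reorder_for_prefix_reuse(rows, columns):
    """Return (column order, row order) that heuristically maximizes
    the shared prompt prefixes between consecutive rows."""
    # Columns with fewer distinct values first: their values repeat across
    # many rows, so placing them early lengthens the shared prefix.
    col_order = sorted(columns, key=lambda c: len({r[c] for r in rows}))
    # Sorting rows lexicographically on the reordered fields makes rows
    # with identical leading field values adjacent, so consecutive LLM
    # calls can reuse the cached KV entries for the common prefix.
    row_order = sorted(rows, key=lambda r: tuple(str(r[c]) for c in col_order))
    return col_order, row_order

def serialize(row, col_order):
    """Serialize one row into a prompt fragment in the chosen field order."""
    return "; ".join(f"{c}: {row[c]}" for c in col_order)
```

For example, on a table with a low-cardinality `country` column and a high-cardinality `name` column, the heuristic moves `country` to the front and groups rows by country, so consecutive prompts start with the same `country: ...` prefix.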