🤖 AI Summary
Industrial-scale large language model (LLM) systems suffer from increasingly bloated prompt templates, often spanning thousands of tokens, as teams iteratively refine them. This bloat drives up maintenance overhead, inference latency, and serving costs. To address it, the authors propose ProCut, a training-free, LLM-agnostic prompt compression framework. ProCut segments a prompt template into semantically meaningful units, uses LLM-driven attribution analysis to quantify each unit's contribution to task performance, and prunes low-utility components, integrating seamlessly with existing prompt-optimization pipelines. The core innovation is bringing attribution analysis into prompt compression, jointly preserving semantic integrity and computational efficiency. On five public benchmarks and real-world industrial deployments, ProCut reduces prompt size by 78% in production while maintaining or slightly improving task performance (up to 62% better than alternative compression methods), and its LLM-driven attribution estimator cuts compression latency by over 50%.
📝 Abstract
In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
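The segment-attribute-prune loop the abstract describes can be sketched in a few lines. The snippet below is an illustrative assumption, not the paper's implementation: it uses a simple leave-one-out ablation as the attribution signal (the actual framework uses an LLM-driven estimator precisely to avoid the cost of such ablations), and all names (`compress_prompt`, `toy_score`, `min_utility`) are hypothetical.

```python
# Hedged sketch of attribution-based prompt compression in the spirit of
# ProCut. Leave-one-out ablation stands in for the paper's LLM-driven
# attribution estimator; all identifiers here are illustrative.
from typing import Callable, List


def compress_prompt(
    units: List[str],
    score: Callable[[str], float],
    min_utility: float = 0.0,
) -> List[str]:
    """Keep only prompt units whose removal hurts the task score.

    units: semantically meaningful segments of the prompt template
           (instructions, few-shot examples, heuristic rules, ...).
    score: evaluates a candidate prompt on a validation set (higher = better).
    A unit's attribution is the score drop observed when it is ablated.
    """
    full_score = score("\n".join(units))
    kept = []
    for i, unit in enumerate(units):
        ablated = "\n".join(u for j, u in enumerate(units) if j != i)
        attribution = full_score - score(ablated)  # utility of this unit
        if attribution > min_utility:
            kept.append(unit)  # pruning drops everything at or below threshold
    return kept


# Toy demo: a scorer that only rewards prompts containing the core instruction.
def toy_score(prompt: str) -> float:
    return 1.0 if "Classify the sentiment" in prompt else 0.0


units = [
    "Classify the sentiment of the input as positive or negative.",
    "Example: 'great movie' -> positive",  # redundant under toy_score
    "Always be polite.",                   # low-utility heuristic rule
]
print(compress_prompt(units, toy_score))
```

With the toy scorer, only the first unit survives: removing either of the other two leaves the score unchanged, so their attribution is zero and they are pruned. In practice `score` would be an evaluation over held-out task examples, and the leave-one-out loop is exactly the expensive step the paper's attribution estimator is designed to approximate cheaply.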