Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

📅 2026-04-14

🤖 AI Summary

This work is a measurement study of a pre-routing architecture in which a small local model triages requests before they reach a cloud-hosted large language model (LLM), with the goal of reducing the token consumption and cost of coding agents. It evaluates seven tactics individually, in pairs, and as a greedy-additive subset: local routing, prompt compression, semantic caching, local drafting with cloud review, minimal-diff edits, structured intent extraction, and batching with vendor prompt caching. The evaluation covers four workload classes (edit-heavy, explanation-heavy, general chat, and RAG-heavy) and finds that the optimal tactic subset depends on the workload. The authors also release an open-source shim that speaks both MCP and the OpenAI-compatible HTTP interface. Local routing combined with prompt compression reduces cloud token usage by 45–79% on edit-heavy and explanation-heavy workloads, and the full tactic set, including draft-review, achieves a 51% reduction on RAG-heavy workloads. The study also measures dollar cost, latency, and routing accuracy.
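
To make the triage idea concrete, here is a minimal sketch of local routing, assuming a boolean triage decision and a whitespace-based token count. The names `route`, `RouteResult`, `count_tokens`, and the stub models are hypothetical and are not taken from the paper's shim; a real deployment would call a local model and a cloud endpoint instead of the stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteResult:
    answer: str
    handled_locally: bool
    cloud_tokens: int  # tokens sent to and received from the cloud model


def count_tokens(text: str) -> int:
    """Crude whitespace proxy for a tokenizer (an assumption for illustration)."""
    return len(text.split())


def route(prompt: str,
          is_trivial: Callable[[str], bool],
          local_model: Callable[[str], str],
          cloud_model: Callable[[str], str]) -> RouteResult:
    """Answer locally when the triage step says the request is trivial,
    otherwise forward it to the cloud model and count the tokens spent."""
    if is_trivial(prompt):
        return RouteResult(local_model(prompt), True, 0)
    answer = cloud_model(prompt)
    return RouteResult(answer, False, count_tokens(prompt) + count_tokens(answer))


# Hypothetical stubs standing in for a local classifier and the two models.
def stub_is_trivial(prompt: str) -> bool:
    return len(prompt.split()) <= 4


def stub_local(prompt: str) -> str:
    return "local: " + prompt


def stub_cloud(prompt: str) -> str:
    return "cloud answer for a harder request"


easy = route("rename variable x", stub_is_trivial, stub_local, stub_cloud)
hard = route("refactor the payment module to use async IO throughout",
             stub_is_trivial, stub_local, stub_cloud)
print(easy.handled_locally, easy.cloud_tokens)
print(hard.handled_locally, hard.cloud_tokens)
```

Passing the models in as callables keeps the sketch free of network calls, so the routing logic can be exercised without a running local or cloud model.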

📝 Abstract

We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching. We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy). We measure tokens saved, dollar cost, latency, and routing accuracy. Our headline finding is that T1 (local routing) combined with T2 (prompt compression) achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while on RAG-heavy workloads the full tactic set including T4 (draft-review) achieves 51% savings. We observe that the optimal tactic subset is workload-dependent, which we believe is the most actionable finding for practitioners deploying coding agents today.
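
The abstract does not spell out how the greedy-additive subset is built, so the following is one plausible reading rather than the authors' procedure: start from no tactics, repeatedly add the single tactic that most increases measured savings, and stop when no addition helps. The function name `greedy_additive` and the `TOY_SAVINGS` table are hypothetical, and the numbers in the table are made up for illustration, not results from the paper.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def greedy_additive(tactics: Sequence[str],
                    savings: Callable[[FrozenSet[str]], float]
                    ) -> Tuple[List[str], float]:
    """Grow a tactic subset one tactic at a time, keeping an addition
    only if it strictly improves the measured savings fraction."""
    chosen: List[str] = []
    best = savings(frozenset())
    remaining = list(tactics)
    while remaining:
        # Score every candidate extension of the current subset.
        scored = [(savings(frozenset(chosen + [t])), t) for t in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining tactic helps; stop adding
        chosen.append(top)
        remaining.remove(top)
        best = top_score
    return chosen, best


# Made-up savings fractions for one hypothetical workload (NOT paper data).
TOY_SAVINGS = {
    frozenset(): 0.0,
    frozenset({"T1"}): 0.30,
    frozenset({"T2"}): 0.20,
    frozenset({"T3"}): 0.05,
    frozenset({"T1", "T2"}): 0.50,
    frozenset({"T1", "T3"}): 0.28,
    frozenset({"T2", "T3"}): 0.22,
    frozenset({"T1", "T2", "T3"}): 0.48,
}

subset, fraction = greedy_additive(["T1", "T2", "T3"], TOY_SAVINGS.__getitem__)
print(subset, fraction)
```

In this toy table, adding T3 to T1 and T2 lowers savings, so the search stops at two tactics. Running the same search per workload class is one way a workload-dependent optimum could emerge.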

Problem

Research questions and friction points this paper is trying to address.

cloud LLM token usage

coding-agent workloads

local model triage

token reduction

workload-dependent optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

token reduction

local-cloud LLM collaboration

coding agents