TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

📅 2026-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high redundancy in key-value (KV) caches caused by All-Gather communication during synchronized multi-agent LLM execution, where existing approaches struggle to efficiently reuse shared context. The authors propose a collective KV cache sharing mechanism that enables one-shot reuse of entire rounds of shared KV blocks. They introduce a difference-aware storage structure that encodes sibling caches as sparse differentials relative to a primary replica, drastically reducing memory overhead. Integrated with a KV Collector, this design supports efficient, round-spanning collective reuse and management of KV caches. Experiments on the GenerativeAgents and AgentSociety benchmarks demonstrate that, compared to vLLM, the method supports up to 2.7× more concurrent agents, reduces per-agent KV cache size by up to 17.5×, and accelerates prefilling by up to 1.9×.
📝 Abstract
Multi-agent LLM applications organize execution in synchronized rounds where a central scheduler gathers outputs from all agents and redistributes the combined context. This All-Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All-Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy, achieving 11-17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirements, reduces per-agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per-request position-independent caching.
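To make the Diff-Aware Storage idea concrete, here is a minimal sketch of block-sparse diff encoding: each sibling cache is stored only as the set of blocks where it differs from a single master copy, and reconstructed by overlaying that diff. This is an illustrative assumption of the scheme described in the abstract, not the paper's implementation; the block contents here are strings standing in for KV tensor blocks, and the function names (`encode_diff`, `decode`) are hypothetical.

```python
# Hypothetical sketch of Diff-Aware Storage: a sibling KV cache is kept as a
# sparse diff {block index -> block} against one master copy, so blocks shared
# via the All-Gather pattern are stored exactly once.

def encode_diff(master, sibling):
    """Return only the blocks where the sibling differs from the master."""
    return {i: blk for i, (m, blk) in enumerate(zip(master, sibling)) if blk != m}

def decode(master, diff):
    """Reconstruct a sibling cache by overlaying its diff on the master."""
    return [diff.get(i, blk) for i, blk in enumerate(master)]

# Example: a round produces 8 shared context blocks; one agent's cache differs
# from the master only in its final, private block.
master = [f"shared-block-{i}" for i in range(8)]
sibling = master.copy()
sibling[7] = "agent-3-private-block"

diff = encode_diff(master, sibling)
assert len(diff) == 1                       # only 1 of 8 blocks stored per sibling
assert decode(master, diff) == sibling      # lossless reconstruction
```

With N agents whose prompts share the round's gathered context, the shared blocks are paid for once in the master copy, and each sibling costs only its private tail, which is where the reported order-of-magnitude per-agent storage reduction would come from.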
Problem

Research questions and friction points this paper is trying to address.

multi-agent LLM
KV cache redundancy
All-Gather communication
collective caching
synchronized execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

collective KV cache sharing
All-Gather communication
block-sparse diff encoding
multi-agent LLM serving
KV cache compression
Zhuohang Bian
Peking University
Feiyang Wu
Georgia Institute of Technology
Reinforcement Learning, Deep Learning
Chengrui Zhang
XJTLU
Deep Learning
Hangcheng Dong
Shanghai Jiao Tong University
Yun Liang
Peking University
Youwei Zhuo
Peking University