🤖 AI Summary
This work addresses the communication and integration bottleneck caused by centralized orchestration in multi-agent systems by proposing the Decentralized Language Models (DeLM) framework. DeLM uses a shared verified context as a common communication substrate, combining parallel agents, a task queue, local reasoning, and compact verified updates so that agents can asynchronously claim subtasks and build on one another's verified progress without routing every update through a central controller. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points.
📝 Abstract
Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
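The coordination loop described in the abstract (claim a subtask, read accumulated progress, reason locally, write back a verified update) can be illustrated with a minimal sketch. This is not the authors' implementation: `SharedContext`, `solve`, `verify`, `agent`, and `run` are hypothetical names introduced here, the model call is replaced by a stub, and the verifier is a placeholder check. It only shows how a task queue and a shared context could remove the need for a central controller.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """Hypothetical append-only store of verified updates, visible to all agents."""
    updates: list[str] = field(default_factory=list)

    def read(self) -> list[str]:
        # Return a copy so agents cannot mutate shared progress directly.
        return list(self.updates)

    def write(self, update: str) -> None:
        self.updates.append(update)


async def solve(subtask: str, progress: list[str]) -> str:
    """Stub standing in for local LLM reasoning (assumption, not a real model call)."""
    await asyncio.sleep(0)  # yield control, as an awaited model call would
    return f"{subtask}: done (saw {len(progress)} prior updates)"


def verify(update: str) -> bool:
    """Placeholder verifier; a real one might run tests or check evidence."""
    return update.endswith("updates)")


async def agent(name: str, queue: asyncio.Queue, context: SharedContext) -> None:
    """Claim subtasks until the queue is empty; no controller assigns work."""
    while True:
        try:
            subtask = queue.get_nowait()  # asynchronously claim a subtask
        except asyncio.QueueEmpty:
            return
        progress = context.read()              # read accumulated progress
        update = await solve(subtask, progress)  # local reasoning
        if verify(update):                     # write back only verified updates
            context.write(f"[{name}] {update}")
        queue.task_done()


async def run(subtasks: list[str], n_agents: int = 3) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    for subtask in subtasks:
        queue.put_nowait(subtask)
    context = SharedContext()
    await asyncio.gather(
        *(agent(f"agent-{i}", queue, context) for i in range(n_agents))
    )
    return context.read()


if __name__ == "__main__":
    for line in asyncio.run(run(["parse", "locate bug", "patch", "test"])):
        print(line)
```

In this sketch every agent runs the same loop, so adding agents does not add load on any single coordinator; the only shared state is the queue and the context.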