HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high failure rates and wasted computation that arise when multiple LLM-based coding agents access a shared, rate-limited API without coordination. The authors propose a transparent HTTP proxy that brings operating system–inspired scheduling primitives to LLM agent concurrency control and requires no modifications to existing agent code. The approach combines five such mechanisms: admission control, rate-limit tracking, additive-increase/multiplicative-decrease (AIMD) backpressure with circuit breaking, token budget management, and priority queuing. In experiments with 5–50 concurrent agents, the proxy reduces request failure rates from 72–100% to 0–18% and eliminates 48–100% of wasted compute, while adding less than 3 ms of proxy overhead per request.
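As a rough illustration of how admission control and priority queuing can work together, the sketch below caps the number of in-flight requests and admits waiting requests in priority order. It is not taken from the paper: the class `AdmissionQueue`, its methods, and the convention that a lower number means higher priority are all assumptions made for this example.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class AdmissionQueue:
    """Hypothetical admission controller with a priority queue (illustrative only)."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight  # assumed cap on concurrent upstream requests
        self.in_flight = 0
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, request_id: str, priority: int) -> None:
        # Lower number = higher priority (an assumption of this sketch).
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def admit_next(self) -> Optional[str]:
        # Admit the best waiting request only if a slot is free.
        if self.in_flight >= self.max_in_flight or not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        self.in_flight += 1
        return request_id

    def release(self) -> None:
        # Call when an admitted request finishes, freeing its slot.
        self.in_flight = max(0, self.in_flight - 1)
```

A proxy built this way would call `submit` when an agent's request arrives, `admit_next` whenever a slot may be free, and `release` when the upstream response completes, so excess requests wait in the queue instead of hitting the rate limit.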

📝 Abstract

When multiple LLM coding agents share a rate-limited API endpoint, they exhibit resource contention patterns analogous to unscheduled OS processes competing for CPU, memory, and I/O. In a motivating incident, 3 of 11 parallel agents died from connection resets and HTTP 502 errors (a 27% failure rate), despite the API having sufficient aggregate capacity to serve all 11 sequentially. We present HiveMind, a transparent HTTP proxy that applies five OS-inspired scheduling primitives (admission control, rate-limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing) to eliminate the failure modes caused by uncoordinated parallel execution. The proxy requires zero modifications to existing agent code and supports Anthropic, OpenAI, and local model APIs via auto-detected provider profiles. Our evaluation across seven scenarios (5-50 concurrent agents) shows that uncoordinated agents fail at 72-100% rates under contention, while HiveMind reduces failures to 0-18% and eliminates 48-100% of wasted compute. An ablation study reveals that transparent retry, rather than admission control, is the single most critical primitive, but that the primitives are most effective in combination. Real-world validation against Ollama confirms that HiveMind adds under 3 ms of proxy overhead per request. The system is open-source under the MIT license.
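To make the AIMD backpressure idea concrete, here is a minimal sketch of a concurrency limit that grows additively on success, shrinks multiplicatively on failure, and stops admitting requests for a cooldown period after repeated failures. The abstract does not give the paper's actual parameters or design, so the class name `AimdLimiter`, every default value, and the cooldown-based circuit breaker are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AimdLimiter:
    """Hypothetical AIMD concurrency limit with a cooldown circuit breaker."""

    limit: float = 4.0        # current allowed in-flight requests (assumed start)
    min_limit: float = 1.0
    max_limit: float = 64.0
    increase: float = 1.0     # additive step applied on each success
    decrease: float = 0.5     # multiplicative factor applied on each failure
    trip_after: int = 3       # consecutive failures that open the breaker
    cooldown_s: float = 5.0   # how long the breaker stays open
    consecutive_failures: int = 0
    open_until: float = 0.0   # timestamp before which no request is admitted

    def on_success(self) -> None:
        self.consecutive_failures = 0
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_failure(self, now: float) -> None:
        self.consecutive_failures += 1
        self.limit = max(self.min_limit, self.limit * self.decrease)
        if self.consecutive_failures >= self.trip_after:
            self.open_until = now + self.cooldown_s  # open the breaker

    def allowed(self, in_flight: int, now: float) -> bool:
        if now < self.open_until:
            return False  # breaker open: shed load until the cooldown ends
        return in_flight < int(self.limit)
```

The timestamp is passed in explicitly so the behaviour can be exercised without a real clock; a proxy would typically supply `time.monotonic()` and treat rate-limit or gateway errors from the upstream API as failures.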

Problem

Research questions and friction points this paper is trying to address.

LLM agents

resource contention

rate-limited API

concurrent workloads

failure modes

Innovation

Methods, ideas, or system contributions that make the work stand out.

OS-inspired scheduling

LLM agent coordination

rate-limit management