🤖 AI Summary
This work addresses three challenges in multi-agent software engineering (high coordination overhead, weak quality control, and insufficient human oversight) by proposing the SPOQ (Specialist Orchestrated Queuing) framework. SPOQ combines wave-based topological dispatch over task dependency graphs, dual validation gates that check quality before and after execution, and a Human-as-an-Agent (HaaA) mechanism, all within a three-tier agent hierarchy of Opus workers, Sonnet reviewers, and Haiku investigators. In the reported experiments, SPOQ raises planning coverage from 93.0 to 99.75 and parallelism from 31.0 to 75.25, while dual validation lowers the defect rate from 0.34 to 0.20 per task and lifts the test pass rate from 91.25% to 99.75%. Human review via HaaA reduces residual defects from 0.47 to 0.03 per task. The results are replicated on a locally hosted open-weights model, Qwen3.6-35B-A3B, which the authors take as evidence that the gains come from orchestration rather than from any specific model.
📝 Abstract
Multi-agent AI systems show promise for automating software engineering tasks, yet existing approaches suffer from coordination overhead, quality control gaps, and limited human oversight. We introduce SPOQ (Specialist Orchestrated Queuing), a methodology combining three innovations: (1) wave-based topological dispatch that computes parallel execution waves from task dependency graphs; (2) dual validation gates applying quality metrics before execution (planning validation) and after (code validation) to reduce rework cycles; and (3) Human-as-an-Agent (HaaA) integration, where a human specialist participates in decomposition and can be consulted during execution. SPOQ uses a three-tier agent hierarchy (Opus workers, Sonnet reviewers, Haiku investigators) to optimize cost-quality tradeoffs. We evaluate SPOQ through four experiments. Experiment 1: wave dispatch approaches the critical-path lower bound (ratio 1.03–1.11, speedup up to 14.3x); on a 2-slot local backend it delivers a stable 1.4x speedup. Experiment 2: SPOQ improves planning coverage from 93.0 to 99.75, eliminates cyclic plans, and lifts parallelism from 31.0 to 75.25. Experiment 3: dual validation reduces defects from 0.34 to 0.20 per task and lifts test pass rate from 91.25% to 99.75%. Experiment 4: human review reduces residual defects from 0.47 to 0.03 per task. Results are replicated on a locally hosted open-weights model (Qwen3.6-35B-A3B), verifying that the gains are attributable to orchestration rather than any specific model. A longitudinal study across 17 repositories, 8,589 commits, 1,822 tasks, and 13,866 tests (99.87% pass rate) provides ecological validation.
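The abstract does not spell out how execution waves are computed, so the following is only a minimal sketch of one common way to derive them: a layered variant of Kahn's topological sort, in which every task whose prerequisites are all complete joins the next wave. The function name `compute_waves`, the dictionary input format, and the task names are illustrative assumptions, not the authors' implementation. The sketch also rejects cyclic plans, which loosely mirrors the planning validation described above.

```python
from collections import defaultdict


def compute_waves(deps):
    """Group tasks into parallel execution waves.

    deps maps each task name to the set of tasks it depends on.
    (Hypothetical input format, assumed for illustration.)
    Returns a list of waves; tasks within one wave have no
    dependencies on each other and could run concurrently.
    """
    # Collect every task, including ones that appear only as prerequisites.
    tasks = set(deps)
    for prereqs in deps.values():
        tasks |= set(prereqs)

    indegree = {t: 0 for t in tasks}
    dependents = defaultdict(list)
    for task, prereqs in deps.items():
        for p in set(prereqs):
            indegree[task] += 1
            dependents[p].append(task)

    # The first wave holds all tasks with no prerequisites.
    wave = sorted(t for t in tasks if indegree[t] == 0)
    waves = []
    scheduled = 0
    while wave:
        waves.append(wave)
        scheduled += len(wave)
        next_wave = []
        for t in wave:
            for d in dependents[t]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    next_wave.append(d)
        wave = sorted(next_wave)

    # Any task left unscheduled must sit on a dependency cycle.
    if scheduled != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return waves


# Hypothetical four-task plan: two tasks can run in parallel in wave 2.
plan = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "tests": {"api", "ui"},
}
waves = compute_waves(plan)
```

For this plan, `waves` contains three waves: `schema` alone, then `api` and `ui` together, then `tests`. The number of waves equals the length of the longest dependency chain, which is why a wave schedule with enough workers can approach the critical-path lower bound mentioned in Experiment 1.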