MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the scheduling inefficiencies in Mixture-of-Agents (MoA) systems under limited GPU resources, which arise from imbalanced expert loads and varying generation lengths. To tackle this, the authors propose an integer linear programming-based scheduler that jointly optimizes expert deployment and request assignment, supporting both dynamic expert replication and fixed placement. Additionally, a confidence-aware adaptive aggregation mechanism uses inter-expert consistency to skip the final aggregator when the experts already agree. By integrating offline performance modeling with concurrent multi-model inference, the approach achieves up to 2.5× speedup in the expert phase, 4.23× in the aggregation phase, and 1.7–2.3× end-to-end acceleration on a 4-GPU system, while matching baseline accuracy within 0.1 percentage points.
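To make the joint placement-and-assignment idea concrete, here is a minimal sketch on a toy instance. It is not the authors' formulation: the function name `schedule`, the per-prompt costs, and the makespan objective are all illustrative assumptions, and a plain exhaustive search stands in for an ILP solver so the example has no dependencies. Pinned experts must place all their prompts on one worker, while replicated experts may split prompts across workers.

```python
from itertools import product


def schedule(prompt_counts, cost_per_prompt, num_workers, pinned):
    """Toy stand-in for an ILP scheduler (exhaustive search, tiny inputs only).

    prompt_counts:   {expert: number of prompts routed to it}
    cost_per_prompt: {expert: offline-profiled cost per prompt (hypothetical units)}
    pinned:          experts that must live on exactly one worker
    Returns (makespan, {expert: prompts assigned to each worker}).
    """
    experts = list(prompt_counts)

    def splits(expert):
        n = prompt_counts[expert]
        if expert in pinned:
            # Pinned expert: every prompt goes to a single worker.
            for w in range(num_workers):
                yield tuple(n if i == w else 0 for i in range(num_workers))
        else:
            # Replicated expert: any integer split of its prompts is allowed.
            for combo in product(range(n + 1), repeat=num_workers):
                if sum(combo) == n:
                    yield combo

    best = None
    for choice in product(*(list(splits(e)) for e in experts)):
        # Load of a worker = total cost of all prompts assigned to it.
        loads = [
            sum(choice[k][w] * cost_per_prompt[e] for k, e in enumerate(experts))
            for w in range(num_workers)
        ]
        makespan = max(loads)
        if best is None or makespan < best[0]:
            best = (makespan, dict(zip(experts, choice)))
    return best


# Made-up workload: one heavy reasoning expert, two lightweight pinned experts.
counts = {"reasoner": 4, "coder": 2, "chat": 3}
costs = {"reasoner": 10, "coder": 2, "chat": 1}
makespan, plan = schedule(counts, costs, num_workers=2, pinned={"coder", "chat"})
print(makespan, plan)
```

In this toy instance the search splits the heavy expert's prompts across both workers and places each lightweight expert on a single worker, which is the qualitative behavior the summary describes. A real scheduler would hand the same variables and constraints to an ILP solver rather than enumerate them.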
📝 Abstract
Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Executing this workload efficiently on limited GPU resources runs into two bottlenecks: skill-based routing creates skewed expert demand, and combining instruction-tuned LLMs with long-reasoning models results in extreme variability in generation lengths. Consequently, traditional scheduling strategies suffer from significant GPU idling and throughput collapse due to load imbalance. We present MOSAIC, a scheduling framework that accelerates MoA workloads. First, we formulate an Integer Linear Program (ILP) based scheduler that jointly optimizes expert placement and per-worker prompt assignment from offline-profiled costs, replicating reasoning experts across workers while pinning lightweight ones. Second, MOSAIC uses confidence-aware adaptive aggregation, leveraging inter-expert agreement to bypass the heavy final aggregator LLM for consensus queries. In our 4-GPU system, MOSAIC achieves up to 2.5x expert-stage, 4.23x aggregator-stage, and 1.7-2.3x end-to-end speedups over the baseline scheduler, while matching accuracy within 0.1pp.
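The aggregator-bypass idea can be sketched as follows. The abstract does not specify how confidence is computed, so this example assumes the simplest possible measure: the vote share of the most common expert answer compared against a fixed threshold. The function name `aggregate`, the `threshold` value, and the `aggregator` callable are hypothetical placeholders rather than MOSAIC's actual interface.

```python
from collections import Counter


def aggregate(answers, aggregator, threshold=0.75):
    """Return (final_answer, aggregator_was_called).

    answers:    list of expert answer strings for one query
    aggregator: callable standing in for the heavy aggregator LLM
    threshold:  assumed minimum vote share that counts as consensus
    """
    if not answers:
        raise ValueError("need at least one expert answer")
    # Normalize lightly so trivial formatting differences do not break agreement.
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    if agreement >= threshold:
        # Consensus query: skip the expensive aggregation step entirely.
        return top_answer, False
    # Experts disagree: fall back to the full aggregator.
    return aggregator(answers), True


def fake_aggregator(answers):
    # Placeholder for an LLM call; a real system would prompt the aggregator model.
    return "aggregated: " + " | ".join(answers)


print(aggregate(["42", "42", " 42", "41"], fake_aggregator))
print(aggregate(["a", "b", "c"], fake_aggregator))
```

The first call reaches the assumed consensus threshold and returns without invoking the aggregator; the second falls through to it. Exact-match voting only suits short, canonical answers, so free-form outputs would need a different agreement measure.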
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Agents
GPU scheduling
load imbalance
inference concurrency
expert routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Agents
Adaptive Aggregation
Inference Scheduling
Load Balancing
LLM Acceleration