🤖 AI Summary
Existing model-serving frameworks were built on narrow assumptions about model structure and struggle to serve increasingly complex composite multimodal models efficiently. This work proposes M*, a serving system built on a modular abstraction called the Walk Graph, which represents composite AI models uniformly as dataflow graphs. The abstraction supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. M* processes cross-modal, multitask requests as traversals over these graphs. In the reported evaluation, it achieves on average 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, up to 2.9× lower real-time factor and 2.7× higher throughput for text-to-speech workloads on Qwen3-Omni, and up to a 12.5× improvement over the V-JEPA 2-AC rollout baseline for robotic planning.
📝 Abstract
We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.
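To make the idea of "requests as traversals over a dataflow graph" concrete, here is a minimal Python sketch. It is not M*'s API: the `ToyGraph` class, the component names, and the string-tagging stubs are all hypothetical, chosen only to show how two different tasks could share components of one graph while following different walks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyGraph:
    """Hypothetical toy dataflow graph: named components plus directed edges."""
    components: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_component(self, name: str, fn: Callable[[str], str]) -> None:
        self.components[name] = fn
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

    def run_walk(self, walk: List[str], payload: str) -> str:
        # Reject walks that hop between components with no connecting edge.
        for src, dst in zip(walk, walk[1:]):
            if dst not in self.edges[src]:
                raise ValueError(f"no edge from {src} to {dst}")
        # Execute the components in walk order, passing the output along.
        for name in walk:
            payload = self.components[name](payload)
        return payload


def build_toy_graph() -> ToyGraph:
    """Assumed layout: one shared encoder and backbone, two output heads."""
    graph = ToyGraph()
    for name in ("text_encoder", "backbone", "image_head", "audio_head"):
        # Stub component: wraps its input in the component's name.
        graph.add_component(name, lambda x, n=name: f"{n}({x})")
    graph.add_edge("text_encoder", "backbone")
    graph.add_edge("backbone", "image_head")
    graph.add_edge("backbone", "audio_head")
    return graph


# Two request types expressed as different walks over the same graph.
TEXT_TO_IMAGE = ["text_encoder", "backbone", "image_head"]
TEXT_TO_SPEECH = ["text_encoder", "backbone", "audio_head"]

if __name__ == "__main__":
    g = build_toy_graph()
    print(g.run_walk(TEXT_TO_IMAGE, "a red bicycle"))
    print(g.run_walk(TEXT_TO_SPEECH, "hello"))
```

In this sketch both walks reuse `text_encoder` and `backbone`, which is the kind of sharing a graph-level abstraction makes visible to a runtime. Placement onto a cluster and the runtime optimizations the abstract mentions are out of scope here.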