Towards Efficient Large Multimodal Model Serving

📅 2025-02-02

📈 Citations: 0

✨ Influential: 0

career value

222K/year

🤖 AI Summary

Multimodal large language models (LMMs) suffer from unstable inference performance, high resource consumption, and severe cross-modal request interference during production deployment. Method: We systematically analyze multi-stage inference behavior and resource contention patterns across six open-source LMMs, comparing decoder-only and cross-attention architectures. We propose a decoupled serving architecture enabling per-stage independent resource allocation and elastic scaling, and introduce stage co-location optimization—jointly maximizing throughput and resource utilization under latency constraints. Our approach integrates multimodal request trajectory modeling, heterogeneous resource scheduling, and compute-memory co-optimization. Contribution/Results: Evaluation shows up to 2.3× higher throughput, 41% reduction in tail latency, and 37% improvement in resource utilization compared to state-of-the-art baselines.

Technology Category

Application Category

📝 Abstract

Recent advances in generative AI have led to large multi-modal models (LMMs) capable of simultaneously processing inputs of various modalities such as text, images, video, and audio. While these models demonstrate impressive capabilities, efficiently serving them in production environments poses significant challenges due to their complex architectures and heterogeneous resource requirements. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, on six representative open-source models. We investigate their multi-stage inference pipelines and resource utilization patterns that lead to unique systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions, diverse modal combinations, and bursty traffic patterns. Our key findings reveal that different LMM inference stages exhibit highly heterogeneous performance characteristics and resource demands, while concurrent requests across modalities lead to significant performance interference. To address these challenges, we propose a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage. We further propose optimizations such as stage colocation to maximize throughput and resource utilization while meeting the latency objectives.

Problem

Research questions and friction points this paper is trying to address.

Multimodal AI Models

Resource Consumption

Inference Performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Service Architecture

Stage-level Resource Allocation

Concurrency Optimization

🔎 Similar Papers

No similar papers found.