🤖 AI Summary
To address the high inference latency and computational overhead of multimodal large language models (MLLMs) in resource-constrained edge environments, this paper proposes an adaptive, heterogeneous modality-aware offloading framework built on edge-cloud collaboration. The method introduces a lightweight modality-aware module that jointly models visual-language input complexity and real-time system state, driving a dynamic offloading decision mechanism based on multidimensional features. It supports fine-grained, task-level scheduling that adaptively distributes computational load between edge and cloud. Experiments demonstrate that the approach reduces end-to-end inference latency by over 30% compared to baseline methods, cuts GPU memory and compute usage by 30%–65%, and incurs only marginal accuracy degradation (<1.2%). This work offers a practical paradigm for efficient, deployable multimodal AI inference in edge-cloud settings.
📝 Abstract
Multimodal large language models (MLLMs) enable powerful cross-modal inference but impose significant computational and latency burdens, posing severe challenges for deployment in resource-constrained environments. In this paper, we propose MoA-Off, an adaptive heterogeneous modality-aware offloading framework with edge-cloud collaboration for efficient MLLM inference. MoA-Off introduces a lightweight heterogeneous modality-aware module that estimates the complexity of heterogeneous inputs through multi-dimensional feature analysis. It then applies an adaptive edge-cloud collaborative offloading strategy that dynamically schedules workloads between edge and cloud based on modality-aware complexity scores and real-time system state. Experimental results demonstrate that MoA-Off achieves over 30% lower latency and a 30%–65% reduction in resource overhead while maintaining accuracy competitive with traditional approaches.
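The core decision loop described above — score the input's complexity from multi-dimensional features, combine it with real-time system state, and route the task to edge or cloud — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all feature choices, weights, and thresholds here (entropy normalization, the 0.5 routing threshold, the utilization and latency coefficients) are hypothetical stand-ins for the learned modality-aware module.

```python
# Illustrative sketch of a modality-aware offloading decision in the spirit of
# MoA-Off. All names, features, weights, and thresholds are assumptions; the
# paper's actual complexity estimator is a learned lightweight module.
from dataclasses import dataclass

@dataclass
class SystemState:
    edge_gpu_util: float       # current edge GPU utilization, 0.0-1.0
    network_latency_ms: float  # measured edge-to-cloud round-trip latency

def complexity_score(image_entropy: float, text_len: int, num_objects: int) -> float:
    """Toy multi-dimensional complexity estimate for a visual-language input.

    Normalizes each feature to [0, 1] and averages them; a real module would
    learn this mapping rather than hand-tune it.
    """
    e = min(image_entropy / 8.0, 1.0)  # entropy of an 8-bit image caps at 8 bits
    t = min(text_len / 512.0, 1.0)     # assume a 512-token prompt budget
    o = min(num_objects / 20.0, 1.0)   # assume at most 20 detected regions
    return (e + t + o) / 3.0

def offload_decision(score: float, state: SystemState, threshold: float = 0.5) -> str:
    """Route complex inputs or a busy edge device to the cloud, discounted by
    network latency so a slow link keeps work local."""
    effective = score + 0.3 * state.edge_gpu_util - 0.002 * state.network_latency_ms
    return "cloud" if effective > threshold else "edge"

state = SystemState(edge_gpu_util=0.8, network_latency_ms=40.0)
s = complexity_score(image_entropy=6.4, text_len=256, num_objects=10)
print(offload_decision(s, state))  # → cloud
```

Under these toy numbers the complexity score is 0.6; adding the busy-edge term (+0.24) and subtracting the latency discount (−0.08) gives 0.76, which exceeds the 0.5 threshold, so the task is offloaded to the cloud.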