Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing model compression methods struggle with dynamic power constraints and stochastic sensor missingness in edge-based multimodal inference, and often rely on post-deployment fine-tuning that incurs high energy consumption and adapts poorly. This work proposes SentryFuse, a framework that introduces modality-aware zero-shot pruning for the first time. By leveraging first-order gradient saliency to supervise the training of modality-conditioned importance scores, SentryFuse prunes attention heads and feed-forward channels at deployment without any fine-tuning. It further replaces dense self-attention with sparse grouped-query attention. Evaluated across three multimodal architectures, the method achieves an average accuracy gain of 12.7% (up to 18% under sensor missingness) while reducing memory usage by 28.2%, cutting compute by 15% in GFLOPs, and lowering peak latency by up to 1.63×.
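The summary's second component, grouped-query attention, reduces compute and KV memory by letting several query heads share one key/value head. The following is a minimal NumPy sketch of that sharing pattern only; the function name, shapes, and omission of the paper's sparsity mask are illustrative assumptions, not SentryFuse's actual implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads):
    """Grouped-query attention sketch (illustrative, not the paper's code).

    q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
    Each contiguous group of query heads attends over a single shared
    key/value head, shrinking the KV cache by num_q_heads / num_kv_heads.
    """
    num_q_heads, seq, d = q.shape
    group = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group  # which shared KV head this query head maps to
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
        out[h] = attn @ v[kv]
    return out
```

With `num_kv_heads` equal to the number of query heads this reduces to standard multi-head attention, so the sharing factor is a tunable compute/quality knob.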

📝 Abstract
Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and up to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.
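The SentryGate idea described above can be sketched as two steps: score each attention head with first-order (Taylor-style) gradient saliency, then combine per-modality scores at deployment, masking out absent sensors before selecting which heads to keep. This NumPy sketch is a simplified reading of the abstract; the function names, the sum-combination rule, and the `keep_ratio` parameter are assumptions, not the paper's specification.

```python
import numpy as np

def head_saliency(weights, grads):
    """First-order saliency per attention head (illustrative assumption).

    weights, grads: (num_heads, ...) arrays of parameters and their
    loss gradients. Saliency is the summed magnitude |w * dL/dw|,
    i.e. a first-order Taylor estimate of each head's loss impact.
    """
    num_heads = weights.shape[0]
    return np.abs(weights * grads).reshape(num_heads, -1).sum(axis=1)

def modality_conditioned_prune(saliency_per_modality, present, keep_ratio=0.5):
    """Pick heads to keep given which modalities are present at deployment.

    saliency_per_modality: list of (num_heads,) saliency vectors, one per
    sensor modality; present: matching list of booleans. Scores from
    missing sensors are simply dropped -- no fine-tuning step is run.
    """
    combined = sum(s for s, p in zip(saliency_per_modality, present) if p)
    num_heads = len(combined)
    k = max(1, int(num_heads * keep_ratio))
    keep = np.argsort(combined)[::-1][:k]  # top-k most salient heads
    return np.sort(keep)
```

In this reading, dropping a modality reshapes the importance ranking itself, which is what lets the pruning decision adapt to sensor dropout at zero retraining cost.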
Problem

Research questions and friction points this paper is trying to address.

multimodal inference
edge computing
sensor dropout
model pruning
zero-shot compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality-aware pruning
zero-shot compression
sparse attention
multimodal edge inference
sensor dropout robustness