Differentially Private Multimodal In-Context Learning

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing differential privacy (DP) methods: they struggle to support multimodal, many-example in-context learning because the privacy budget is consumed rapidly as token counts grow. The authors propose the first DP framework for this setting, compressing hundreds of multimodal examples into a compact task vector in activation space. Privacy is preserved by partitioning the data into disjoint batches, applying layer-wise clipping to bound sensitivity, and injecting a single calibrated noise perturbation into the aggregated task vector, satisfying (ε, δ)-differential privacy while permitting unlimited inference queries. Experiments show that at ε = 1.0 the method achieves 50% accuracy on VizWiz, substantially outperforming the zero-shot baseline (35%) and approaching non-private performance (55%), with consistent gains across eight benchmarks.
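The aggregation step described above (clip each chunk's task vector, average, add one draw of calibrated noise) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the substitution sensitivity bound of 2·C/n, and the use of the standard Gaussian mechanism calibration σ = Δ·√(2 ln(1.25/δ))/ε are assumptions made for the sketch.

```python
import numpy as np

def dp_aggregate_task_vectors(task_vectors, clip_norm=1.0,
                              epsilon=1.0, delta=1e-5, rng=None):
    """Hypothetical sketch: per-vector L2 clipping, mean aggregation,
    and a single Gaussian noise addition for (epsilon, delta)-DP."""
    rng = rng or np.random.default_rng()
    # Clip each chunk's task vector to L2 norm <= clip_norm.
    clipped = []
    for v in task_vectors:
        norm = np.linalg.norm(v)
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        clipped.append(v * scale)
    n = len(clipped)
    avg = np.mean(clipped, axis=0)
    # Replacing one chunk changes the mean by at most 2*clip_norm/n
    # (assumed substitution sensitivity).
    sensitivity = 2.0 * clip_norm / n
    # Standard Gaussian-mechanism calibration (valid for epsilon < 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # One noise draw on the aggregate; released vector can then be
    # reused for unlimited queries by post-processing.
    return avg + rng.normal(0.0, sigma, size=avg.shape)
```

Because noise is added once to the aggregate rather than per query, every subsequent use of the returned vector is post-processing and incurs no additional privacy cost.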

📝 Abstract
Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
Problem

Research questions and friction points this paper is trying to address.

differential privacy
multimodal learning
in-context learning
vision-language models
privacy cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentially Private Multimodal In-Context Learning
Task Vectors
Activation Space Aggregation
Private Vision-Language Models
Ivoline C. Ngong
University of Vermont, Burlington, VT, USA
Zarreen Reza
Independent Researcher
Joseph P. Near
University of Vermont
Security & Privacy · Differential Privacy · Programming Languages · Formal Methods · Machine Learning