🤖 AI Summary
This work addresses the challenge that existing differential privacy (DP) methods struggle to support multimodal, multi-example in-context learning because the privacy budget is consumed rapidly as token counts grow. The authors propose the first DP framework capable of handling such scenarios by compressing hundreds of multimodal examples into a compact task vector in activation space. Privacy is preserved by partitioning the data into disjoint chunks, applying layer-wise clipping to bound sensitivity, and injecting a single calibrated noise perturbation into the aggregated task vector, thereby satisfying (ε, δ)-differential privacy while enabling unlimited inference queries. Experiments demonstrate that at ε = 1.0, the method achieves 50% accuracy on VizWiz—substantially outperforming the zero-shot baseline (35%) and approaching non-private performance (55%)—with consistent gains validated across eight benchmarks.
📝 Abstract
Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate; only a single noise addition is required, enabling unlimited inference queries. DP-MTV supports deployment with or without auxiliary data, and we evaluate it on eight benchmarks across three VLM architectures. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
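The clip-aggregate-noise pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the per-layer clipping norm, and the classic Gaussian-mechanism calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$ are illustrative assumptions (the paper's exact sensitivity analysis and noise calibration may differ). The sketch treats each chunk's task vector as a dict of per-layer activation vectors, clips each layer to a fixed $\ell_2$ norm, averages across chunks, and adds Gaussian noise once to the aggregate.

```python
import numpy as np

def dp_task_vector(chunk_vectors, clip_norm, epsilon, delta, rng=None):
    """Illustrative (eps, delta)-DP aggregation of per-chunk task vectors.

    chunk_vectors: list of dicts mapping layer name -> activation vector,
        one dict per disjoint chunk of the private demonstration data.
    clip_norm: per-layer l2 clipping bound (hypothetical hyperparameter).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(chunk_vectors)
    layers = list(chunk_vectors[0].keys())

    # Per-layer clipping bounds each chunk's contribution to the mean.
    clipped = []
    for vecs in chunk_vectors:
        c = {}
        for name in layers:
            v = np.asarray(vecs[name], dtype=np.float64)
            norm = np.linalg.norm(v)
            c[name] = v * min(1.0, clip_norm / max(norm, 1e-12))
        clipped.append(c)

    # Replacing one chunk moves each layer's mean by at most 2*clip_norm/n.
    sensitivity = 2.0 * clip_norm / n
    # Classic Gaussian-mechanism calibration (assumed here for simplicity).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Single noise addition on the aggregate: the released vector is private,
    # so unlimited inference queries follow by post-processing.
    return {
        name: np.mean([c[name] for c in clipped], axis=0)
              + rng.normal(0.0, sigma, size=np.asarray(clipped[0][name]).shape)
        for name in layers
    }
```

Because the noisy aggregate is released once, every subsequent inference query that conditions on it incurs no additional privacy cost, which is why the budget does not scale with the number of queries or tokens at inference time.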