🤖 AI Summary
This work addresses two obstacles to deploying multimodal models: high inference cost and reliance on scarce, precisely paired data. Existing generative data augmentation methods synthesize the missing modalities, but they decode generative latents into raw signals that the downstream model must then re-encode, which adds redundancy and computational overhead. The authors instead use the undecoded latent representations of generative models directly as privileged information, an approach they call Direct Latent Augmentation (DLA), and introduce Multilayer Explicit Simulated Synesthesia (MESSy) to transfer this knowledge to a unimodal visual student. Rather than forcing the student to match the multimodal representations, MESSy trains it with a predictive objective, so the student acquires latent structure aligned with physical attributes it never directly observes. Experiments reported by the authors show that the framework significantly outperforms raw data augmentation and traditional knowledge distillation, yielding accurate visual models with synesthetic latent structures.
📝 Abstract
While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.
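The abstract does not give the loss, so the following is only a minimal sketch of the contrast it draws between rigid representation matching and a predictive objective. Every name here (`matching_loss`, `predictive_loss`, `multilayer_predictive_loss`, `total_loss`), the linear prediction heads, the mean squared error, and the weight `lam` are illustrative assumptions and not the authors' implementation. The `latents` argument stands in for precomputed, undecoded generative latents used as privileged information during training.

```python
# Hypothetical sketch: plain-Python vectors stand in for feature tensors.

def linear(weights, x):
    """Apply a linear map (list of rows) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("dimension mismatch")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def matching_loss(student_feat, latent):
    # Rigid matching: the student's own feature must equal the latent,
    # so both must live in the same space.
    return mse(student_feat, latent)

def predictive_loss(student_feat, latent, head):
    # Predictive objective (assumed form): a separate head maps the
    # student feature to a prediction of the latent, leaving the
    # student's native feature space free to differ from the latent's.
    return mse(linear(head, student_feat), latent)

def multilayer_predictive_loss(student_feats, latents, heads):
    # Assumed reading of "multilayer": one head and one target per layer.
    return sum(
        predictive_loss(f, z, h)
        for f, z, h in zip(student_feats, latents, heads)
    )

def total_loss(task_loss, student_feats, latents, heads, lam=0.5):
    # lam is a hypothetical trade-off weight between the task loss
    # and the auxiliary prediction term.
    return task_loss + lam * multilayer_predictive_loss(
        student_feats, latents, heads
    )
```

Under these assumptions, the prediction heads and the latents are needed only during training. At inference the student runs on images alone, which is consistent with the unimodal deployment the abstract describes.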