Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

📅 2026-06-11

🤖 AI Summary

This work proposes Hölder++, a novel multimodal variational autoencoder that addresses the longstanding trade-off between generation quality and cross-modal semantic consistency. By introducing approximation-free Hölder pooling for the first time in multimodal VAEs, extending the architecture with explicit shared-private representation disentanglement, and designing a hierarchical variational inference mechanism, the model substantially alleviates tensions among generation fidelity, diversity, and modality alignment. This approach yields a more structured latent space and enhances the effectiveness of shared representations for downstream tasks, demonstrating superior performance in both generative quality and cross-modal coherence.

📝 Abstract

Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence; that is, they struggle to generate realistic and diverse samples that are also semantically consistent across modalities. Recent work shows that using a simple approximation to Hölder pooling as the aggregation method improves coherence over the state-of-the-art MMVAE+, despite assuming a single shared representation across all modalities. However, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.
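
For readers unfamiliar with the aggregation step named in the abstract, the sketch below illustrates the general idea of Hölder (power-mean) pooling of densities. The abstract does not state the paper's exact formulation, so everything here is an assumption made for illustration: the helper names `gaussian_pdf` and `holder_pool`, the 1-D latent grid, the two toy Gaussian "experts", and the equal weights are all hypothetical. The only facts relied on are mathematical: a weighted power mean with exponent 1 is the arithmetic mean (a mixture of the experts), and its limit as the exponent approaches 0 is the weighted geometric mean (a product-of-experts-style combination).

```python
import numpy as np


def gaussian_pdf(z, mean, std):
    """Density of a 1-D Gaussian evaluated at the points z."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def holder_pool(densities, weights, alpha, dz):
    """Weighted power (Hölder) mean of densities on a uniform grid.

    densities: array of shape (M, G), one row per modality-specific expert.
    weights:   array of shape (M,), non-negative and summing to 1.
    alpha:     power-mean exponent; alpha == 0 is treated as the limit case.
    dz:        grid spacing, used to renormalize the pooled density.
    """
    densities = np.asarray(densities, dtype=float)
    w = np.asarray(weights, dtype=float)[:, None]
    if alpha == 0.0:
        # Limit alpha -> 0: weighted geometric mean of the experts.
        pooled = np.exp(np.sum(w * np.log(densities), axis=0))
    else:
        pooled = np.sum(w * densities ** alpha, axis=0) ** (1.0 / alpha)
    # Power means of densities are generally unnormalized, so renormalize
    # numerically with a Riemann sum over the grid.
    return pooled / (pooled.sum() * dz)


# Toy setup (hypothetical): two unit-variance experts centered at -1 and +1.
grid = np.linspace(-8.0, 8.0, 4001)
dz = grid[1] - grid[0]
experts = np.stack([gaussian_pdf(grid, -1.0, 1.0), gaussian_pdf(grid, 1.0, 1.0)])
weights = np.array([0.5, 0.5])

for alpha in (1.0, 0.5, 0.0):
    q = holder_pool(experts, weights, alpha, dz)
    variance = np.sum(q * grid ** 2) * dz  # the mean is 0 by symmetry
    print(f"alpha={alpha}: variance of pooled density = {variance:.3f}")
```

In this toy setting the pooled density narrows as the exponent decreases: the mixture at exponent 1 keeps the spread of both experts, while the geometric-mean limit concentrates where the experts agree. This mirrors, in a very simplified way, the tension between sample diversity and cross-modal coherence that the abstract describes; it is not a reproduction of the paper's approximation-free method.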

Problem

Research questions and friction points this paper is trying to address.

multimodal VAEs

generative quality

coherence

quality-coherence trade-off

cross-modal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hölder pooling

multimodal VAE

shared-private representation