🤖 AI Summary
This work proposes Hölder++, a novel multimodal variational autoencoder that addresses the longstanding trade-off between generation quality and cross-modal semantic consistency. By introducing approximation-free Hölder pooling for the first time in multimodal VAEs, extending the architecture with explicit shared-private representation disentanglement, and designing a hierarchical variational inference mechanism, the model substantially alleviates tensions among generation fidelity, diversity, and modality alignment. This approach yields a more structured latent space and enhances the effectiveness of shared representations for downstream tasks, demonstrating superior performance in both generative quality and cross-modal coherence.
📝 Abstract
Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that are, at the same time, semantically consistent across modalities. Recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the state-of-the-art MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the trade-off between generative quality and coherence through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves this trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.
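The abstract does not spell out how Hölder pooling is defined. As a rough illustration only, the sketch below assumes it refers to the weighted power (Hölder) mean of the per-modality densities, as in the opinion-pooling literature, and evaluates it numerically on a 1D grid. The function names (`gaussian_pdf`, `holder_pool`), the exponent `alpha`, and the toy Gaussian "experts" are all hypothetical choices made for this example; none of this is the authors' implementation.

```python
import numpy as np

def gaussian_pdf(z, mean, std):
    """Density of a 1D Gaussian evaluated on the grid z."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def holder_pool(densities, weights, alpha, dz):
    """Weighted power mean of densities, renormalized on the grid.

    Assumed form (not taken from the paper):
        q(z) proportional to (sum_i w_i * p_i(z)**alpha) ** (1 / alpha)
    with the alpha -> 0 limit being the weighted geometric mean.
    """
    densities = np.asarray(densities, dtype=float)
    weights = np.asarray(weights, dtype=float)[:, None]
    if alpha == 0.0:
        # Limit case: weighted geometric mean (log-linear pooling).
        pooled = np.exp(np.sum(weights * np.log(densities + 1e-300), axis=0))
    else:
        pooled = np.sum(weights * densities ** alpha, axis=0) ** (1.0 / alpha)
    # Renormalize numerically so the pooled curve integrates to 1 on the grid.
    return pooled / (pooled.sum() * dz)

# Toy setup: two unimodal "experts" that disagree about the latent variable.
z = np.linspace(-10.0, 10.0, 4001)
dz = z[1] - z[0]
experts = [gaussian_pdf(z, -2.0, 1.0), gaussian_pdf(z, 2.0, 1.0)]
weights = [0.5, 0.5]

mixture = holder_pool(experts, weights, alpha=1.0, dz=dz)  # arithmetic mean
product = holder_pool(experts, weights, alpha=0.0, dz=dz)  # geometric mean
```

Under this assumed definition, `alpha = 1` recovers a mixture-of-experts style average (bimodal in this toy case), while `alpha -> 0` recovers a product-of-experts style geometric mean (a single mode between the two experts). Intermediate exponents interpolate between the two behaviours.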