🤖 AI Summary
Existing Transformer-based multimodal models struggle to capture high-order modality interactions efficiently: they are either limited to pairwise interactions or incur computational cost that is quadratic in the number of modalities. This work proposes GRAMformer, a multimodal Transformer architecture built around the Volumetric Multimodal cross-Attention (VMA) mechanism. VMA derives attention scores from the volume of the parallelotope spanned by a query vector and the modality-specific key vectors, which lets it model joint dependencies among an arbitrary number of modalities. The approach is reported to improve multimodal fusion performance while remaining computationally efficient, outperforming state-of-the-art methods across multiple benchmark tasks and supporting the value of explicitly modeling high-order interactions.
📝 Abstract
Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores either through collections of pairwise dot-product interactions or by concatenating all modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by the query and key vectors across multiple modalities. This captures joint multimodal dependencies beyond pairwise similarity and enables native modeling of modality interactions of any order. We integrate VMA into a novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
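
To make the volume idea concrete, here is a minimal, dependency-free Python sketch. It relies on one standard fact: the volume of the parallelotope spanned by a set of vectors equals the square root of the determinant of their Gram matrix. Everything else is an assumption made for illustration and is not taken from the paper: the function names (`gram_volume`, `volume_attention`), the unit-normalization of the vectors, the choice to score each position as the negative volume (so that well-aligned query and keys, which span a small volume, receive more weight), and the `temperature` parameter. The actual VMA formulation may differ.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular: the vectors are linearly dependent
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the unit-normalized vectors.

    Uses volume = sqrt(det(G)) with G[i][j] = <v_i, v_j>.
    """
    unit = [normalize(v) for v in vectors]
    gram = [[dot(u, w) for w in unit] for u in unit]
    return math.sqrt(max(determinant(gram), 0.0))  # clamp rounding noise


def volume_attention(query, keys_per_position, temperature=1.0):
    """Softmax over negative volumes (hypothetical scoring rule).

    keys_per_position[j] holds one key vector per modality for position j.
    """
    scores = [
        -gram_volume([query] + keys) / temperature
        for keys in keys_per_position
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two positions, two modalities each, in a 3-dimensional embedding space.
query = [1.0, 0.0, 0.0]
aligned = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1]]      # keys close to the query
orthogonal = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # keys orthogonal to the query
weights = volume_attention(query, [aligned, orthogonal])
```

In this sketch each position needs a single determinant of an (M+1)×(M+1) Gram matrix for M modalities, rather than a separate attention pass per pair of modalities. That is one way a single score can depend on the joint configuration of all the vectors at once.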