LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
To address weak structural modeling, shallow cross-modal interactions, difficult alignment, and poor interpretability when fusing heterogeneous multimodal features that span domains, granularities (e.g., token, patch, frame, clip), and modalities, this paper proposes a relation-centric, learnable graph-power fusion paradigm. It maps high-dimensional features into an interpretable graph space and constructs cross-granularity relationship graphs. A learnable graph-power operator aggregates element-wise relationship scores via multilinear polynomials over homogeneous graphs, enabling structure-aware deep interaction. The method balances expressive power and interpretability, fusing text, image, and video features without explicit alignment. Evaluated on video anomaly detection, it significantly outperforms concatenation-based, attention-based, and conventional nonlinear fusion baselines, demonstrating strong generalization and effectiveness.

📝 Abstract
In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels, e.g., clip, frame, patch, token, etc. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
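The pipeline the abstract describes (relationship graphs per modality, graph power expansion, a learnable operator combining the powers) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, cosine similarity is assumed as the relationship measure, and the per-power scalar coefficients stand in for parameters the paper would learn end-to-end.

```python
import numpy as np

def relationship_graph(feats, eps=1e-8):
    """Cosine-similarity adjacency over feature rows
    (rows = clips, frames, patches, or tokens)."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    return f @ f.T

def graph_power_expansion(A, coeffs):
    """Weighted polynomial of graph powers: sum_k coeffs[k] * A^k, with A^0 = I."""
    out = np.zeros_like(A)
    P = np.eye(A.shape[0])
    for c in coeffs:
        out = out + c * P   # accumulate c_k * A^k
        P = P @ A           # next power
    return out

def fuse_modalities(adjs, coeff_table):
    """Fuse per-modality relationship graphs (same node set) by summing
    their graph-power expansions; coefficients would be trained in practice."""
    return sum(graph_power_expansion(A, c) for A, c in zip(adjs, coeff_table))
```

For example, with text and video graphs built over the same set of clips, `fuse_modalities([A_text, A_video], [[0.0, 1.0, 0.3], [0.0, 0.8, 0.5]])` yields a single fused relationship graph without any explicit cross-modal alignment step.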
Problem

Research questions and friction points this paper is trying to address.

Common fusion methods (concatenation, element-wise, nonlinear) miss structural relationships
Shallow feature interactions and misalignment across domains and modalities
High-dimensional feature fusion is inefficient and hard to interpret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph space for feature fusion
Learnable graph fusion operator
Multilinear polynomial relationship aggregation