HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction

📅 2025-07-01
🤖 AI Summary
Predicting social media post popularity is challenging because several multimodal factors are strongly coupled: visual content, textual descriptions, temporal dynamics, and user behavioral signals. To address this, we propose HyperFusion, a hierarchical multimodal fusion framework. First, we design cross-modal similarity measures that explicitly model inter-modal dependencies. Second, we introduce a two-stage training strategy based on pseudo-labeling to cope with limited labeled data. Third, we combine CLIP visual representations, Transformer-based text embeddings, and temporal-spatial metadata with user characteristics, and ensemble CatBoost, TabNet, and custom MLPs over these features. Our approach placed third in the SMP Challenge 2025 Image Track and achieves competitive performance on the SMP challenge dataset. The implementation is publicly available.
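
The summary does not say how the cross-modal similarity is computed. One common choice, assumed here purely for illustration, is the cosine similarity between an image embedding and a text embedding that live in the same space (as CLIP embeddings do). The function name and the toy vectors below are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: this stands in for the paper's cross-modal similarity
    measure, whose exact definition is not given in the summary.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Toy placeholder vectors standing in for an image and a caption embedding.
image_embedding = [0.2, 0.7, 0.1]
text_embedding = [0.1, 0.8, 0.3]
similarity_feature = cosine_similarity(image_embedding, text_embedding)
```

A scalar like `similarity_feature` could then be appended to a post's tabular features, so that a downstream model sees how well the image and the caption agree.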

📝 Abstract
Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. However, predicting post popularity remains challenging due to the complex interplay between visual, textual, temporal, and user behavioral factors. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for social media popularity prediction. Our approach employs a three-tier fusion architecture that progressively integrates features across abstraction levels: visual representations from CLIP encoders, textual embeddings from transformer models, and temporal-spatial metadata with user characteristics. The framework implements a hierarchical ensemble strategy combining CatBoost, TabNet, and custom multi-layer perceptrons. To address limited labeled data, we propose a two-stage training methodology with pseudo-labeling and iterative refinement. We introduce novel cross-modal similarity measures and hierarchical clustering features that capture inter-modal dependencies. Experimental results demonstrate that HyperFusion achieves competitive performance on the SMP challenge dataset. Our team achieved third place in the SMP Challenge 2025 (Image Track). The source code is available at https://anonymous.4open.science/r/SMPDImage.
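
The abstract names a two-stage training methodology with pseudo-labeling and iterative refinement but gives no procedure. The following is a minimal generic sketch of that idea, not the authors' method: a toy one-feature least-squares regressor stands in for the real models, and `fit_line`, `predict`, `two_stage_pseudo_label`, and the `rounds` parameter are all hypothetical names.

```python
from typing import List, Sequence, Tuple

Model = Tuple[float, float]  # (slope, intercept) of a 1-D linear regressor


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Ordinary least squares for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def predict(model: Model, xs: Sequence[float]) -> List[float]:
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def two_stage_pseudo_label(
    x_labeled: Sequence[float],
    y_labeled: Sequence[float],
    x_unlabeled: Sequence[float],
    rounds: int = 2,
) -> Model:
    # Stage 1: train on the labeled data only.
    model = fit_line(x_labeled, y_labeled)
    # Stage 2: label the unlabeled data with the current model, retrain on
    # the union, and repeat (the "iterative refinement" assumed here).
    for _ in range(rounds):
        pseudo_labels = predict(model, x_unlabeled)
        model = fit_line(
            list(x_labeled) + list(x_unlabeled),
            list(y_labeled) + pseudo_labels,
        )
    return model
```

Practical pseudo-labeling pipelines often keep only confident pseudo-labels or down-weight them. Whether HyperFusion does so is not stated in the abstract, so the sketch leaves that out.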

Problem

Research questions and friction points this paper is trying to address.

Predicting social media post popularity with multimodal data
Integrating visual, textual, temporal, and user behavioral factors
Addressing limited labeled data through two-stage training with pseudo-labeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical multimodal ensemble learning framework
Three-tier fusion architecture integrating diverse features
Two-stage training with pseudo-labeling and refinement
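
How the ensemble combines its members is not described on this page. As a rough illustration only, the sketch below blends per-model predictions with a weighted average, one simple way to combine models such as CatBoost, TabNet, and an MLP. The function name, the weights, and the prediction values are made-up placeholders, not results from the paper.

```python
from typing import Dict, List, Sequence


def blend_predictions(
    predictions: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-model predictions for the same posts.

    Assumption: a plain weighted average; the paper's hierarchical
    ensemble may combine its members differently.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    names = list(predictions)
    n_posts = len(predictions[names[0]])
    if any(len(predictions[name]) != n_posts for name in names):
        raise ValueError("all models must score the same number of posts")
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total_weight
        for i in range(n_posts)
    ]


# Placeholder scores for three posts from three hypothetical base models.
base_predictions = {
    "catboost": [6.1, 4.0, 8.2],
    "tabnet": [5.9, 4.4, 7.8],
    "mlp": [6.3, 3.9, 8.0],
}
blend_weights = {"catboost": 0.5, "tabnet": 0.25, "mlp": 0.25}
blended = blend_predictions(base_predictions, blend_weights)
```

In practice the weights would be tuned on a validation split rather than fixed by hand.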