🤖 AI Summary
This study addresses the challenge of modeling fMRI brain responses under multimodal naturalistic stimulation (video, audio, and text). We propose a two-stage Transformer architecture: Stage I employs modality-specific foundation models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) to extract heterogeneous features, which are then spatiotemporally aligned and fused across modalities using rotary position encodings; Stage II uses a temporal decoder Transformer to predict voxel- or parcel-level fMRI responses. To our knowledge, this is the first framework enabling efficient spatiotemporal alignment and joint modeling of multimodal representations under naturalistic movie paradigms. Trained on 65 hours of CNeuroMod data, our model achieves a mean parcel-wise Pearson correlation of 0.3225 on the Friends S07 test set and 0.2125 across six out-of-domain films, demonstrating strong generalization both in and out of distribution. It won Phase 1 and placed second overall in the Algonauts 2025 Challenge.
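The rotary position encoding used for temporal alignment can be illustrated with a minimal NumPy sketch. This is a generic RoPE implementation, not the paper's code: pairs of feature dimensions are rotated by an angle proportional to the position, so that the inner product of two encoded vectors depends only on their relative offset, which is what makes RoPE suitable for aligning feature streams sampled on different clocks. The function name and `base` parameter are illustrative assumptions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding at integer position `pos`.

    x: array of shape (..., d) with even d. Each pair of dimensions
    (x[2i], x[2i+1]) is rotated by the angle pos * theta_i, where the
    per-pair frequencies theta_i decay geometrically. A hypothetical
    minimal sketch, not the VIBE implementation.
    """
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-np.arange(half) / half)  # one frequency per dim pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The key property is that `rope(q, m) @ rope(k, n)` equals `rope(q, m+s) @ rope(k, n+s)` for any shift `s`: attention scores see only relative position, and the encoding preserves vector norms.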
📝 Abstract
We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
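The reported metric, mean parcel-wise Pearson correlation, can be sketched as follows: correlate the predicted and measured time series separately for each parcel, then average over parcels. This is a straightforward reading of the metric's name; the challenge's official scoring script may differ in detail, and the function name is an assumption.

```python
import numpy as np

def parcelwise_pearson(pred, target):
    """Mean Pearson correlation across brain parcels.

    pred, target: arrays of shape (time, parcels). For each parcel
    (column), compute Pearson r between the predicted and measured
    time series, then average over parcels. Illustrative sketch of
    the evaluation metric, not the challenge's official scorer.
    """
    p = pred - pred.mean(axis=0)              # center each parcel's series
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return (num / den).mean()
```

By this convention a score of 0.3225 means the predicted and observed parcel time series share, on average, about 10% of their variance (r squared), which is substantial for single-trial naturalistic fMRI.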