🤖 AI Summary
This study addresses the challenge of modeling fMRI brain responses under multimodal naturalistic stimulation (video, audio, and text). We propose a two-stage Transformer architecture: Stage I employs modality-specific foundation models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) to extract heterogeneous features, which are then spatiotemporally aligned and fused across modalities using rotary position encodings; Stage II uses a temporal decoder Transformer to predict voxel- or parcel-level fMRI responses. To our knowledge, this is the first framework enabling efficient spatiotemporal alignment and joint modeling of multimodal representations under naturalistic movie paradigms. Trained on 65 hours of CNeuroMod data, our model achieves a mean parcel-wise Pearson correlation of 0.3225 on the Friends S07 test set and 0.2125 across six out-of-domain films, demonstrating strong generalization both in and out of distribution. It won Phase 1 and placed second overall in the Algonauts 2025 Challenge.
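The rotary position encoding used for temporal alignment can be illustrated with a minimal NumPy sketch. This is a generic RoPE implementation, not the paper's code: pairs of feature dimensions are rotated by an angle proportional to the position, so that the inner product of two encoded vectors depends only on their relative offset, which is what makes RoPE suitable for aligning feature streams sampled on different clocks. The function name and `base` parameter are illustrative assumptions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding at integer position `pos`.

    x: array of shape (..., d) with even d. Each pair of dimensions
    (x[2i], x[2i+1]) is rotated by the angle pos * theta_i, where the
    per-pair frequencies theta_i decay geometrically. A hypothetical
    minimal sketch, not the VIBE implementation.
    """
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-np.arange(half) / half)  # one frequency per dim pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The key property is that `rope(q, m) @ rope(k, n)` equals `rope(q, m+s) @ rope(k, n+s)` for any shift `s`: attention scores see only relative position, and the encoding preserves vector norms.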
📝 Abstract
We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
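The reported metric, mean parcel-wise Pearson correlation, can be sketched as follows: correlate the predicted and measured time series separately for each parcel, then average over parcels. This is a straightforward reading of the metric's name; the challenge's official scoring script may differ in detail, and the function name is an assumption.

```python
import numpy as np

def parcelwise_pearson(pred, target):
    """Mean Pearson correlation across brain parcels.

    pred, target: arrays of shape (time, parcels). For each parcel
    (column), compute Pearson r between the predicted and measured
    time series, then average over parcels. Illustrative sketch of
    the evaluation metric, not the challenge's official scorer.
    """
    p = pred - pred.mean(axis=0)              # center each parcel's series
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return (num / den).mean()
```

By this convention a score of 0.3225 means the predicted and observed parcel time series share, on average, about 10% of their variance (r squared), which is substantial for single-trial naturalistic fMRI.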