TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

📅 2025-07-29
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Neuroscience has long been constrained by unimodal, region-specific, or task-limited paradigms, hindering the development of unified cognitive neural representation models. To address this, we propose the first cross-modal, whole-brain, and subject-adaptive fMRI encoding model. Our method integrates multimodal foundation model representations—text, audio, and video—and employs a temporal Transformer to capture the dynamic neural response to stimuli, enabling accurate prediction of fMRI signals across the entire cerebral cortex, especially in higher-order association areas. Critically, it is the first model to jointly encode all three sensory modalities. Evaluated on the Algonauts 2025 Brain Encoding Challenge, it achieves top-ranked performance, substantially outperforming unimodal baselines. Results demonstrate superior spatial specificity and temporal dynamics, validating its capacity for integrative brain modeling. This work establishes a new paradigm for holistic, multimodal human brain representation.

📝 Abstract
Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at https://github.com/facebookresearch/algonauts-2025.
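The pipeline the abstract describes (frozen text/audio/video foundation-model features, fused and passed through a temporal transformer, then read out as whole-brain fMRI responses) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the additive fusion, and the single self-attention layer are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes -- not taken from the paper.
T = 20                                        # fMRI time steps (TRs)
d_text, d_audio, d_video = 768, 512, 1024     # frozen foundation-model feature dims
d_model, n_voxels = 64, 1000

# Stand-ins for pretrained text/audio/video features aligned to the TRs.
feats = {
    "text":  rng.standard_normal((T, d_text)),
    "audio": rng.standard_normal((T, d_audio)),
    "video": rng.standard_normal((T, d_video)),
}

# 1) Project each modality into a shared space and sum (one possible fusion).
proj = {m: rng.standard_normal((f.shape[1], d_model)) / np.sqrt(f.shape[1])
        for m, f in feats.items()}
x = sum(feats[m] @ proj[m] for m in feats)    # (T, d_model)

# 2) One self-attention layer as a minimal stand-in for the temporal transformer.
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model))    # (T, T) attention over time
h = x + attn @ v                              # residual connection

# 3) Linear readout to whole-brain voxel responses.
W_out = rng.standard_normal((d_model, n_voxels)) / np.sqrt(d_model)
pred = h @ W_out                              # (T, n_voxels) predicted BOLD signal
print(pred.shape)
```

In practice the readout would be trained per subject (the model is subject-adaptive), and the temporal module would stack several transformer layers; this sketch only shows the data flow.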
Problem

Research questions and friction points this paper is trying to address.

Neuroscience has fragmented into specialized domains studying isolated modalities, tasks, or brain regions.
This fragmentation hinders the development of a unified model of cognition.
Unimodal encoding models fail to capture responses in high-level associative cortices.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First deep neural network predicting brain responses across multiple modalities, cortical areas, and individuals
Temporal transformer fuses pretrained text, audio, and video foundation-model representations
First place in the Algonauts 2025 brain encoding competition, with a significant margin over competitors