VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI

📅 2025-09-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing fMRI-based visual decoding methods rely on subject-specific training, which limits their generalizability and scalability. To address this, we propose VoxelFormer, a lightweight cross-subject decoding framework. It first compresses high-dimensional voxel sequences via Token Merging (ToMer); a query-driven Q-Former then generates fixed-dimensional neural representations explicitly aligned with the CLIP image embedding space, enabling efficient and semantically consistent mapping from fMRI signals to visual reconstructions. With significantly fewer parameters than state-of-the-art (SOTA) methods, VoxelFormer achieves competitive image retrieval performance on the 7T Natural Scenes Dataset. Crucially, it is the first method to demonstrate effective and scalable multi-subject visual reconstruction without substantial parameter overhead, validating both the feasibility and practicality of generalizable fMRI-to-image decoding.
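The two-stage pipeline described above can be sketched in plain numpy. This is a hypothetical simplification, not the paper's implementation: ToMer merges tokens inside transformer layers via soft matching, and the Q-Former uses multi-head attention with learned parameters, whereas here `merge_tokens` greedily averages the most similar adjacent pair and `cross_attention` is single-head with random weights; all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def merge_tokens(tokens, keep):
    """Token-merging sketch: repeatedly average the most cosine-similar
    adjacent pair of voxel tokens until only `keep` tokens remain."""
    toks = list(tokens)
    while len(toks) > keep:
        normed = [t / (np.linalg.norm(t) + 1e-8) for t in toks]
        sims = [normed[i] @ normed[i + 1] for i in range(len(toks) - 1)]
        i = int(np.argmax(sims))               # most similar adjacent pair
        toks[i] = (toks[i] + toks[i + 1]) / 2  # merge by averaging
        del toks[i + 1]
    return np.stack(toks)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: a fixed set of learned queries attends
    over the merged voxel tokens, yielding a fixed-size representation
    regardless of how many voxels a given subject contributes."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ keys_values

voxel_tokens = rng.standard_normal((128, 64))  # 128 voxel patches, dim 64 (toy sizes)
queries = rng.standard_normal((8, 64))         # 8 learned queries (fixed count)

merged = merge_tokens(voxel_tokens, keep=32)   # compressed to 32 tokens
neural_repr = cross_attention(queries, merged) # fixed-size (8, 64) output
print(merged.shape, neural_repr.shape)
```

The key point the sketch illustrates is that the query count, not the voxel count, fixes the output size, which is what makes a shared decoder across subjects with different voxel dimensionalities possible.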

๐Ÿ“ Abstract
Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce extbf{VoxelFormer}, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.
Problem

Research questions and friction points this paper is trying to address.

Multi-subject fMRI visual decoding without subject-specific training
Parameter-efficient neural decoding using lightweight transformer architecture
Aligning fMRI representations with CLIP image embedding space
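Once neural representations live in CLIP space, the retrieval evaluation reduces to nearest-neighbor search by cosine similarity. A minimal numpy sketch follows; the function name `retrieve`, the candidate-set size, and the additive-noise model for a "well-aligned" prediction are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def retrieve(neural_emb, clip_bank):
    """Return the index of the CLIP image embedding most
    cosine-similar to the decoded neural embedding."""
    n = neural_emb / np.linalg.norm(neural_emb)
    bank = clip_bank / np.linalg.norm(clip_bank, axis=1, keepdims=True)
    return int(np.argmax(bank @ n))

rng = np.random.default_rng(1)
clip_bank = rng.standard_normal((100, 512))  # 100 candidate image embeddings (toy size)
target = 42
# Simulate a well-aligned decoder output: the target embedding plus small noise.
neural_emb = clip_bank[target] + 0.1 * rng.standard_normal(512)
print(retrieve(neural_emb, clip_bank))
```

With noise this small relative to the signal, the nearest neighbor recovers the target index; retrieval accuracy then measures how tightly the decoder's outputs cluster around the true image embeddings.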
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight transformer for multi-subject training
Token Merging Transformer for voxel compression
Query-driven Q-Former aligning CLIP embeddings
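Alignment with the CLIP image embedding space is commonly trained with a symmetric InfoNCE-style contrastive loss. The paper's exact objective is not given on this page, so the numpy sketch below is an assumption about a standard choice, not the authors' loss.

```python
import numpy as np

def clip_alignment_loss(neural, clip_img, temperature=0.07):
    """Symmetric InfoNCE sketch: pull each neural embedding toward its
    paired CLIP image embedding and push it away from the rest of the
    batch. (A common alignment objective; the paper's may differ.)"""
    n = neural / np.linalg.norm(neural, axis=1, keepdims=True)
    c = clip_img / np.linalg.norm(clip_img, axis=1, keepdims=True)
    logits = n @ c.T / temperature            # (B, B) scaled cosine similarities

    def xent(lg):                             # cross-entropy with targets on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
clip_emb = rng.standard_normal((8, 64))
aligned = clip_alignment_loss(clip_emb, clip_emb)                 # predictions equal targets
mismatched = clip_alignment_loss(rng.standard_normal((8, 64)), clip_emb)
print(aligned < mismatched)
```

A perfectly aligned batch drives the diagonal similarities toward 1 and yields a near-zero loss, while unrelated embeddings score roughly the entropy of a uniform guess over the batch.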
Chenqian Le
Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY, USA
Yilin Zhao
Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY, USA
Nikasadat Emami
Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY, USA
Kushagra Yadav
Department of Computer Science and Engineering, New York University Tandon School of Engineering, Brooklyn, NY, USA
Xujin "Chris" Liu
Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY, USA
Xupeng Chen
Research Scientist, TikTok | Ph.D. in Electrical Engineering, New York University
LLM, Multi-Modal, BCI, Computer Vision, Natural Language Processing
Yao Wang
Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY, USA