🤖 AI Summary
Problem: Interpreting the semantic meaning of voxel-wise responses in the visual cortex remains challenging due to the opacity and semantic agnosticism of conventional fMRI decoding models.
Method: We propose the first cross-modal framework integrating large language models (LLMs)—such as GPT or LLaMA—with image encoders and customized prompt engineering to generate natural-language descriptions of activated fMRI voxels.
Contribution/Results: Our approach enables fine-grained, multi-concept semantic characterization of neural selectivity at both single-voxel and inter-voxel levels, overcoming the limitations of black-box encoding models. Experiments demonstrate significant improvements in descriptive accuracy and semantic richness over state-of-the-art methods. Moreover, we uncover, for the first time, fine-grained functional differentiation within visual cortical regions of interest (ROIs) and voxel-level co-representation of multiple semantic concepts. These findings advance the understanding of human perceptual mechanisms and establish a novel paradigm for interpretable, brain-inspired modeling.
📝 Abstract
Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to the development of brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa to image-evoked brain activity, we demonstrate that it generates captions that describe voxel selectivity more accurately than the previously proposed method. Moreover, the captions generated by LaVCa quantitatively capture more detailed properties than those of the existing method at both the inter-voxel and intra-voxel levels. A closer analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex, as well as voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations. Please check out our webpage at https://sites.google.com/view/lavca-llm/