🤖 AI Summary
Standard CNNs struggle to model the contextual sensitivity, particularly the peak tuning, of visual cortical neurons. Method: We introduce self-attention to explicitly capture non-local center-surround interactions, and we propose "peak tuning" as a new evaluation metric that, combined with tuning curve correlation, systematically quantifies contextual modulation. Through receptive field decomposition and parameter-matched comparisons, we dissect the functional specialization of local and surround information in tuning modeling. We demonstrate that self-attention can effectively replace late-stage convolutions and complements fully connected readout layers. Finally, we propose a staged learning paradigm for receptive field development and contextual modulation. Results: Experiments show that our model significantly outperforms parameter-matched CNNs on both peak tuning and tuning curve correlation, validating the critical role of surround information in modeling tuning peaks and enhancing the robustness of center-surround interactions.
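The paper's exact architecture is not specified in this summary, but the core idea of self-attention as a non-local mechanism can be sketched: every spatial position attends to every other position in one step, so a "center" location can pool information from its entire surround without stacking convolutions. A minimal single-head sketch in NumPy (all shapes and weight initializations here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feat, Wq, Wk, Wv):
    """Single-head self-attention over spatial positions.

    feat: (N, C) array, one C-dim feature vector per spatial location.
    Each position attends to all N positions, giving the non-local
    center-surround interaction that successive convolutions only
    build up gradually.
    """
    Q, K, V = feat @ Wq, feat @ Wk, feat @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (N, N) weights
    return attn @ V  # context-modulated features, shape (N, D)

rng = np.random.default_rng(0)
N, C, D = 49, 16, 16  # e.g. a 7x7 feature map flattened to 49 positions (hypothetical sizes)
feat = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))
out = spatial_self_attention(feat, Wq, Wk, Wv)
print(out.shape)  # (49, 16)
```

In a full model this block would sit in place of the late-stage spatial-integration convolutions that the summary says self-attention can replace.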
📝 Abstract
Convolutional neural networks (CNNs) have been shown to be state-of-the-art models for visual cortical neurons. Neurons in the primary visual cortex are sensitive to contextual information mediated by extensive horizontal and feedback connections. Standard CNNs integrate global contextual information to model contextual modulation via two mechanisms: successive convolutions and a fully connected readout layer. In this paper, we find that self-attention (SA), an implementation of non-local network mechanisms, can improve neural response predictions over parameter-matched CNNs on two key metrics: tuning curve correlation and peak tuning. We introduce peak tuning as a metric to evaluate a model's ability to capture a neuron's top feature preference. We factorize networks to assess each context mechanism, revealing that information in the local receptive field is most important for modeling overall tuning, but surround information is critically necessary for characterizing the tuning peak. We find that self-attention can replace posterior spatial-integration convolutions when learned incrementally, and is further enhanced in the presence of a fully connected readout layer, suggesting that the two context mechanisms are complementary. Finally, we find that decomposing receptive field learning and contextual modulation learning in an incremental manner may be an effective and robust mechanism for learning center-surround interactions.
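The abstract names two metrics but does not define them formally. A plausible sketch: tuning curve correlation is the Pearson correlation between a model's predicted responses and a neuron's measured responses across the stimulus set, while peak tuning measures how well the model recovers the neuron's most-preferred stimuli. The top-k overlap form of `peak_tuning` below is an assumption for illustration; the paper's exact definition may differ.

```python
import numpy as np

def tuning_curve_correlation(pred, meas):
    """Pearson correlation between predicted and measured responses
    across the full stimulus set (one neuron's tuning curve)."""
    return np.corrcoef(pred, meas)[0, 1]

def peak_tuning(pred, meas, k=10):
    """Hypothetical top-k overlap score: the fraction of the neuron's
    k most-preferred stimuli that the model also ranks in its top k.
    A score of 1.0 means the model fully captures the tuning peak."""
    top_pred = set(np.argsort(pred)[-k:])
    top_meas = set(np.argsort(meas)[-k:])
    return len(top_pred & top_meas) / k

rng = np.random.default_rng(1)
meas = rng.random(200)                        # measured responses to 200 stimuli
pred = meas + 0.1 * rng.standard_normal(200)  # a model that tracks them closely
print(tuning_curve_correlation(pred, meas))
print(peak_tuning(pred, meas, k=10))
```

A model can score well on overall correlation while still misranking the handful of most-preferred stimuli, which is why a separate peak-oriented metric is informative.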