🤖 AI Summary
It remains unclear whether current artificial neural networks (ANNs) trained on video can emulate the appearance-invariant encoding of dynamic information, such as object motion speed, in the macaque inferior temporal (IT) cortex. This study systematically evaluates the predictive power of static, recurrent, and video-trained ANNs by recording neural responses in macaque IT to both natural videos and “appearance-free” videos that preserve motion cues while removing shape and texture. Although video-trained ANNs show modest improvements in predicting late-phase IT responses to natural videos, they completely fail to generalize to the appearance-free condition. This indicates that these models learn motion features bound to visual appearance rather than the appearance-invariant temporal computations characteristic of biological IT. The findings underscore the need for novel learning objectives that better capture biologically plausible mechanisms for dynamic visual representation.
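As a rough illustration of the model-to-brain comparison described above, the sketch below shows how neural predictivity is commonly computed in this literature: cross-validated ridge regression from a model layer's activations to recorded IT responses. The function name, array shapes, and regression setup are assumptions for illustration, not the authors' actual pipeline.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def neural_predictivity(model_feats, it_responses, n_splits=5):
    """Cross-validated linear predictivity of IT responses from ANN features.

    model_feats:  (n_videos, n_features) activations from one model layer
    it_responses: (n_videos, n_neurons) trial-averaged IT firing rates
    Returns the mean (over folds) of the median per-neuron Pearson r
    on held-out videos. Shapes and metric choice are assumptions.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in kf.split(model_feats):
        # Ridge with internally cross-validated regularization strength
        reg = RidgeCV(alphas=np.logspace(-3, 3, 13))
        reg.fit(model_feats[train_idx], it_responses[train_idx])
        pred = reg.predict(model_feats[test_idx])
        # Correlate predicted and observed responses, neuron by neuron
        r = [pearsonr(pred[:, i], it_responses[test_idx, i])[0]
             for i in range(it_responses.shape[1])]
        fold_scores.append(np.median(r))
    return float(np.mean(fold_scores))
```

Under this kind of metric, "modest improvements at late response stages" would show up as video-trained models scoring slightly higher than static ones when the regression targets late time bins of the IT response.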
📝 Abstract
Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT's temporal responses reflect nothing more than time-unfolded feedforward transformations, framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on "appearance-free" variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.
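To make the stress test concrete, here is a minimal sketch of the cross-appearance generalization logic: fit a linear decoder (e.g., for motion-speed category) on responses to naturalistic videos, then score it on responses to the appearance-free variants. The arrays, labels, and decoder choice are hypothetical stand-ins; the authors' actual decoding analysis may differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cross_appearance_generalization(feats_natural, labels_natural,
                                    feats_af, labels_af):
    """Train a linear motion decoder on naturalistic-video responses,
    test it on appearance-free variants.

    feats_*:  (n_videos, n_units) IT population rates or ANN activations
    labels_*: (n_videos,) motion labels (e.g., speed category)
    Returns test accuracy on the appearance-free condition.
    """
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(max_iter=1000))
    clf.fit(feats_natural, labels_natural)
    # Chance-level accuracy here means motion information is bound to
    # appearance (the reported ANN failure mode); above-chance accuracy
    # indicates appearance-invariant motion coding (the reported IT result).
    return clf.score(feats_af, labels_af)
```

In this framing, the paper's result is that IT population activity yields above-chance transfer while features from all three ANN classes drop to chance, despite the video models' edge in raw predictivity.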