Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

📅 2025-05-19
🤖 AI Summary
Existing hard-attention models such as RAM and DRAM do not model the hierarchical organization of the human visual system. As a result, their attentional behavior is distorted: they either fixate excessively or make overly frequent saccades, deviating from natural eye-movement patterns. To address this, we propose the Multi-Level Recurrent Attention Model (MRAM), a hard-attention framework with a two-layer recurrent architecture that decouples glimpse location generation from task execution, so that a human-like balance between fixation and saccades emerges rather than being imposed. MRAM combines glimpse-based encoding, hard attention selection, and end-to-end reinforcement learning. On standard image classification benchmarks, MRAM consistently outperforms CNN, RAM, and DRAM baselines in accuracy, and its attention trajectories are more human-like than those of the baselines.

📝 Abstract
Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models such as the Recurrent Model of Visual Attention (RAM) and the Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention that is either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose the Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movements. Our results show that MRAM not only achieves more human-like attention dynamics but also consistently outperforms CNN, RAM, and DRAM baselines on standard image classification benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Existing models fail to mimic human vision hierarchy
Current attention models produce unnatural eye movements
Lack of balance between fixation and saccadic behavior
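
One way to make the notion of "balance" concrete is to label each step of a gaze trajectory by its length and report the share of short (fixational) versus long (saccadic) steps. The sketch below is illustrative only: the threshold value, the normalised [-1, 1] coordinate convention, and the function name `fixation_saccade_ratio` are assumptions and not the paper's evaluation protocol.

```python
import math

def fixation_saccade_ratio(trajectory, threshold=0.2):
    """Return (fixation_fraction, saccade_fraction) for a list of (y, x) points.

    Hypothetical metric: a step shorter than `threshold` counts as fixational,
    anything else as a saccade. The threshold is an arbitrary assumption.
    """
    # Euclidean distance between each pair of consecutive gaze locations.
    steps = [math.dist(a, b) for a, b in zip(trajectory, trajectory[1:])]
    if not steps:
        raise ValueError("need at least two gaze locations")
    fixations = sum(1 for s in steps if s < threshold)
    return fixations / len(steps), (len(steps) - fixations) / len(steps)

# Toy trajectory: three small steps and one large jump.
path = [(0.0, 0.0), (0.05, 0.0), (0.05, 0.05), (0.8, 0.6), (0.8, 0.65)]
fix_frac, sac_frac = fixation_saccade_ratio(path)
```

Under this toy measure, a model that is "overly fixational" would score near 1.0 on the first value, and an "excessively saccadic" one near 1.0 on the second.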
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Level Recurrent Attention Model (MRAM) proposed
Decouples glimpse location and task execution
Balances fixation and saccadic movement behavior
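
The decoupling listed above can be sketched as two stacked recurrent layers, with one hidden state feeding the classifier and the other feeding the location policy. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the layer sizes, the plain tanh recurrent cells, the choice of which layer serves the task and which serves location, the Gaussian location sampling, and all names (`TwoLevelAttention`, `extract_glimpse`) are hypothetical. Weights are random and untrained, and the reinforcement learning step is omitted.

```python
import numpy as np

def extract_glimpse(image, loc, size=8):
    """Crop a size x size patch centred at loc, given in [-1, 1] coordinates."""
    h, w = image.shape
    # Map normalised coordinates to the pixel index of the patch centre.
    cy = int(round((loc[0] + 1) / 2 * (h - 1)))
    cx = int(round((loc[1] + 1) / 2 * (w - 1)))
    half = size // 2
    # Zero-pad so patches near the border keep a fixed shape.
    padded = np.pad(image, half, mode="constant")
    return padded[cy:cy + size, cx:cx + size]

class TwoLevelAttention:
    """Hypothetical two-layer recurrent hard-attention model (untrained)."""

    def __init__(self, glimpse_size=8, hidden=32, n_classes=10, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.1, size=shape)

        self.glimpse_size = glimpse_size
        self.hidden = hidden
        self.W_in = init(hidden, glimpse_size * glimpse_size + 2)  # glimpse + location
        self.W_task = init(hidden, 2 * hidden)  # assumed task layer
        self.W_loc = init(hidden, 2 * hidden)   # assumed location layer
        self.W_cls = init(n_classes, hidden)    # class logits from the task state
        self.W_pos = init(2, hidden)            # location mean from the location state

    def run(self, image, n_glimpses=6, noise_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        h_task = np.zeros(self.hidden)
        h_loc = np.zeros(self.hidden)
        loc = np.zeros(2)  # start at the image centre
        trajectory = [loc.copy()]
        for _ in range(n_glimpses):
            patch = extract_glimpse(image, loc, self.glimpse_size)
            feat = np.tanh(self.W_in @ np.concatenate([patch.ravel(), loc]))
            # Task layer: accumulates evidence used for classification.
            h_task = np.tanh(self.W_task @ np.concatenate([feat, h_task]))
            # Location layer: separate state that only drives where to look next.
            h_loc = np.tanh(self.W_loc @ np.concatenate([h_task, h_loc]))
            mean = np.tanh(self.W_pos @ h_loc)
            # Hard attention: sample one location instead of weighting all pixels.
            loc = np.clip(mean + rng.normal(0.0, noise_std, size=2), -1.0, 1.0)
            trajectory.append(loc.copy())
        return self.W_cls @ h_task, np.array(trajectory)

image = np.random.default_rng(1).random((28, 28))
logits, trajectory = TwoLevelAttention().run(image)
```

Keeping `h_task` and `h_loc` as separate states is the point of the sketch: the classifier never reads the location state directly, so the two functions can specialise independently during training.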