Attend to Anything: Foundation Model for Unified Human Attention Modeling

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to modeling human attention are highly fragmented across modalities, scenarios, and tasks, and lack a unified framework. This work proposes the Attend to Anything Model (AAM), a foundation model for human attention that uses language prompts and hierarchical embeddings in hyperbolic space to represent attention as a cognitive entailment from general to specific. It also introduces a fluid-dynamics formulation grounded in the Fokker–Planck equation to jointly model attention in static images and dynamic videos. AAM unifies attention modeling across image, video, and audio-visual tasks, outperforming state-of-the-art approaches by an average of 6% across 16 benchmarks while running video inference approximately four times faster.
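
To make the "general to specific" idea concrete, the following is a minimal sketch of distance in the Poincaré ball, one common model of hyperbolic space. It is an illustration only: the summary does not say which hyperbolic model AAM uses, and the function name `poincare_distance` and the toy 2-D embeddings are hypothetical. Under the usual convention for hierarchical embeddings, general concepts sit near the origin and specific ones near the boundary.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)

# Hypothetical embeddings (not from the paper): one general prompt near the
# origin and two specific prompts near the boundary.
general = (0.1, 0.0)
specific_a = (0.8, 0.1)
specific_b = (0.1, 0.8)

d_parent_child = poincare_distance(general, specific_a)
d_siblings = poincare_distance(specific_a, specific_b)
print(d_parent_child, d_siblings)
```

In this toy setup the two specific points are farther from each other than either is from the general point, which is the tree-like behavior that makes hyperbolic space a natural fit for hierarchies.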

📝 Abstract

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, and fail to generalize in practice to real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker–Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
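
As a rough illustration of what "diffusive temporal evolution governed by the Fokker–Planck equation" can mean, the sketch below evolves a toy 1-D attention density under the standard equation ∂p/∂t = −∂(μp)/∂x + D ∂²p/∂x², with a constant drift μ and diffusion coefficient D. This is not the paper's method: the abstract gives no discretization, and the function names, the grid, and all parameter values here are assumptions chosen only so the explicit scheme stays stable.

```python
def fokker_planck_step(p, drift, diffusion, dx, dt):
    """One explicit finite-volume step of dp/dt = -d(drift*p)/dx + diffusion*d2p/dx2.

    Zero-flux boundaries keep the total mass constant.
    Assumes drift >= 0 (upwind from the left) and a dt small enough for stability.
    """
    n = len(p)
    # flux[i] is the flux through the face between cell i-1 and cell i;
    # the two outer faces stay at zero (nothing enters or leaves the domain).
    flux = [0.0] * (n + 1)
    for i in range(1, n):
        advective = drift * p[i - 1]
        diffusive = -diffusion * (p[i] - p[i - 1]) / dx
        flux[i] = advective + diffusive
    return [p[i] - dt / dx * (flux[i + 1] - flux[i]) for i in range(n)]

def evolve(p, steps, drift=0.5, diffusion=0.1, dx=1.0, dt=0.5):
    """Apply several steps; the defaults are illustrative values only."""
    for _ in range(steps):
        p = fokker_planck_step(p, drift, diffusion, dx, dt)
    return p

# Toy "frame 0": all attention mass concentrated in one cell.
frame0 = [0.0] * 21
frame0[5] = 1.0
frame_t = evolve(frame0, steps=10)

center = sum(i * v for i, v in enumerate(frame_t))
print(round(sum(frame_t), 6), center)
```

The drift term moves the attention peak over time while the diffusion term spreads it out, and the flux form keeps the density normalized, which is the property that lets a saliency map be read as a probability distribution at every frame.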

Problem

Research questions and friction points this paper is trying to address.

human attention

saliency modeling

cross-modal generalization

scene-dependent models

task-specific models

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model

human attention modeling

hyperbolic embeddings