Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

πŸ“… 2025-02-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses context-aware human affordance generation in 2D scenes, i.e., predicting physically plausible and semantically consistent human poses and spatial placements given the scene semantics. To handle the tight coupling between pose and scene, the authors propose a mutual cross-modal attention mechanism and decompose the problem into fine-grained subtasks: location sampling, pose-template selection, and scale/deformation modeling. The method combines variational autoencoders, cross-attention between spatial feature maps of two modalities, and multi-scale scene context encoding to enable multi-stage conditional sampling. On complex 2D scenes, the approach shows significant improvements over the previous baseline, with better pose plausibility and scene alignment, and offers an interpretable, generalizable foundation for affordance-aware generation in embodied agents.
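The mutual cross-modal attention described above can be sketched in a few lines: two flattened spatial feature maps, one per modality, each attend to the other with scaled dot-product attention. This is an illustrative single-head sketch, not the paper's implementation; the array shapes and the residual connection are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context):
    """Single-head scaled dot-product attention: `query` tokens attend
    to `context` tokens. Shapes: query (n_q, d), context (n_ctx, d)."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)    # (n_q, n_ctx)
    return softmax(scores, axis=-1) @ context  # (n_q, d)

def mutual_cross_attention(feat_a, feat_b):
    """Mutually attend two flattened spatial feature maps (tokens x dim):
    modality A is enriched by B and vice versa, with residual connections."""
    a_out = feat_a + cross_attend(feat_a, feat_b)  # A queries B
    b_out = feat_b + cross_attend(feat_b, feat_a)  # B queries A
    return a_out, b_out
```

In a real model the two inputs would be learned feature maps (e.g., an image encoder and a pose/semantic-map encoder) with learned query/key/value projections; the sketch keeps only the mutual-attention structure.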


πŸ“ Abstract
Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging and non-trivial. Moreover, existing datasets and methods for human affordance prediction in 2D scenes remain significantly limited. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled into individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and the template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.
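The staged conditional sampling the abstract describes (location, then pose template, then scale and deformation) can be outlined as follows. All function names, dimensions, and the linear stand-ins for the trained VAE decoders and the template classifier are hypothetical; only the stage ordering and the conditioning structure follow the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(idx, n):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

def sample_cvae(weights, bias, cond, latent_dim=8):
    # Decoder-side sampling from a trained conditional VAE: draw z from the
    # standard-normal prior and decode it together with the condition vector.
    # A single linear layer stands in for the real decoder network.
    z = rng.standard_normal(latent_dim)
    return weights @ np.concatenate([z, cond]) + bias

def predict_template(local_ctx, n_templates):
    # Hypothetical stand-in for the pose-template classifier: argmax over a
    # random linear projection of the local context encoding.
    logits = rng.standard_normal((n_templates, local_ctx.size)) @ local_ctx
    return int(np.argmax(logits))

def affordance_pipeline(global_ctx, local_ctx_of, decoders, n_templates=4):
    """Staged sampling mirroring the paper's decomposition of the task."""
    loc = sample_cvae(*decoders["loc"], global_ctx)       # 1. where to place a person
    local_ctx = local_ctx_of(loc)                         # 2. encode context at that spot
    template = predict_template(local_ctx, n_templates)   #    and pick a pose template
    cond = np.concatenate([local_ctx, one_hot(template, n_templates)])
    scale = sample_cvae(*decoders["scale"], cond)         # 3. how large the pose is
    deform = sample_cvae(*decoders["deform"], cond)       # 4. template deformation
    return loc, template, scale, deform
```

Because each stage conditions on the output of the previous one, the hard joint distribution over (location, pose class, scale, deformation) is factored into four much simpler conditionals, which is the stated motivation for the decomposition.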
Problem

Research questions and friction points this paper is trying to address.

Predicting contextually valid human poses and placements in 2D scenes.
Taming the exponentially large space of probable pose and action variations.
Overcoming the limited datasets and methods for 2D human affordance prediction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention mechanism
Variational autoencoder (VAE)
Context-aware pose prediction
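Since conditional VAEs do the generative work in each stage, a minimal sketch of the two standard VAE building blocks the bullets refer to may help: the reparameterization trick used for differentiable sampling, and the closed-form KL regularizer in the training objective. This is textbook VAE machinery, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    # keeps the sample differentiable w.r.t. the encoder outputs mu, log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I),
    # the regularizer added to the reconstruction loss when training a VAE.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```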
πŸ”Ž Similar Papers
No similar papers found.