AI Summary
This paper addresses context-aware human affordance generation in 2D scenes, i.e., predicting physically plausible and semantically consistent human poses and spatial placements given the scene semantics. To overcome the challenges of modeling the tight pose-scene coupling and of poor interpretability, we propose a cross-modal attention mechanism in which spatial feature maps from two modalities mutually attend to each other and, for the first time, decouple the task into fine-grained subtasks: localization, pose-template selection, and scale/deformation modeling. Our method integrates variational autoencoders, cross-modal spatial cross-attention, and multi-scale scene-context encoding to enable multi-stage conditional sampling. Evaluated on complex 2D scenes, our approach significantly outperforms existing methods, markedly improving pose plausibility and scene alignment, and establishes an interpretable, generalizable foundation for action generation in embodied navigation and reasoning.
Abstract
Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging and non-trivial. Moreover, existing datasets and methods for human affordance prediction in 2D scenes remain significantly limited. In this paper, we propose a novel cross-attention mechanism that encodes the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled into individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene-context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template, conditioning on the local context and the template class. Our experiments show significant improvements over the previous baseline for human affordance injection into complex 2D scenes.
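The staged sampling pipeline described in the abstract can be sketched as follows. This is a minimal illustration of the control flow only: all function names, feature shapes, and the random stand-ins for the learned VAEs and the template classifier are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_global_context(scene):
    """Placeholder global scene-context encoding (mean-pooled features)."""
    return scene.mean(axis=(0, 1))

def sample_location_vae(global_ctx):
    """Stage 1: a conditional VAE samples a probable person location."""
    z = rng.standard_normal(2)                 # latent draw
    return np.clip(0.5 + 0.1 * z, 0.0, 1.0)   # normalized (x, y) in the scene

def encode_local_context(scene, loc):
    """Placeholder local context: features from a crop around the location."""
    h, w, _ = scene.shape
    y, x = int(loc[1] * (h - 1)), int(loc[0] * (w - 1))
    return scene[max(0, y - 8):y + 8, max(0, x - 8):x + 8].mean(axis=(0, 1))

def classify_pose_template(local_ctx, n_templates=10):
    """Stage 2: a classifier picks a pose template from the candidate set."""
    logits = rng.standard_normal(n_templates) + local_ctx.sum()
    return int(np.argmax(logits))

def sample_scale_and_deformation(local_ctx, template_id):
    """Stages 3-4: two conditional VAEs sample scale and deformation,
    conditioned on the local context and the chosen template class."""
    scale = float(np.exp(0.1 * rng.standard_normal()))  # positive scale
    deform = 0.05 * rng.standard_normal(16)             # per-keypoint offsets
    return scale, deform

scene = rng.random((64, 64, 3))  # dummy RGB scene features
loc = sample_location_vae(encode_global_context(scene))
local_ctx = encode_local_context(scene, loc)
template = classify_pose_template(local_ctx)
scale, deform = sample_scale_and_deformation(local_ctx, template)
```

The point of the staging is that each sampler conditions on the outputs of the previous ones (location, then template, then scale and deformation), which is what reduces the joint pose-scene problem to tractable subtasks.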