Learning Human-Object Interaction as Groups

📅 2025-10-21

🤖 AI Summary

Existing HOI detection methods mostly model pairwise relationships and so miss collective interactions in which multiple humans and objects take part in a joint activity. To address this, the paper proposes GroupHOI, a group-centric HOI detection framework that clusters humans and objects with a learnable proximity estimator built on spatial features derived from bounding boxes. Within each group, a soft correspondence computed via self-attention aggregates and dispatches contextual cues, and the transformer-based interaction decoder is enhanced with local contextual cues from HO-pair features to capture semantic similarity. The authors report that GroupHOI outperforms state-of-the-art methods on HICO-DET and V-COCO and achieves leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves higher-order interactions within groups.

📝 Abstract

Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
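The grouping step described in the abstract can be illustrated with a minimal sketch. The paper uses a learnable proximity estimator; the version below swaps in a fixed, hand-written score (center distance plus IoU) purely for illustration. The function names, the 0.5/0.5 weighting, the `scale` parameter, and the threshold are all assumptions and are not taken from GroupHOI.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def proximity(a: Box, b: Box, scale: float) -> float:
    """Hand-set stand-in for a learned proximity score (assumption)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    dist = math.hypot(ax - bx, ay - by) / scale
    return 0.5 * math.exp(-dist) + 0.5 * iou(a, b)


def group_boxes(boxes: List[Box], scale: float, threshold: float = 0.3) -> List[int]:
    """Assign a group id to every box by linking pairs whose score passes the threshold."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if proximity(boxes[i], boxes[j], scale) >= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(boxes))]


if __name__ == "__main__":
    detections = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110), (104, 100, 114, 110)]
    print(group_boxes(detections, scale=20.0))
```

In a trained model the score would come from a network over box-derived spatial features, and the hard threshold would likely be replaced by something differentiable; the sketch only shows how geometric proximity can partition detections into clusters.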

Problem

Research questions and friction points this paper is trying to address.

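Existing HOI detectors model only pairwise human-object relationships

Collective behaviors involving multiple humans and objects are overlooked when context is propagated across all entities or through bipartite graphs

Interaction recognition lacks local contextual cues that capture semantic similarity between HO pairs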

Innovation

Methods, ideas, or system contributions that make the work stand out.

Groups humans and objects via learnable proximity estimator

Computes soft correspondence using self-attention within groups

Enhances transformer decoder with local HO-pair semantic cues
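The second point above, soft correspondence within groups, can be sketched as self-attention restricted to members of the same group. This is an illustrative simplification rather than the paper's implementation: it uses identity query/key/value projections, a single head, and plain Python lists, and the function name `grouped_self_attention` is hypothetical.

```python
import math
from typing import List


def grouped_self_attention(feats: List[List[float]], groups: List[int]) -> List[List[float]]:
    """Scaled dot-product self-attention where each entity attends only to its own group.

    Assumption: queries, keys, and values are the raw features (no learned projections).
    """
    dim = len(feats[0])
    outputs = []
    for i, query in enumerate(feats):
        # Indices of entities sharing the query's group (always includes i itself).
        members = [j for j, g in enumerate(groups) if g == groups[i]]
        scores = [
            sum(q * k for q, k in zip(query, feats[j])) / math.sqrt(dim)
            for j in members
        ]
        # Numerically stable softmax over in-group scores: the soft correspondence.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Aggregate in-group values with the attention weights.
        outputs.append(
            [sum(w * feats[j][k] for w, j in zip(weights, members)) for k in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    group_ids = [0, 0, 1]
    for row in grouped_self_attention(features, group_ids):
        print(row)
```

Restricting the softmax to in-group members means an entity in a singleton group keeps its own feature unchanged, while entities in larger groups receive a weighted mix of their neighbors' features.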
