Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

📅 2026-04-02

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing human-object interaction (HOI) detection methods struggle to effectively leverage the rich contextual cues present in complex scenes, limiting their ability to accurately interpret intricate interactions. This work proposes InCoM-Net, a novel framework that introduces an instance-centric context mining mechanism to jointly model intra-instance, inter-instance, and global scene context by integrating semantic knowledge from vision-language models with instance features from object detectors. The framework incorporates an Instance-Centric Context Refinement (ICR) module and a Progressive Context Aggregation (ProCA) mechanism, enabling iterative fusion of multi-level contextual information to significantly enhance both semantic and spatial reasoning capabilities. Evaluated on the HICO-DET and V-COCO benchmarks, the proposed method achieves state-of-the-art performance, substantially outperforming existing HOI detection approaches.


📝 Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome this limitation, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
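The two-stage idea in the abstract (split VLM features into intra-instance, inter-instance, and global context, then fuse them step by step into detector features) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names (`context_token_sets`, `progressive_aggregate`), the box-mask grouping of patches, the unparameterized attention, and the residual fusion order are all assumptions made for illustration, and the real ICR and ProCA modules presumably use learned layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Plain scaled dot-product attention with no learned projections.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def box_mask(box, grid_h, grid_w):
    # Mark patch cells whose centers fall inside a normalized (x0, y0, x1, y1) box.
    x0, y0, x1, y1 = box
    ys = (np.arange(grid_h) + 0.5) / grid_h
    xs = (np.arange(grid_w) + 0.5) / grid_w
    in_y = (ys >= y0) & (ys <= y1)
    in_x = (xs >= x0) & (xs <= x1)
    return (in_y[:, None] & in_x[None, :]).reshape(-1)

def context_token_sets(vlm_feats, boxes, grid_h, grid_w):
    # Hypothetical stand-in for ICR: per instance, group VLM patch tokens into
    # intra-instance (own box), inter-instance (other boxes), and global (all).
    masks = np.stack([box_mask(b, grid_h, grid_w) for b in boxes])
    sets = []
    for i in range(len(boxes)):
        own = masks[i]
        others = np.delete(masks, i, axis=0).any(axis=0) & ~own
        everything = np.ones_like(own)
        # Fall back to all patches when a region contains no patch centers.
        sets.append([vlm_feats[m] if m.any() else vlm_feats
                     for m in (own, others, everything)])
    return sets

def progressive_aggregate(inst_feats, token_sets, num_rounds=2):
    # Hypothetical stand-in for ProCA: each instance feature attends to its
    # intra, inter, and global tokens in turn, with a residual update each time.
    fused = inst_feats.copy()
    for _ in range(num_rounds):
        for i, contexts in enumerate(token_sets):
            for tokens in contexts:
                fused[i] = fused[i] + attend(fused[i][None, :], tokens, tokens)[0]
    return fused

# Toy inputs: a 4x4 grid of 8-dim "VLM" patch features and two detected boxes.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 4, 4, 8
vlm_feats = rng.normal(size=(grid_h * grid_w, dim))
boxes = np.array([[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]])
inst_feats = rng.normal(size=(len(boxes), dim))

token_sets = context_token_sets(vlm_feats, boxes, grid_h, grid_w)
fused = progressive_aggregate(inst_feats, token_sets)
print(fused.shape)  # (2, 8)
```

In this sketch the fused per-instance features would then feed a pairwise interaction classifier, which is omitted here. The fixed intra, inter, global ordering is only one plausible reading of "progressive"; the released code at the URL above is the reference for the actual design.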

Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction Detection

Vision-Language Models

Contextual Reasoning

Instance-Centric Context

Semantic Priors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Instance-centric Context

Vision-Language Models

Human-Object Interaction Detection