Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions

📅 2025-03-20

🤖 AI Summary

This work addresses 3D human-object interaction (HOI) reconstruction from a single image in open-vocabulary settings, a problem hindered by the lack of real 3D annotations and by the restriction of existing data to indoor scenes. We propose an open-vocabulary 3D HOI approach designed for complex in-the-wild environments. To this end, we introduce Open3DHOI, the first open-vocabulary in-the-wild 3D HOI dataset, containing over 2.5k finely annotated samples. Our method features a single-image reconstruction pipeline with a novel Gaussian-HOI optimizer that jointly models spatial geometry and physically plausible contact regions. We further define several new 3D HOI understanding tasks. The framework combines monocular 3D reconstruction, differentiable Gaussian representations, contact-aware optimization, and open-vocabulary object generalization. Experiments demonstrate improvements over existing baselines in interaction geometry accuracy and contact localization. Data and code will be made publicly available.

📝 Abstract

Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work. Data and code will be publicly available at https://wenboran2002.github.io/3dhoi.

Problem

Research questions and friction points this paper is trying to address.

Reconstructing human-object interactions from single images

Overcoming limitations of 3D HOI datasets in real-world scenes

Developing a pipeline for fine-grained 3D HOI annotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline for annotating 3D HOI from single images

Open3DHOI: first open-vocabulary in-the-wild 3D HOI dataset

Gaussian-HOI optimizer for spatial interaction reconstruction
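
To make the notion of a "contact region" concrete, here is a minimal sketch, not the paper's Gaussian-HOI optimizer. It assumes the human and the object are each given as a list of 3D points and labels a human point as "in contact" when its nearest object point lies within a distance threshold. The function name contact_mask, the point-list representation, and the 0.02 threshold are all illustrative assumptions; none of them come from the paper.

```python
import math


def contact_mask(human_pts, object_pts, threshold=0.02):
    """Flag each human point whose nearest object point is within `threshold`.

    Hypothetical helper for illustration only: the distance-threshold rule and
    the default threshold are assumptions, not the paper's contact model.
    """
    return [
        min(math.dist(h, o) for o in object_pts) <= threshold
        for h in human_pts
    ]


# Toy example: two human points, two object points (units are arbitrary).
human = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
obj = [(0.0, 0.0, 0.01), (1.0, 1.0, 1.0)]
mask = contact_mask(human, obj)
print(mask)  # [True, False]
```

In this toy case only the first human point is flagged, because it sits 0.01 away from an object point while the second is about 0.99 away from its nearest one. A learned or optimized contact model, as the paper describes, would replace this fixed threshold.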
