InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

📅 2025-04-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of estimating 3D contact points on human bodies and objects from a single in-the-wild image and using them for joint human-object reconstruction, a task complicated by occlusion, depth ambiguity, and widely varying object shapes. It proposes InteractVLM, which fine-tunes a large Vision-Language Model (VLM) with limited 3D contact data and bridges its 2D reasoning to 3D through a Render-Localize-Lift module: (1) multi-view rendering embeds the 3D body and object surfaces in 2D; (2) a multi-view localization model (MV-Loc) infers contact regions in the rendered 2D views; and (3) the 2D contacts are lifted back to 3D. The work also introduces the task of Semantic Human Contact estimation, in which human contact predictions are conditioned explicitly on object semantics. InteractVLM outperforms existing work on contact estimation and facilitates joint human-object 3D reconstruction from a single image.

📝 Abstract

We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus, we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image. Code and models are available at https://interactvlm.is.tue.mpg.de.
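
The "lift" step of Render-Localize-Lift can be illustrated with a small NumPy sketch. The abstract does not say how InteractVLM performs the lifting, so everything here is an assumption made for illustration: the function names `project` and `lift_contacts`, the pinhole camera tuples `(K, R, t)`, the averaging of per-view scores, and the 0.5 threshold are all hypothetical, and occlusion handling is omitted.

```python
import numpy as np

def project(vertices, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) pixels plus depth."""
    cam = vertices @ R.T + t                    # world -> camera coordinates
    depth = cam[:, 2]
    safe = np.where(depth > 0, depth, 1.0)      # avoid dividing by <= 0 depth
    uv = (cam @ K.T)[:, :2] / safe[:, None]
    return uv, depth

def lift_contacts(vertices, masks, cameras, threshold=0.5):
    """Average per-view 2D contact scores at each vertex's projection.

    masks:   list of (H, W) float arrays in [0, 1], one per rendered view.
    cameras: list of (K, R, t) tuples matching the masks.
    Assumption: every vertex that projects inside a view is treated as
    visible there (no occlusion test), which a real pipeline would need.
    """
    n = len(vertices)
    score_sum = np.zeros(n)
    hits = np.zeros(n)
    for mask, (K, R, t) in zip(masks, cameras):
        h, w = mask.shape
        uv, depth = project(vertices, K, R, t)
        px = np.rint(uv).astype(int)            # nearest-pixel lookup
        inside = (
            (depth > 0)
            & (px[:, 0] >= 0) & (px[:, 0] < w)
            & (px[:, 1] >= 0) & (px[:, 1] < h)
        )
        score_sum[inside] += mask[px[inside, 1], px[inside, 0]]
        hits[inside] += 1
    # Vertices never seen in any view get score 0.
    scores = np.divide(score_sum, hits, out=np.zeros(n), where=hits > 0)
    return scores >= threshold
```

Averaging across views is only one possible aggregation; a maximum or a learned fusion would fit the same interface.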

Problem

Research questions and friction points this paper is trying to address.

Estimating 3D human-object contact from 2D images

Overcoming occlusions and depth ambiguities in 3D reasoning

Reducing reliance on expensive 3D contact annotations

Innovation

Methods, ideas, or system contributions that make the work stand out.

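Fine-tunes large Vision-Language Models with limited 3D contact data for 3D contact estimation

Introduces a Render-Localize-Lift module that renders 3D surfaces into multiple 2D views, localizes contact in 2D, and lifts it to 3D

Proposes the Semantic Human Contact estimation task, which conditions human contact predictions on object semantics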
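
One way to picture the difference between plain and semantic human contact is as a data layout: a single per-vertex mask versus one mask per object label. The sketch below is an illustration under assumptions, not the paper's representation; the function name `semantic_contact_masks`, the vertex indices, and the object labels are hypothetical.

```python
import numpy as np

def semantic_contact_masks(num_vertices, contacts_by_object):
    """Build one boolean body-vertex mask per object label, plus their union.

    contacts_by_object: dict mapping an object label (hypothetical, e.g.
    "chair") to the body-vertex indices in contact with that object.
    """
    masks = {}
    for label, idx in contacts_by_object.items():
        m = np.zeros(num_vertices, dtype=bool)
        m[np.asarray(idx, dtype=int)] = True
        masks[label] = m
    # Object-agnostic (binary) contact is the union over all objects.
    binary = np.zeros(num_vertices, dtype=bool)
    for m in masks.values():
        binary |= m
    return masks, binary
```

For a person sitting on a chair with a laptop, the binary mask says only which body vertices touch something, while the per-label masks say which vertices touch which object.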
