Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

πŸ“… 2025-10-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
We address the challenge of simultaneously achieving motion fidelity, diversity, and spatiotemporal coupling in text-driven two-person interaction generation. We propose InterCompose, a synthetic-data framework that uses large language models to generate interaction descriptions and combines single-person motion priors with a neural motion evaluator to construct high-quality paired interaction sequences. Complementing this, we introduce InterActor, a generative model featuring a word-level text-conditioned encoder and a reactive interaction generator, augmented by an adaptive interaction loss that explicitly enforces physically consistent spatiotemporal dependencies between agents. Our approach significantly outperforms prior methods in motion realism, diversity, and cross-scenario generalization, supports out-of-distribution interaction synthesis, and is validated through user studies.

πŸ“ Abstract
Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose Text2Interact, a framework designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for one agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for the other agent, and uses a neural motion evaluator to filter weak or misaligned samples, expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.
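The synthesis-by-composition loop described in the abstract (retrieve candidate single-person motions, generate a reaction for the second agent, filter pairs with a neural motion evaluator) can be sketched as follows. All function names, signatures, and the acceptance threshold are illustrative assumptions, not the paper's implementation:

```python
def compose_interactions(prompt, retrieve_motions, react_generator,
                         motion_evaluator, score_threshold=0.5):
    """Sketch of InterCompose-style data synthesis: pair retrieved
    single-person motions with generated reactions, keeping only pairs
    the evaluator scores as plausible and text-aligned."""
    accepted = []
    for actor_motion in retrieve_motions(prompt):
        # Generate a reactive motion for the second agent, conditioned
        # on the prompt and the first agent's motion.
        reactor_motion = react_generator(prompt, actor_motion)
        # A learned evaluator filters weak or misaligned samples.
        score = motion_evaluator(prompt, actor_motion, reactor_motion)
        if score >= score_threshold:
            accepted.append((actor_motion, reactor_motion, score))
    return accepted
```

In this reading, the evaluator acts as a quality gate, so interaction coverage grows from existing single-person priors without additional motion capture.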
Problem

Research questions and friction points this paper is trying to address.

Generating realistic two-person interactions from text descriptions
Overcoming limited training data for human-human interaction modeling
Improving fine-grained text-to-interaction alignment and spatiotemporal coordination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable synthesis pipeline generates interaction data
Word-level conditioning preserves fine-grained text cues
Adaptive interaction loss improves physical plausibility
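One way to read the adaptive interaction loss is as a penalty on errors in pairwise inter-person joint distances, weighted so that contextually relevant pairs (e.g. joints in or near contact) dominate. The sketch below uses ground-truth proximity for the weighting; the shapes, the exponential weighting scheme, and all names are assumptions, not the paper's formulation:

```python
import numpy as np

def adaptive_interaction_loss(pred_a, pred_b, gt_a, gt_b, tau=1.0):
    """Weighted squared error on inter-person joint-pair distances.
    Inputs have shape (T, J, 3): T frames, J joints per person."""
    # Pairwise distances between person A's and B's joints, per frame:
    # (T, J, 1, 3) - (T, 1, J, 3) broadcasts to (T, J, J, 3).
    d_pred = np.linalg.norm(pred_a[:, :, None, :] - pred_b[:, None, :, :], axis=-1)
    d_gt = np.linalg.norm(gt_a[:, :, None, :] - gt_b[:, None, :, :], axis=-1)
    # Closer ground-truth pairs get exponentially larger weights, so
    # contacting joints contribute more than distant, irrelevant ones.
    w = np.exp(-d_gt / tau)
    w = w / w.sum()
    return float((w * (d_pred - d_gt) ** 2).sum())
```

The loss is zero when predicted inter-person distances match the ground truth exactly, and errors on close joint pairs are amplified relative to errors on far-apart ones.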
πŸ”Ž Similar Papers
No similar papers found.
Qingxuan Wu
University of Pennsylvania
Zhiyang Dou
University of Pennsylvania, The University of Hong Kong
Chuan Guo
Snap Inc.
Yiming Huang
University of Pennsylvania
Qiao Feng
University of Pennsylvania
Bing Zhou
Snap Inc.
Jian Wang
Snap Inc.
Lingjie Liu
Assistant Professor at UPenn
Computer Graphics · Computer Vision · Deep Learning