HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

πŸ“… 2025-03-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current text-to-video (T2V) models struggle to generate human-object interaction (HOI) scenes accurately, primarily because large-scale, high-fidelity HOI video-text paired datasets do not exist. To address this, we introduce HOIGen-1M, the first million-scale, high-quality HOI video generation dataset. We propose a fine-grained video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy, design two HOI-specific evaluation metrics that assess generated videos in a coarse-to-fine manner, and establish an automated filtering and cleaning framework that combines multimodal large language model (MLLM)-driven scoring with human verification. Experiments demonstrate that HOIGen-1M significantly improves the semantic accuracy and physical plausibility of mainstream T2V models on HOI generation tasks. Our work provides both a benchmark dataset and a methodological paradigm for HOI-aware video generation.
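The summary describes an automated curation framework in which an MLLM scores candidate videos and ambiguous cases are forwarded to human annotators. A minimal sketch of that triage idea, assuming illustrative thresholds and a stand-in scorer (the function names, thresholds, and scores are hypothetical, not from the paper):

```python
# Hypothetical sketch of MLLM-scored curation with human verification.
# score_fn stands in for an MLLM quality judgment; thresholds are assumptions.

def curate(videos, score_fn, keep_above=0.8, review_below=0.5):
    """Split candidate videos into kept / rejected / needs-human-review."""
    kept, rejected, review = [], [], []
    for v in videos:
        s = score_fn(v)
        if s >= keep_above:
            kept.append(v)          # confidently high quality: keep
        elif s < review_below:
            rejected.append(v)      # confidently low quality: discard
        else:
            review.append(v)        # ambiguous: forward to human annotators
    return kept, rejected, review

# Toy scores standing in for MLLM outputs on three clips.
scores = {"clip_a": 0.92, "clip_b": 0.30, "clip_c": 0.65}
kept, rejected, review = curate(scores, scores.get)
# kept = ["clip_a"], rejected = ["clip_b"], review = ["clip_c"]
```

The two-threshold split reserves human effort for borderline clips only, which is the usual motivation for combining automatic scoring with manual cleaning.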

πŸ“ Abstract
Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also mitigates the hallucinations of individual MLLMs. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at https://liuqi-creat.github.io/HOIGen.github.io.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale HOI video datasets with accurate captions
Current T2V models fail to generate precise human-object interactions
Absence of evaluation metrics for HOI video generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale HOI dataset with curated videos
MoME strategy for accurate video captions
New metrics for coarse-to-fine evaluation
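The MoME captioning strategy aggregates several MLLM captioners so that details hallucinated by any single model are suppressed. One common way to realize such aggregation is majority voting over caption tokens; the sketch below illustrates that idea with toy captions (the voting scheme, function name, and examples are illustrative assumptions, not the paper's actual method):

```python
# Illustrative consensus captioning: keep only tokens that a majority of
# "expert" captioners agree on, in the first expert's word order.
from collections import Counter

def consensus_caption(expert_captions, min_votes=2):
    """Drop tokens mentioned by fewer than min_votes experts."""
    votes = Counter()
    for cap in expert_captions:
        votes.update(set(cap.lower().split()))  # one vote per expert per token
    first = expert_captions[0].lower().split()
    return " ".join(w for w in first if votes[w] >= min_votes)

captions = [
    "a person pours water into a red glass",   # "red" is a hallucinated detail
    "a person pours water into a glass",
    "a person pours water into a glass cup",
]
print(consensus_caption(captions))  # -> "a person pours water into a glass"
```

Tokens supported by only one expert ("red", "cup") are filtered out, while content all experts agree on survives, which is the hallucination-suppression effect the summary attributes to MoME.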
πŸ”Ž Similar Papers
No similar papers found.
Authors

Kun Liu (JD Explore Academy)
Qi Liu (UCAS)
Xinchen Liu (JD Explore Academy)
Jie Li (JD Explore Academy)
Yongdong Zhang (USTC)
Jiebo Luo (University of Rochester)
Xiaodong He (JD Explore Academy)
Wu Liu (USTC)