🤖 AI Summary
To address insufficient semantic plausibility and geometric accuracy in human interaction motion generation within dynamic environments, this paper proposes a diffusion-based generative framework jointly guided by semantics and geometry at the text, affordance, and joint levels. Methodologically, it integrates CLIP-based textual encoding, functional graph modeling, and skeletal geometric constraint losses to enable multimodal conditional generation. It also introduces a cross-level semantic alignment mechanism intended to keep the synthesized motion consistent with task intent and physically feasible. Evaluated on three datasets, the approach achieves state-of-the-art performance and shows improved generalization to unseen interactive objects and scene layouts. Quantitative and qualitative results indicate better motion quality and environmental adaptability than existing methods.
📝 Abstract
Generating plausible, high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce SemGeoMo, a method for dynamic contextual human motion generation that leverages text-affordance-joint multi-level semantic and geometric guidance during generation, improving the semantic rationality and geometric correctness of the generated motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability across diverse interaction scenarios. The project page and code are available at https://4dvlab.github.io/project_page/semgeomo/.