🤖 AI Summary
Low-fidelity synthetic human motion data induces the "uncanny valley" effect in video understanding, severely limiting model generalization. To address this, we propose a method for motion video synthesis based on controllable 3D Gaussian avatars. We pioneer the integration of drivable 3D Gaussian splatting into pose transfer frameworks, enabling high-fidelity, temporally coherent human motion generation. We further introduce cross-domain background fusion and few-shot augmentation to increase background diversity and improve coverage of long-tail action categories. We release the RANDOM People dataset, a large-scale, identity-pose disentangled benchmark that supports few-shot extension. Extensive experiments on Toyota Smarthome and NTU RGB+D show significant gains in action recognition accuracy, along with improved robustness and generalization to unseen domains and rare classes.
📝 Abstract
In video understanding tasks, particularly those involving human motion, synthetic data generation often suffers from uncanny features that diminish its effectiveness for training. Tasks such as sign language translation, gesture recognition, and human motion understanding in autonomous driving have therefore been unable to exploit the full potential of synthetic data. This paper proposes a method for generating synthetic human action video data using pose transfer, specifically controllable 3D Gaussian avatar models. We evaluate the method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance on action recognition tasks. Moreover, we demonstrate that the method can effectively scale few-shot datasets, compensating for groups underrepresented in the real training data and adding diverse backgrounds. We open-source the method along with RANDOM People, a dataset of videos and avatars of novel human identities, crowd-sourced from the internet, for pose transfer.
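
To make the pipeline in the abstract concrete, here is a minimal, self-contained Python sketch of the data-generation loop it describes: drive a 3D Gaussian avatar with a pose sequence, composite the rendered person onto a background from another domain, and oversample rare action classes. This is an illustration only, not the released code; every name in it (`render_gaussian_avatar`, `composite_background`, `AvatarAsset`, `PoseSequence`, and so on) is a hypothetical placeholder, and the rendering and compositing steps are stubbed out.

```python
"""Illustrative sketch of the synthetic-data pipeline described above.

All identifiers are hypothetical placeholders; the open-sourced method
may be structured entirely differently.
"""
import random
from dataclasses import dataclass


@dataclass
class AvatarAsset:
    identity: str  # e.g. one avatar from the RANDOM People dataset


@dataclass
class PoseSequence:
    action_label: str  # e.g. a Toyota Smarthome or NTU RGB+D class
    frames: list       # per-frame pose parameters (placeholder)


def render_gaussian_avatar(avatar: AvatarAsset, pose: PoseSequence) -> list:
    """Stub: drive the 3D Gaussian avatar with the pose sequence and
    return rendered person-only frames (here, just string tokens)."""
    return [f"{avatar.identity}:frame{i}" for i, _ in enumerate(pose.frames)]


def composite_background(person_frames: list, background: str) -> list:
    """Stub: cross-domain background fusion -- paste the rendered person
    onto a background clip drawn from a different domain."""
    return [f"{frame}@{background}" for frame in person_frames]


def synthesize(avatars, pose_bank, backgrounds, per_class: int):
    """Generate `per_class` synthetic clips per action class by pairing
    random identities, pose sequences, and backgrounds. Because identity
    and pose are disentangled, any avatar can perform any pose, which is
    what lets rare (few-shot) classes be oversampled."""
    dataset = []
    for label, poses in pose_bank.items():
        for _ in range(per_class):
            avatar = random.choice(avatars)
            pose = random.choice(poses)
            clip = composite_background(
                render_gaussian_avatar(avatar, pose),
                random.choice(backgrounds),
            )
            dataset.append((clip, label))
    return dataset


if __name__ == "__main__":
    avatars = [AvatarAsset("person_a"), AvatarAsset("person_b")]
    pose_bank = {"drink": [PoseSequence("drink", frames=list(range(4)))]}
    clips = synthesize(avatars, pose_bank,
                       backgrounds=["kitchen", "office"], per_class=3)
    print(len(clips), "synthetic clips")
```

Under these assumptions, the expansion factor is simply `avatars × poses × backgrounds` per class, which is why a handful of real clips for an underrepresented class can be scaled into many varied synthetic training videos.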