🤖 AI Summary
Diffusion models for generating IMU data often lack physical plausibility, particularly exhibiting poor temporal consistency in synthesized acceleration signals.
Method: This paper proposes a text-to-IMU generation framework that introduces, for the first time, an acceleration-oriented second-order temporal consistency constraint—implemented via an acceleration second-order finite difference loss (L_acc)—into the diffusion model fine-tuning objective. The approach integrates virtual sensor simulation, surface-based human motion modeling, and low-dimensional embedding distribution assessment, with end-to-end human activity recognition (HAR) serving as the primary validation metric.
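The paper does not spell out the exact form of L_acc here; a minimal sketch, assuming it is a mean-squared penalty on the discrete second-order temporal differences of the generated versus reference signals (the function name and formulation are hypothetical illustrations, not the authors' implementation):

```python
import numpy as np

def acc_second_order_loss(generated: np.ndarray, reference: np.ndarray) -> float:
    """Hypothetical sketch of an acceleration-oriented second-order
    finite difference loss (L_acc).

    generated, reference: arrays of shape (T, C) -- T time steps, C channels.
    Penalises mismatch in the discrete second-order temporal differences,
    i.e. the acceleration-like structure of the signals.
    """
    # Discrete second-order difference: x[t+1] - 2*x[t] + x[t-1]
    d2_gen = generated[2:] - 2 * generated[1:-1] + generated[:-2]
    d2_ref = reference[2:] - 2 * reference[1:-1] + reference[:-2]
    # Mean squared error over the second-order differences
    return float(np.mean((d2_gen - d2_ref) ** 2))
```

Because the penalty acts on second-order differences, any two signals that differ only by a linear (constant-velocity) trend incur zero loss; only curvature mismatches are penalised.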
Contribution/Results: Experiments show L_acc decreases by 12.7% compared to the baseline; synthetic IMU data exhibit significantly improved alignment with real-data distributions in the embedding space; and HAR models trained solely on synthetic data achieve 8.7% higher accuracy than those trained on prior diffusion-generated data, and 7.6% higher than the best existing baseline.
📝 Abstract
We propose a text-to-IMU (inertial measurement unit) motion-synthesis framework that obtains realistic IMU data by fine-tuning a pretrained diffusion model with an acceleration-based second-order loss (L_acc). L_acc enforces consistency in the discrete second-order temporal differences of the generated motion, thereby aligning the diffusion prior with IMU-specific acceleration patterns. We integrate L_acc into the training objective of an existing diffusion model, fine-tune the model to obtain an IMU-specific motion prior, and evaluate it within an existing text-to-IMU framework comprising surface modelling and virtual sensor simulation. We analyse acceleration signal fidelity and the differences between the synthetic motion representation and actual IMU recordings. As a downstream application, we evaluate human activity recognition (HAR) and compare classification performance using data from our method against the earlier diffusion model and two additional diffusion-model baselines. When we augmented the earlier diffusion model's objective with L_acc and continued training, L_acc decreased by 12.7% relative to the original model. The improvements were considerably larger for high-dynamic activities (e.g., running, jumping) than for low-dynamic activities (e.g., sitting, standing). In a low-dimensional embedding, the synthetic IMU data produced by our refined model shift closer to the distribution of real IMU recordings. A HAR classifier trained exclusively on our refined synthetic IMU data improved accuracy by 8.7% over the earlier diffusion model and by 7.6% over the best-performing baseline diffusion model. We conclude that acceleration-aware diffusion refinement effectively aligns motion generation with IMU synthesis, highlighting the flexibility of deep learning pipelines for specialising generic text-to-motion priors to sensor-specific tasks.
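The abstract's low-dimensional embedding comparison is not specified in detail; one simple way such a distribution shift could be quantified is to project real and synthetic feature vectors into a shared PCA embedding and measure the distance between their means (the function name, PCA choice, and mean-distance score are assumptions for illustration, not the paper's method):

```python
import numpy as np

def embedding_shift(real: np.ndarray, synth: np.ndarray, dim: int = 2) -> float:
    """Hypothetical sketch: project real and synthetic feature vectors
    (shape (N, F)) into a shared low-dimensional PCA embedding and
    report the distance between their means as a crude
    distribution-alignment score (lower = better aligned).
    """
    combined = np.vstack([real, synth])
    centred = combined - combined.mean(axis=0)
    # PCA via SVD: rows of vt are principal directions
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[:dim].T
    real_mean = proj[: len(real)].mean(axis=0)
    synth_mean = proj[len(real):].mean(axis=0)
    return float(np.linalg.norm(real_mean - synth_mean))
```

In practice, richer comparisons (e.g., distances between full distributions rather than means) would capture the "shift closer" claim more faithfully, but a mean-distance score already illustrates the idea of assessing alignment in a shared embedding space.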