🤖 AI Summary
For human activity recognition (HAR) from multidimensional sensor time series, this paper proposes a self-supervised Transformer architecture tailored to numerical signals. To model raw sensor streams effectively without abundant labeled data, the method introduces two key components: (1) an n-dimensional linear embedding combined with numerical binning that converts continuous sensor measurements into language-like token sequences; and (2) a lightweight linear output head integrated into a self-supervised pretraining framework, removing the reliance on large-scale annotated datasets. Evaluated on five mainstream HAR benchmarks, the approach achieves 10–15% higher accuracy than a standard Transformer and shows markedly better cross-device generalization. By enabling representation learning directly from unlabeled numerical time series, it offers a scalable, low-resource paradigm for time-series perception modeling.
📝 Abstract
We developed a deep learning algorithm for human activity recognition that takes sensor signals as input. In this study, we built a pretrained model based on the Transformer architecture widely used in natural language processing, and leveraged it to improve performance on the downstream task of human activity recognition. While this task can be addressed with a vanilla Transformer, we propose an enhanced n-dimensional numerical-processing Transformer that incorporates three key features: embedding n-dimensional numerical data through a linear layer, binning-based pre-processing, and a linear transformation in the output layer. We evaluated the effectiveness of the proposed model on five different datasets. Compared to the vanilla Transformer, our model demonstrated accuracy improvements of 10–15%.
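To make the first two components concrete, here is a minimal NumPy sketch of binning-based pre-processing and a linear embedding of n-dimensional sensor samples. The bin count, value range, embedding width, and the 6-axis IMU example are illustrative assumptions, not values from the paper, and the Transformer encoder and output head are omitted:

```python
import numpy as np

def bin_signal(x, n_bins=32, lo=-1.0, hi=1.0):
    """Discretize continuous sensor values into integer bin indices
    (token-like codes). n_bins and the [lo, hi] range are assumed."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # np.digitize returns 1..n_bins for in-range values; shift to 0-based
    # and clip out-of-range readings into the first/last bin.
    return np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

class LinearEmbedding:
    """Project each n-dimensional sensor sample to a d_model-dim vector,
    one token per time step (the paper's linear embedding layer)."""
    def __init__(self, n_dims, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(n_dims, d_model))
        self.b = np.zeros(d_model)

    def __call__(self, x):           # x: (seq_len, n_dims)
        return x @ self.W + self.b   # -> (seq_len, d_model)

# Example: a 100-sample window from a hypothetical 6-axis IMU
# (3-axis accelerometer + 3-axis gyroscope), values scaled to [-1, 1].
window = np.random.default_rng(1).uniform(-1.0, 1.0, size=(100, 6))
tokens = bin_signal(window)                      # (100, 6) discrete codes
embedded = LinearEmbedding(n_dims=6, d_model=64)(window)  # (100, 64)
print(tokens.shape, embedded.shape)
```

In a full model, the embedded sequence would feed a Transformer encoder pretrained self-supervisedly, with a final linear layer mapping encoder outputs to activity classes.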