🤖 AI Summary
This work addresses the question of how to arrange and fuse multiple networks for sensor-based human activity recognition (HAR) by proposing a two-level neural network architecture with dual-stage feature fusion. In addition to late fusion, which combines the outputs of the first network level, the method adds intermediate fusion, which integrates features from both the first and second levels, and it builds the levels from CNNs, LSTMs, and convolutional LSTMs. After evaluating 15 network configurations, the authors report that architectures using both fusion stages achieve higher accuracy than late-fusion-only architectures on two public HAR benchmark datasets, and that the optimal configuration outperforms baseline models. These results support the effectiveness of the proposed dual-stage fusion strategy for HAR.
📝 Abstract
Human activity recognition (HAR) is the task of identifying human actions and activities from sensor data. Neural networks such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, convolutional LSTMs, and hybrids of these have demonstrated strong performance across many research domains. Building a multilevel HAR model, whether from a single network type or a hybrid, involves strategically integrating multiple networks to capitalize on their complementary strengths, and the structural arrangement of these components is a critical factor in overall performance. This study explores a novel two-level network architecture with dual-stage feature fusion: late fusion, which combines the outputs of the first network level, and intermediate fusion, which integrates the features from both the first and second levels. We evaluated 15 network architectures built from CNNs, LSTMs, and convolutional LSTMs, using late fusion with and without intermediate fusion, to identify the optimal configuration. Experiments on two public benchmark datasets show that architectures incorporating both late and intermediate fusion achieve higher accuracy than those relying on late fusion alone. Moreover, the optimal configuration outperforms baseline models, validating its effectiveness for HAR.
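The data flow of the two fusion stages can be illustrated with a minimal wiring sketch. This is not the authors' implementation: the abstract does not specify the fusion operator or the exact connections, so concatenation and the wiring below are assumptions, and `branch_a`, `branch_b`, `second_level`, and `forward` are hypothetical stand-ins for real CNN, LSTM, or convolutional LSTM blocks.

```python
from typing import List, Sequence

Vector = List[float]


def concat(*parts: Sequence[float]) -> Vector:
    """Fuse feature vectors by concatenation (an assumed fusion operator)."""
    fused: Vector = []
    for part in parts:
        fused.extend(part)
    return fused


def branch_a(window: Sequence[float]) -> Vector:
    # Hypothetical stand-in for one first-level network (e.g. a CNN branch).
    return [sum(window) / len(window), max(window) - min(window)]


def branch_b(window: Sequence[float]) -> Vector:
    # Hypothetical stand-in for another first-level network (e.g. an LSTM branch).
    return [window[-1] - window[0]]


def second_level(features: Sequence[float]) -> Vector:
    # Hypothetical stand-in for the second-level network.
    return [sum(features), sum(f * f for f in features)]


def forward(window: Sequence[float], use_intermediate_fusion: bool) -> Vector:
    """Return the feature vector that a classifier head would consume."""
    first_level = [branch(window) for branch in (branch_a, branch_b)]
    # Late fusion: combine the outputs of the first network level.
    late = concat(*first_level)
    second = second_level(late)
    if use_intermediate_fusion:
        # Intermediate fusion: integrate first- and second-level features.
        return concat(late, second)
    return second


window = [1.0, 2.0, 3.0]  # toy sensor window, not real data
print(forward(window, use_intermediate_fusion=False))  # [6.0, 12.0]
print(forward(window, use_intermediate_fusion=True))   # [2.0, 2.0, 2.0, 6.0, 12.0]
```

With intermediate fusion enabled, the final feature vector keeps the first-level features alongside the second-level ones, so the classifier sees both levels of representation rather than only the deepest one.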