AI Summary
This work addresses the challenge of enabling autonomous underwater vehicles to interpret diver behavior for effective collaboration and safety assurance in high-risk underwater environments. To this end, the authors propose DAR-Net, a novel framework that integrates Transformer-based temporal modeling with pixel-level semantic supervision, leveraging a multi-task loss to jointly achieve global activity recognition and local human-robot interaction alignment. The study introduces a semantics-guided learning paradigm and presents the first Underwater Diver Activity (UDA) dataset, comprising over 2,600 images with pixel-level mask annotations, which mitigates challenges posed by poor visibility and data scarcity. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches across six diver activity categories, establishing a foundation for intelligent underwater collaborative systems.
Abstract
Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.
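The multi-loss training strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual loss terms, their weights, or the number of segmentation classes, so everything below is an assumption: a cross-entropy term over the six activity classes is added to a mean per-pixel cross-entropy term for the mask, scaled by a hypothetical `seg_weight`. Plain Python is used instead of a deep learning framework so the arithmetic stays visible.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits (log-sum-exp form for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multi_task_loss(activity_logits, activity_label,
                    pixel_logits, pixel_labels, seg_weight=0.5):
    """Combine a global activity loss with a pixel-level segmentation loss.

    activity_logits: one score per activity class (six in the abstract).
    pixel_logits:    one list of class scores per pixel (flattened mask).
    seg_weight:      hypothetical balancing factor, not taken from the paper.
    """
    # Global term: which activity is the diver performing?
    cls_loss = cross_entropy(activity_logits, activity_label)
    # Local term: mean per-pixel loss against the annotated mask.
    seg_loss = sum(
        cross_entropy(logits, label)
        for logits, label in zip(pixel_logits, pixel_labels)
    ) / len(pixel_labels)
    return cls_loss + seg_weight * seg_loss


# Toy example: six activity classes, four pixels with two mask classes each.
example_loss = multi_task_loss(
    activity_logits=[2.0, 0.1, -1.0, 0.3, 0.0, -0.5],
    activity_label=0,
    pixel_logits=[[1.5, -0.5], [0.2, 0.9], [-1.0, 2.0], [0.7, 0.1]],
    pixel_labels=[0, 1, 1, 0],
)
print(f"combined loss: {example_loss:.4f}")
```

Because both terms are backpropagated through a shared encoder in a setup like this, the segmentation term pushes the features used for activity classification to stay consistent with the pixel-level scene annotation, which is the alignment the abstract refers to.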