🤖 AI Summary
Existing approaches struggle to detect diverse forms of social interaction (face-to-face, virtual, and hybrid) in naturalistic everyday settings, as they are often constrained to controlled environments or rely on strong assumptions. This work proposes a smartwatch-based, on-device multimodal system that integrates foreground speech detection with context-aware modeling to enable, for the first time, real-time detection of varied social interactions in unconstrained, real-world conditions. Using a foreground speech detector trained on a public dataset, the system identified 1,691 interaction episodes from over 900 hours of in-the-wild wearable data, 77.28% of which were validated by participants. A multimodal fusion model trained on participant-labeled data further achieves a balanced accuracy of 90.36% and a sensitivity of 91.17% across 33,698 fifteen-second windows, relaxing conventional assumptions such as requiring multiple speakers within fixed temporal windows.
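The summary distinguishes window-level detections from whole interaction episodes. As a rough illustration of that distinction, the sketch below merges consecutive positive 15-second windows into episodes with durations; the merging rule and the `gap_tolerance` parameter are assumptions for illustration, not the paper's algorithm.

```python
from dataclasses import dataclass

WINDOW_SEC = 15  # window length used for evaluation in the paper

@dataclass
class Episode:
    start_sec: int
    end_sec: int

    @property
    def duration_sec(self) -> int:
        return self.end_sec - self.start_sec

def merge_windows(preds: list[bool], gap_tolerance: int = 1) -> list[Episode]:
    """Merge runs of positive windows into episodes, bridging short gaps.

    `gap_tolerance` is a hypothetical smoothing knob: up to that many
    consecutive negative windows inside a run are bridged over.
    """
    episodes, start, gap = [], None, 0
    for i, positive in enumerate(preds):
        if positive:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap > gap_tolerance:
                episodes.append(Episode(start * WINDOW_SEC, (i - gap + 1) * WINDOW_SEC))
                start, gap = None, 0
    if start is not None:  # close a run that reaches the end of the stream
        episodes.append(Episode(start * WINDOW_SEC, (len(preds) - gap) * WINDOW_SEC))
    return episodes

# Two episodes: one 60 s long (with a bridged one-window gap) and one 15 s long.
print([e.duration_sec for e in
       merge_windows([True, True, False, True, False, False, True])])
```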
📝 Abstract
Social interactions are fundamental to well-being, yet automatically detecting them in daily life, particularly with wearables, remains underexplored. Most existing systems are evaluated in controlled settings, focus primarily on in-person interactions, or rely on restrictive assumptions (e.g., requiring multiple speakers within fixed temporal windows), limiting their generalizability to real-world use. We present an on-watch interaction detection system designed to capture diverse interactions in naturalistic settings. A core component is a foreground speech detector trained on a public dataset. Evaluated on over 100,000 labeled foreground speech and background sound instances, the detector achieves a balanced accuracy of 85.51%, outperforming prior work by 5.11%.
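For context on the reported metric: balanced accuracy for this binary foreground-vs-background task is the mean of the two per-class recalls, which stays informative when the classes are imbalanced. A minimal sketch follows; the labels and predictions are placeholders, not the paper's data.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Placeholder labels: 1 = foreground speech, 0 = background sound.
# The paper's evaluation spans >100,000 labeled instances.
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1])

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # foreground recall
specificity = recall_score(y_true, y_pred, pos_label=0)  # background recall

# Balanced accuracy = mean of per-class recalls.
print((sensitivity + specificity) / 2)
print(balanced_accuracy_score(y_true, y_pred))  # same value via scikit-learn
```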
We evaluated the system in a real-world deployment (N=38) with over 900 hours of total smartwatch wear time. The system detected 1,691 interactions, of which 77.28% were confirmed via participant self-report, with durations ranging from under one minute to over one hour. Among correct detections, 81.45% were in-person, 15.7% virtual, and 1.85% hybrid. Leveraging participant-labeled data, we further developed a multimodal model achieving a balanced accuracy of 90.36% and a sensitivity of 91.17% on 33,698 labeled 15-second windows. These results demonstrate the feasibility of real-world interaction sensing and open the door to adaptive, context-aware systems that respond to users' dynamic social environments.
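The abstract does not detail the window-level pipeline beyond the 15-second granularity and multimodal fusion, so the following is only a plausible sketch: non-overlapping 15-second windows are cut from synchronized audio and motion streams, and a toy late-fusion score stands in for the actual speech detector and context model. All sampling rates, features, and the 0.5 threshold are assumptions.

```python
import numpy as np

WINDOW_SEC = 15      # evaluation granularity reported in the paper
AUDIO_HZ = 16_000    # assumed audio sampling rate
IMU_HZ = 50          # assumed motion-sensor sampling rate

def windows(stream: np.ndarray, rate: int):
    """Yield consecutive non-overlapping WINDOW_SEC-long slices of a stream."""
    step = rate * WINDOW_SEC
    for start in range(0, len(stream) - step + 1, step):
        yield stream[start:start + step]

def fused_score(audio_win: np.ndarray, imu_win: np.ndarray) -> float:
    """Toy late fusion: average two per-modality scores.

    These placeholder statistics stand in for the paper's foreground
    speech detector and context model, which are not specified here.
    """
    speech_score = float(np.mean(np.abs(audio_win)))  # placeholder audio cue
    motion_score = float(np.std(imu_win))             # placeholder context cue
    return 0.5 * speech_score + 0.5 * motion_score

rng = np.random.default_rng(0)
audio = rng.standard_normal(AUDIO_HZ * 60)  # one synthetic minute of audio
imu = rng.standard_normal(IMU_HZ * 60)      # one synthetic minute of motion

# One interaction/no-interaction decision per 15-second window.
decisions = [fused_score(a, m) > 0.5
             for a, m in zip(windows(audio, AUDIO_HZ), windows(imu, IMU_HZ))]
print(decisions)
```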