🤖 AI Summary
This study addresses the lack of naturalness and scalability in mid-air gesture control for smart rooms. We propose a linguistically inspired triplet gesture grammar—comprising vocative, imperative, and complement components—in which English initial letters and acronyms map directly to gesture meanings. Methodologically, we integrate multimodal sensing from smartphones, smartwatches, and depth cameras to build a trainable cross-device recognition layer, a custom gesture compiler, and an extensible gesture lexicon. A 12-participant user study shows that the grammar is intuitive and efficient to learn (average learning time under 15 minutes), with the smartphone emerging as the preferred interface due to user familiarity and robust recognition performance. Crucially, the system can be extended to new devices, objects, and gestures, supporting scalability and personalisation. Our core contribution is a mid-air gesture command paradigm that jointly addresses linguistic analogy, social acceptability, and generalisability to real-world scenarios.
📝 Abstract
Mid-air gesture interfaces have become popular in specific scenarios, such as interaction with augmented reality via head-mounted displays, specific controls over smartphones, or gaming platforms. This article explores the use of a location-aware, mid-air, gesture-based command triplet syntax to interact with a smart space. The syntax, inspired by human language, is built as a vocative case with an imperative structure. In a sentence like “Light, please switch on!”, the object being activated is invoked by making a gesture that mimics its initial letter or acronym (the vocative, coincident with the sentence’s elliptical subject). A geometrical or directional gesture then identifies the action (imperative verb) and may include an object feature or a second object with which to network (complement), which is also represented by its initial letter or acronym. Technically, an interpreter relying on a trainable multidevice gesture recognition layer and a specific compiler makes the pair/triplet syntax decoding possible. The recognition layer works on acceleration and position input signals from graspable (smartphone) and free-hand devices (smartwatch and external depth cameras). In a deployment at a Living Lab facility, the syntax has been instantiated with a lexicon derived from English initial letters and acronyms. A within-subject study with twelve users has enabled the analysis of syntax acceptance (in terms of usability, gesture agreement for actions over objects, and social acceptance) and of technology preference across the syntax’s three device implementations (graspable, wearable, and device-free). Participants expressed consensus regarding the simplicity of learning the syntax and its potential effectiveness in managing smart resources.
Socially, participants favoured the Watch for outdoor activities and the Phone for home and work settings, underscoring the importance of social context in technology design. The Phone emerged as the preferred option for gesture recognition due to its efficiency and familiarity. The system, which can be adapted to different sensing technologies, addresses scalability concerns (it can be easily extended with new objects and actions) and allows for personalised interaction.
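To make the vocative–imperative–complement idea concrete, the following is a minimal sketch of how a recognised gesture sequence might be decoded into a smart-space command. All names here (the lexicon entries, gesture labels, and the `decode` function) are illustrative assumptions, not the paper's actual interpreter or compiler API.

```python
# Hypothetical decoding of the pair/triplet gesture syntax.
# Vocative: an initial-letter gesture names the object ("L" for light).
# Imperative: a geometrical/directional gesture names the action.
# Optional complement: a second initial-letter gesture names a feature
# or a second object to network with.

OBJECTS = {"L": "light", "T": "TV", "S": "speaker"}        # assumed lexicon
ACTIONS = {"up": "switch_on", "down": "switch_off"}        # assumed lexicon

def decode(gestures):
    """Map a recognised gesture pair or triplet to (object, action, complement)."""
    if len(gestures) not in (2, 3):
        raise ValueError("expected a pair or triplet of gestures")
    obj = OBJECTS[gestures[0]]                  # vocative (elliptical subject)
    act = ACTIONS[gestures[1]]                  # imperative verb
    comp = OBJECTS.get(gestures[2]) if len(gestures) == 3 else None
    return (obj, act, comp)

# "Light, please switch on!" as a gesture pair:
print(decode(["L", "up"]))        # → ('light', 'switch_on', None)
# A triplet networking the speaker with the TV:
print(decode(["S", "up", "T"]))   # → ('speaker', 'switch_on', 'TV')
```

Extending such a scheme to a new object or action amounts to adding a lexicon entry, which is the scalability property the abstract highlights.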