🤖 AI Summary
This work proposes a privacy-preserving, real-time conversational assistant for procedural tasks such as furniture assembly, operating exclusively on audio and inertial measurement unit (IMU) signals—eliminating the need for video input and thereby mitigating the computational overhead and privacy concerns inherent in existing video-based systems. The approach features a novel User Whim Agnostic (UWA) LoRA fine-tuning strategy that suppresses redundant dialogue while preserving the delivery of critical instructions. By pairing lightweight sensor inputs with an edge-deployable architecture, the system requires neither cloud resources nor few-shot prompting. Experimental results demonstrate that UWA fine-tuning improves the F-score by over 30% and accelerates inference by 16×, enabling efficient, low-latency end-to-end execution on resource-constrained wearable devices.
📝 Abstract
Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight, privacy-preserving modalities—audio and IMU inputs from a user's wearable device—to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task and answers user questions. We construct a dataset of conversations in which the assistant guides the user through the task. Observing that an off-the-shelf language model makes for a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA fine-tuning method that improves the model's ability to suppress less informative dialogue while maintaining its tendency to communicate important instructions. This leads to a >30% improvement in the F-score. Fine-tuning the model also yields a 16× speedup by eliminating the need for in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
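For readers unfamiliar with the mechanism underlying the UWA fine-tuning method, LoRA adapts a frozen pretrained weight matrix by adding a trainable low-rank update. The sketch below illustrates the core forward computation in plain NumPy; the dimensions, rank, and scaling values are hypothetical placeholders, and the paper's actual model configuration and UWA training objective are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only; the paper's actual
# model and LoRA configuration are not specified in this summary.
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                     # zero-initialized, so the adapter starts as a no-op

def lora_forward(x):
    # Base output plus scaled low-rank update: y = W x + (alpha / r) * B A x.
    # Only A and B would be updated during fine-tuning; W stays frozen.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)
# With B = 0, the adapted layer matches the frozen layer exactly.
assert np.allclose(y, W @ x)
```

Because only the low-rank factors are trained, this style of adaptation keeps the fine-tuning footprint small, which is consistent with the paper's emphasis on edge deployment.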