🤖 AI Summary
This work addresses the challenge of learning contact-rich fine manipulation, which is difficult using only vision and proprioception because contact events are only partially observed. The authors propose instrumenting the training setup: a fingertip-mounted microphone captures contact-induced audio, while the state of an instrumented button serves as a privileged supervision signal for fine-tuning an audio encoder into a contact-event detector. At inference time, the system relies solely on vision and audio, combined with imitation learning, to perform gentle and precise button pressing. By transforming audio signals into an effective contact representation, the approach matches the task success rates of baseline methods while significantly reducing contact forces, demonstrating its efficacy for tasks requiring delicate physical interaction.
📝 Abstract
Learning contact-rich manipulation is difficult from cameras and proprioception alone because contact events are only partially observed. We test whether training-time instrumentation, i.e., object sensorisation, can improve policy performance without creating deployment-time dependencies. Specifically, we study button pressing as a testbed and use a fingertip-mounted microphone to capture contact-relevant audio. We use an instrumented button-state signal as privileged supervision to fine-tune an audio encoder into a contact-event detector. We combine the resulting representation with imitation learning using three strategies, so that the policy uses only vision and audio during inference. Button-press success rates are similar across methods, but instrumentation-guided audio representations consistently reduce contact force. These results support instrumentation as a practical training-time auxiliary objective for learning contact-rich manipulation policies.
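To make the privileged-supervision step concrete, here is a minimal sketch of how an audio encoder might be fine-tuned into a contact-event detector using the instrumented button state as a binary label. The paper does not specify architectures or losses, so everything below is an assumption: the small CNN over log-mel spectrograms, the `AudioContactDetector` and `finetune_step` names, and the binary cross-entropy objective are illustrative only.

```python
import torch
import torch.nn as nn

class AudioContactDetector(nn.Module):
    """Hypothetical audio encoder plus binary head for contact detection.

    Assumes log-mel spectrogram input of shape (batch, n_mels, time);
    the architecture is illustrative, not the paper's.
    """
    def __init__(self, n_mels: int = 64, embed_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time: one embedding per clip
            nn.Flatten(),
        )
        self.contact_head = nn.Linear(embed_dim, 1)  # logit for "in contact"

    def embed(self, spec: torch.Tensor) -> torch.Tensor:
        # The embedding is what a downstream imitation policy would consume.
        return self.encoder(spec)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.contact_head(self.embed(spec))


def finetune_step(model: AudioContactDetector,
                  spec: torch.Tensor,
                  button_state: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    """One supervised step. button_state is the privileged label (1 = pressed),
    available only because the button is instrumented during training."""
    logits = model(spec).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, button_state.float()
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the label comes from the instrumented object rather than from any sensor the robot must carry, a detector trained this way consumes audio alone and imposes no deployment-time dependency, which is the property the abstract emphasises.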