🤖 AI Summary
This work addresses robotic manipulation failures caused by limited field-of-view (FoV). We propose an imitation learning framework integrating active neck motion to dynamically regulate visual perception. Methodologically, we design a systematic teleoperation data collection protocol that synchronously captures articulated neck pose and multi-view visual inputs, and introduce a novel neural architecture that explicitly models the dynamic coupling between neck articulation and hand-eye coordination. Our key contribution is the first integration of active visual control into an end-to-end imitation learning pipeline—overcoming the constraints of fixed-camera setups and enabling continuous perception and manipulation of objects outside the initial FoV. Experiments demonstrate a 90% task success rate under dynamic viewpoint perturbations, significantly outperforming fixed-FoV baselines. The approach exhibits superior robustness and generalization in edge-of-view and occluded scenarios.
📝 Abstract
Most prior research in deep imitation learning has predominantly utilized fixed cameras for image input, which constrains task performance to the predefined field of view. However, enabling a robot to actively maneuver its neck can significantly expand the scope of imitation learning to encompass a wider variety of tasks and expressive actions such as neck gestures. To facilitate imitation learning in robots capable of neck movement while simultaneously performing object manipulation, we propose a teaching system that systematically collects datasets incorporating neck movements while minimizing discomfort caused by dynamic viewpoints during teleoperation. In addition, we present a novel network model for learning manipulation tasks including active neck motion. Experimental results showed that our model can achieve a high success rate of around 90%, regardless of the distraction from the viewpoint variations by active neck motion. Moreover, the proposed model proved particularly effective in challenging scenarios, such as when objects were situated at the periphery or beyond the standard field of view, where traditional models struggled. The proposed approach contributes to the efficiency of dataset collection and extends the applicability of imitation learning to more complex and dynamic scenarios.