Correspondence-Oriented Imitation Learning: Flexible Visuomotor Control with 3D Conditioning

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visuomotor control approaches suffer from inflexible spatiotemporal task representations, hindering adaptation to diverse user intents and operational granularities. Method: We propose COIL, a unified framework that represents tasks as 3D keypoint motions, enabling variable spatiotemporal granularity (sparse or dense) in task specification; incorporates geometric correspondence-aware task encoding and spatio-temporal attention for adaptive fusion of multimodal inputs; and employs a scalable, simulation-based self-supervised learning pipeline that uses hindsight-backtracked pixel-level correspondence labels to train a conditional policy network. Contribution/Results: COIL achieves significant performance gains over state-of-the-art methods on real-world manipulation tasks and generalizes across tasks, objects, and motion patterns without task-specific fine-tuning or explicit demonstrations.

📝 Abstract
We introduce Correspondence-Oriented Imitation Learning (COIL), a conditional policy learning framework for visuomotor control with a flexible task representation in 3D. At the core of our approach, each task is defined by the intended motion of keypoints selected on objects in the scene. Instead of assuming a fixed number of keypoints or uniformly spaced time intervals, COIL supports task specifications with variable spatial and temporal granularity, adapting to different user intents and task requirements. To robustly ground this correspondence-oriented task representation into actions, we design a conditional policy with a spatio-temporal attention mechanism that effectively fuses information across multiple input modalities. The policy is trained via a scalable self-supervised pipeline using demonstrations collected in simulation, with correspondence labels automatically generated in hindsight. COIL generalizes across tasks, objects, and motion patterns, achieving superior performance compared to prior methods on real-world manipulation tasks under both sparse and dense specifications.
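The abstract describes tasks as the intended motion of selected keypoints, with variable spatial granularity (number of keypoints) and temporal granularity (number and spacing of waypoints). A minimal sketch of such a task specification is below; the class name, fields, and `displacement` helper are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KeypointMotionTask:
    """Hypothetical correspondence-oriented task spec (illustrative only).

    keypoints:  (K, 3) initial 3D positions selected on scene objects.
    waypoints:  (K, T, 3) target positions per keypoint; T varies per task
                (sparse: a few waypoints, dense: a full trajectory).
    timestamps: (T,) waypoint times; need not be uniformly spaced.
    """
    keypoints: np.ndarray
    waypoints: np.ndarray
    timestamps: np.ndarray

    def displacement(self) -> np.ndarray:
        # Net intended motion of each keypoint, start to final waypoint.
        return self.waypoints[:, -1, :] - self.keypoints


# Sparse specification: one keypoint, two waypoints at irregular times.
sparse = KeypointMotionTask(
    keypoints=np.array([[0.0, 0.0, 0.0]]),
    waypoints=np.array([[[0.0, 0.0, 0.1], [0.2, 0.0, 0.1]]]),
    timestamps=np.array([0.5, 2.0]),
)
print(sparse.displacement())  # net 3D motion of the single keypoint
```

A denser specification would simply carry more waypoints per keypoint; nothing in the representation fixes K or T, which is what permits the sparse-to-dense flexibility the abstract describes.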
Problem

Research questions and friction points this paper is trying to address.

Flexible visuomotor control with 3D task representation
Variable spatial and temporal granularity in task specifications
Robust correspondence-oriented policy via spatio-temporal attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keypoint motion defines flexible 3D task representation
Spatio-temporal attention fuses multimodal inputs for robust grounding
Self-supervised training with hindsight-generated correspondence labels
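The third point, hindsight-generated correspondence labels, can be sketched as follows: after a demonstration is recorded in simulation, keypoints chosen on an object at the end of the episode are backtracked through the logged object poses to recover their full 3D trajectories, which then label the demonstration as its own task specification. The function below is an assumed simplification for rigid objects with logged per-frame poses, not the paper's pipeline.

```python
import numpy as np


def hindsight_keypoint_labels(poses, keypoints_final):
    """Backtrack final-frame keypoints through logged rigid object poses.

    poses:           list of (R, t) world-from-object transforms, one per
                     frame, where R is (3, 3) and t is (3,).
    keypoints_final: (K, 3) keypoints selected on the object at the last
                     frame, in world coordinates.
    Returns (T, K, 3): keypoint world positions over the whole episode,
    usable in hindsight as the correspondence label for the demonstration.
    """
    R_last, t_last = poses[-1]
    # Express the chosen keypoints in the object's local frame
    # (R^{-1} = R^T for a rotation matrix; row-vector convention).
    local = (keypoints_final - t_last) @ R_last
    # Replay the logged poses to get each keypoint's position per frame.
    return np.stack([local @ R.T + t for R, t in poses])


# Toy episode: the object translates 1 m along x between two frames.
poses = [(np.eye(3), np.zeros(3)), (np.eye(3), np.array([1.0, 0.0, 0.0]))]
labels = hindsight_keypoint_labels(poses, np.array([[1.0, 0.0, 0.0]]))
print(labels)  # keypoint trajectory recovered in hindsight
```

Pairing these backtracked trajectories with the recorded actions yields (task, action) training examples without any human task annotation, which is what makes the pipeline self-supervised.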
Authors

Yunhao Cao, Cornell University
Zubin Bhaumik, Cornell University
Jessie Jia, Cornell University
Xingyi He, Zhejiang University (Computer Vision)
Kuan Fang, Assistant Professor, Cornell CS (Robotics, Machine Learning, Computer Vision)