LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating semantically consistent robot manipulation trajectories from a single image guided by natural language instructions, enabling open-loop language-conditioned control. To this end, the authors propose LILAC, a novel optical flow–based approach that fuses RGB images with textual commands to produce object-centric 2D optical flow, which is subsequently converted into 6-DoF robotic arm trajectories. Key innovations include a semantic alignment loss to enhance consistency between language and optical flow, and a prompt-conditioned cross-modal adapter that effectively integrates visual prompts with joint vision-language features. Experimental results demonstrate that LILAC generates higher-quality optical flow than existing methods across multiple benchmarks and significantly improves task success rates in real-world robotic manipulation under free-form natural language instructions.

📝 Abstract
We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging, as object trajectory generation from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment. To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action model (VLA) generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts the flow into a 6-DoF manipulator trajectory. LILAC incorporates two key components: Semantic Alignment Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features to provide rich cues for flow generation. Experimentally, our method outperformed existing approaches in generated flow quality across multiple benchmarks. Furthermore, in physical object manipulation experiments using free-form instructions, LILAC demonstrated a superior task success rate compared to existing methods. The project page is available at https://lilac-75srg.kinsta.page/.
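The abstract's conversion of object-centric 2D flow into a manipulator trajectory can be illustrated with a minimal sketch, which is not the paper's implementation: assuming a pinhole camera with known intrinsics and a per-step depth estimate (all values below are hypothetical), the 2D flow track of an object keypoint back-projects to a sequence of camera-frame 3D waypoints; the orientation component of the full 6-DoF trajectory would come from a separate estimate.

```python
# Minimal sketch (not LILAC's actual pipeline): back-project a 2D flow
# track into camera-frame 3D waypoints with a pinhole camera model.
# Intrinsics, depths, and the flow track are made-up illustrative values.
import numpy as np

def flow_to_waypoints(track_uv, depths, fx, fy, cx, cy):
    """Back-project pixel positions (u, v) with per-step depth z
    into camera-frame 3D points (x, y, z)."""
    track_uv = np.asarray(track_uv, dtype=float)
    depths = np.asarray(depths, dtype=float)
    x = (track_uv[:, 0] - cx) / fx * depths
    y = (track_uv[:, 1] - cy) / fy * depths
    return np.stack([x, y, depths], axis=1)

# Hypothetical 3-step flow track for one object keypoint.
track = [(320.0, 240.0), (330.0, 238.0), (345.0, 232.0)]
z = [0.60, 0.59, 0.57]  # assumed per-step depth in meters
waypoints = flow_to_waypoints(track, z, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

In practice the flow field covers many object pixels rather than a single keypoint, and the resulting waypoints would be transformed from the camera frame into the robot base frame before execution.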
Problem

Research questions and friction points this paper is trying to address.

language-conditioned manipulation
object-centric optical flow
trajectory generation
vision-language-action
instruction-flow alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Conditioned Manipulation
Object-Centric Optical Flow
Vision-Language-Action Model
Semantic Alignment Loss
Cross-Modal Adapter
Motonari Kambara
Keio University
Koki Seno
Keio University
Tomoya Kaichi
KDDI Research Inc.
Yanan Wang
KDDI Research Inc.
Komei Sugiura
Professor, Keio University
Multimodal AI · Robot Learning · Embodied AI · Machine Learning