🤖 AI Summary
This work addresses the problem that vision-language-action (VLA) models struggle with high-precision, force-sensitive tasks because their outputs are low-rate and too unreliable for direct force control. The authors propose a decoupled architecture, PaCo-VLA, in which VLA outputs serve only as task-level compliance proposals, while a passivity shield based on energy-tank accounting and high-frequency boundary checks governs what reaches the contact dynamics. The shield keeps the admittance port passive during execution, and the decoupling enables causal evaluation that isolates semantic contributions from geometric shortcuts. Evaluated on simulated and real-world connector insertion and extraction tasks, the system sustains zero passivity violations under adversarial compliance shifts and achieves higher precision than unshielded VLA baselines.
📝 Abstract
Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.
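The shielding idea described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the `Proposal` and `Shield` names, every numeric limit, and the specific tank rule (charging the tank for the energy a stiffness increase injects at the current displacement, as in common energy-tank variable-stiffness schemes) are assumptions chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical low-rate compliance proposal produced by the VLA."""
    stiffness: float   # N/m
    damping: float     # N*s/m
    timestamp: float   # s, time at which the proposal was generated


@dataclass
class Shield:
    """Illustrative 1-D passivity shield; all limits are assumed values."""
    k_min: float = 50.0
    k_max: float = 2000.0
    d_min: float = 5.0
    d_max: float = 200.0
    max_age: float = 0.2     # proposals older than this count as stale, s
    tank: float = 0.5        # energy currently stored in the tank, J
    tank_max: float = 2.0    # cap on stored energy, J
    tank_min: float = 0.05   # reserve that must always remain, J
    k: float = 200.0         # active stiffness (safe default)
    d: float = 40.0          # active damping (safe default)

    def valid(self, p: Proposal, now: float) -> bool:
        """Boundary checks: finite values, fresh timestamp, gains in range."""
        if not all(math.isfinite(v) for v in (p.stiffness, p.damping, p.timestamp)):
            return False
        age = now - p.timestamp
        if age < 0.0 or age > self.max_age:
            return False
        return (self.k_min <= p.stiffness <= self.k_max
                and self.d_min <= p.damping <= self.d_max)

    def step(self, p, now, x, v, dt):
        """Run one high-rate cycle and return the (stiffness, damping) to apply."""
        # Power dissipated by the active damper refills the tank, up to the cap.
        self.tank = min(self.tank_max, self.tank + self.d * v * v * dt)
        if p is None or not self.valid(p, now):
            return self.k, self.d  # hold the last accepted gains
        # Raising stiffness at displacement x injects 0.5 * dk * x^2 of energy.
        cost = 0.5 * (p.stiffness - self.k) * x * x
        if cost > 0.0:
            if self.tank - cost < self.tank_min:
                return self.k, self.d  # tank cannot pay: reject the proposal
            self.tank -= cost
        # Energy released by lowering stiffness is conservatively not credited.
        self.k, self.d = p.stiffness, p.damping
        return self.k, self.d
```

In this sketch `step` would be called by the high-rate control loop on every cycle, whether or not a new proposal has arrived. The check itself depends only on the proposal's numeric values, its age, and the tank state, not on how the proposal was produced, which mirrors the proposal-independent role the abstract assigns to the shield. Holding the last accepted gains on rejection is one possible fallback; reverting to the safe defaults would be an equally plausible design.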