A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses scene ambiguity and system latency in human-robot collaborative manipulation driven by natural language instructions by proposing a distributed conversational framework built on ROS 2. The framework integrates locally deployed large language models (LLMs) and vision-language models (VLMs) into the robot execution stack, with separate nodes that parse free-form instructions and generate structured action requests for pick, place, and handover. A VLM returns image-space targets, which are converted into metric goals in the robot’s coordinate frame using depth data and camera calibration. Language understanding, visual grounding, task orchestration, and motion execution run as decoupled ROS 2 nodes that can be deployed across distributed hardware, and an explicit operator confirmation step gates every motion for safety. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and response latency under increasing scene ambiguity and compare multiple LLM/VLM configurations within the same pipeline.
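The mapping from image space to the robot's coordinate system can be illustrated with a standard pinhole deprojection followed by a rigid transform. This is only a minimal sketch and not the authors' implementation: the `Intrinsics` class, the function names, the intrinsic values, and the camera-to-base transform below are all hypothetical, chosen for a camera assumed to look straight down at the table.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Point3:
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def to_robot_frame(p_cam: Point3, t_base_cam: Sequence[Sequence[float]]) -> Point3:
    """Apply a 4x4 row-major homogeneous transform (camera frame -> robot base)."""
    x, y, z = p_cam
    rows = t_base_cam[:3]  # the last row [0, 0, 0, 1] does not affect the point
    px, py, pz = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in rows)
    return (px, py, pz)


# Assumed values for illustration only: a 640x480 camera mounted 1.0 m above
# the robot base, offset 0.4 m along base x, with its optical axis pointing down.
K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
T_BASE_CAM = [
    [0.0, -1.0, 0.0, 0.4],
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

p_cam = deproject(380.0, 240.0, 0.5, K)      # pixel target plus depth reading
p_base = to_robot_frame(p_cam, T_BASE_CAM)   # metric goal in the robot frame
```

In a real deployment the intrinsics would come from the camera's calibration and the extrinsic transform from a hand-eye calibration procedure; the numbers here exist only to make the sketch runnable.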

📝 Abstract

This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local large language models (LLMs) and vision-language models (VLMs) with a Robot Operating System 2 (ROS 2)-based execution stack. Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, enabling flexible deployment across distributed hardware while maintaining a responsive control loop. From free-form user commands, the system generates structured action requests for pick, place, and handover. It uses a VLM to return image-space targets, which are converted into metric robot-frame goals using depth and calibration. A web dashboard exposes intermediate intent and grounding overlays (pixel, depth, and robot-frame) and requires explicit operator confirmation before any motion is executed. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and latency under increasing scene ambiguity on the working table and compare alternative LLM/VLM configurations in the same pipeline. Code and full documentation are available at [github.com/cogrob-tuni/franka-llm](https://github.com/cogrob-tuni/franka-llm).
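The combination of structured action requests and mandatory operator confirmation can be sketched as follows. The abstract does not specify the request format, so the JSON schema (`action`, `target`), the `ActionRequest` class, and the function names here are hypothetical; the sketch only shows one way a validated request could be blocked from execution until an operator approves it.

```python
import json
from dataclasses import dataclass, replace

# The three actions named in the abstract.
ALLOWED_ACTIONS = {"pick", "place", "handover"}


@dataclass(frozen=True)
class ActionRequest:
    """Hypothetical structured request produced from a free-form command."""
    action: str
    target: str
    confirmed: bool = False


def parse_action_request(llm_output: str) -> ActionRequest:
    """Validate JSON text (assumed schema) emitted by the language model."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    target = data.get("target")
    if not isinstance(target, str) or not target.strip():
        raise ValueError("missing target description")
    return ActionRequest(action=action, target=target.strip())


def confirm(request: ActionRequest) -> ActionRequest:
    """Return a copy marked as approved by the operator."""
    return replace(request, confirmed=True)


def execute(request: ActionRequest) -> str:
    """Refuse to act on any request the operator has not confirmed."""
    if not request.confirmed:
        raise PermissionError("operator confirmation required before motion")
    return f"executing {request.action} on {request.target}"


request = parse_action_request('{"action": "pick", "target": "red cup"}')
result = execute(confirm(request))
```

Keeping the request immutable and producing a confirmed copy means an unconfirmed request cannot be altered in place to bypass the gate, which is one reasonable design choice for a safety check of this kind.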

Problem

Research questions and friction points this paper is trying to address.

human-robot collaboration

conversational framework

vision-language models

natural language understanding

collaborative manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

distributed conversational framework

vision-language models

ROS 2