Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic Cued Speech recognition faces challenges from the temporal asynchrony between hand gestures and lip movements, which complicates multimodal fusion, and is further hindered by scarce labeled data. Method: We propose the first collaborative multi-agent system for this task, integrating keyframe screening, expert prompting strategies, and a training-free dynamic fusion mechanism. Our approach synergistically combines a multimodal large language model for hand-gesture recognition, a pretrained Transformer-based lip-reading model, and a self-correcting phoneme-to-word conversion module. This enables, for the first time, end-to-end semantic error correction and natural sentence generation directly from phoneme-level outputs. Contribution/Results: Evaluated on our newly expanded Mandarin Cued Speech dataset comprising fourteen subjects, we achieve significant improvements over existing SOTA methods in both normal and hearing-impaired scenarios. The code is publicly available.

📝 Abstract
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.
Problem

Research questions and friction points this paper is trying to address.

Addresses automatic Cued Speech Recognition with multi-agent collaboration
Overcomes temporal asynchrony in hand-lip multimodal fusion
Enhances performance despite limited training data availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system for CS recognition
Keyframe screening with expert prompts
Training-free dynamic multimodal fusion
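The abstract does not spell out how the Hand Prompt Decoding agent combines modalities; purely as a hypothetical sketch (not the authors' design), a training-free fusion could bias the lip model's per-frame phoneme logits with hand-derived phoneme prompts at inference time via a log-linear interpolation. The function name, tensor shapes, and the `alpha` weight below are all illustrative assumptions:

```python
import numpy as np

def fuse_hand_prompts(lip_logits, hand_prompt, alpha=0.5):
    """Hypothetical training-free fusion: bias the lip model's
    per-frame phoneme logits with hand-gesture phoneme prompts.

    lip_logits:  (T, V) per-frame phoneme logits from the lip model.
    hand_prompt: (T, V) one-hot or soft phoneme hypotheses decoded
                 from hand gestures (illustrative format).
    alpha:       prompt strength; an assumed hyperparameter, since no
                 weights are learned in a training-free scheme.
    """
    # Log-linear interpolation: add a scaled prompt to the logits.
    fused = lip_logits + alpha * hand_prompt
    # Per-frame softmax to turn fused logits into phoneme probabilities.
    e = np.exp(fused - fused.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With a strong enough prompt, the hand hypothesis can flip an otherwise ambiguous lip-reading decision without retraining either model, which is the appeal of a training-free scheme under scarce labeled data.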
Guanjie Huang
The Hong Kong University of Science and Technology (Guangzhou)
Computer Science
Danny H.K. Tsang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Shan Yang
Tencent AI Lab, Shenzhen, China
Guangzhi Lei
Tencent AI Lab, Shenzhen, China
Li Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China