Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic Cued Speech recognition faces challenges from the temporal asynchrony between hand gestures and lip movements, which complicates multimodal fusion, and is further hindered by scarce labeled data. Method: We propose the first collaborative multi-agent system for this task, integrating keyframe screening, expert prompting strategies, and a training-free dynamic fusion mechanism. Our approach synergistically combines a multimodal large language model for hand-gesture recognition, a pretrained Transformer-based lip-reading model, and a self-correcting phoneme-to-word conversion module. This enables, for the first time, end-to-end semantic error correction and natural sentence generation directly from phoneme-level outputs. Contribution/Results: Evaluated on our newly expanded Mandarin Cued Speech dataset comprising fourteen subjects, we achieve significant improvements over existing SOTA methods in both normal and hearing-impaired scenarios. The code is publicly available.

📝 Abstract
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.
Problem

Research questions and friction points this paper is trying to address.

Addresses automatic Cued Speech Recognition with multi-agent collaboration
Overcomes temporal asynchrony in hand-lip multimodal fusion
Enhances performance despite limited training data availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system for CS recognition
Keyframe screening with expert prompts
Training-free dynamic multimodal fusion
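The abstract does not spell out how the Hand Prompt Decoding agent combines modalities; purely as a hypothetical sketch (not the authors' design), a training-free fusion could bias the lip model's per-frame phoneme logits with hand-derived phoneme prompts at inference time via a log-linear interpolation. The function name, tensor shapes, and the `alpha` weight below are all illustrative assumptions:

```python
import numpy as np

def fuse_hand_prompts(lip_logits, hand_prompt, alpha=0.5):
    """Hypothetical training-free fusion: bias the lip model's
    per-frame phoneme logits with hand-gesture phoneme prompts.

    lip_logits:  (T, V) per-frame phoneme logits from the lip model.
    hand_prompt: (T, V) one-hot or soft phoneme hypotheses decoded
                 from hand gestures (illustrative format).
    alpha:       prompt strength; an assumed hyperparameter, since no
                 weights are learned in a training-free scheme.
    """
    # Log-linear interpolation: add a scaled prompt to the logits.
    fused = lip_logits + alpha * hand_prompt
    # Per-frame softmax to turn fused logits into phoneme probabilities.
    e = np.exp(fused - fused.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With a strong enough prompt, the hand hypothesis can flip an otherwise ambiguous lip-reading decision without retraining either model, which is the appeal of a training-free scheme under scarce labeled data.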
Guanjie Huang
The Hong Kong University of Science and Technology (Guangzhou)
Computer Science
Danny H.K. Tsang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Shan Yang
Tencent AI Lab, Shenzhen, China
Guangzhi Lei
Tencent AI Lab, Shenzhen, China
Li Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China