HI-TransPA: Hearing Impairments Translation Personal Assistant

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses daily communication barriers faced by individuals with hearing impairments due to atypical speech articulation. We propose an instruction-driven audiovisual personal assistant specifically designed for this population. Methodologically, we adopt the Omni-Model paradigm and introduce a multimodal preprocessing framework tailored to hearing-impaired speech, featuring facial landmark-guided lip stabilization, quality-aware curriculum learning, and a novel unified 3D-Resampler for robust fusion of ambiguous acoustic signals and dynamic lip movements. Evaluated on our newly constructed HI-Dialogue dataset, the model achieves state-of-the-art performance in semantic fidelity and literal accuracy. Key contributions include: (1) the first end-to-end joint modeling paradigm for hearing-impaired speech-to-text translation and dialogue understanding; (2) a reusable multimodal preprocessing toolkit; and (3) a systematic technical framework for robust modeling of atypical articulation.

📝 Abstract
Hearing-impaired individuals often face significant barriers in daily communication due to the inherent challenges of producing clear speech. To address this, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with lip dynamics, enabling both translation and dialogue within a single multimodal framework. To address the distinctive pronunciation patterns of hearing-impaired speech and the limited adaptability of existing models, we develop a multimodal preprocessing and curation pipeline that detects facial landmarks, stabilizes the lip region, and quantitatively evaluates sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. Architecturally, we employ a novel unified 3D-Resampler to efficiently encode the lip dynamics, which is critical for accurate interpretation. Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. Our work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
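The landmark-guided lip stabilization described in the abstract can be illustrated with a minimal sketch: align each frame's two mouth-corner landmarks onto fixed canonical positions with a similarity transform (rotation, uniform scale, translation), so the cropped lip region stays stable across frames. The canonical coordinates, landmark choice, and crop size below are illustrative assumptions, not the paper's exact setup.

```python
import math

# Target mouth-corner positions in a hypothetical 128x128 lip crop.
CANON_LEFT, CANON_RIGHT = (40.0, 64.0), (88.0, 64.0)

def similarity_transform(left, right):
    """Return (scale, angle, tx, ty) mapping the observed mouth
    corners onto the canonical ones."""
    vx, vy = right[0] - left[0], right[1] - left[1]
    cx, cy = CANON_RIGHT[0] - CANON_LEFT[0], CANON_RIGHT[1] - CANON_LEFT[1]
    scale = math.hypot(cx, cy) / math.hypot(vx, vy)
    angle = math.atan2(cy, cx) - math.atan2(vy, vx)
    # Rotate and scale the observed left corner, then translate it onto CANON_LEFT.
    rx = scale * (math.cos(angle) * left[0] - math.sin(angle) * left[1])
    ry = scale * (math.sin(angle) * left[0] + math.cos(angle) * left[1])
    return scale, angle, CANON_LEFT[0] - rx, CANON_LEFT[1] - ry

def apply_transform(params, pt):
    """Map an image point through the estimated similarity transform."""
    s, a, tx, ty = params
    return (s * (math.cos(a) * pt[0] - math.sin(a) * pt[1]) + tx,
            s * (math.sin(a) * pt[0] + math.cos(a) * pt[1]) + ty)
```

Applying `apply_transform` to every pixel coordinate (or, in practice, warping the frame with the inverse transform) yields a lip crop in which the mouth corners sit at the same pixels in every frame.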
Problem

Research questions and friction points this paper is trying to address.

Translating indistinct hearing-impaired speech using audio-visual fusion
Addressing unique pronunciation patterns with multimodal preprocessing
Enhancing model robustness through curriculum learning on quality-scored data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses indistinct speech with lip dynamics
Employs a unified 3D-Resampler for lip encoding
Uses curriculum learning guided by quality scores
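The quality-score-guided curriculum above can be sketched as a staged training schedule: rank samples by quality and release progressively larger pools, so early stages see only clean, high-confidence samples and later stages include the harder ones. The scoring field, stage count, and linear cutoff schedule here are assumptions for illustration, not the paper's exact recipe.

```python
def curriculum_stages(samples, num_stages=3):
    """Order samples by descending quality score and return one training
    pool per stage: stage 0 holds only the cleanest samples, and the
    final stage includes the full dataset."""
    ranked = sorted(samples, key=lambda s: s["quality"], reverse=True)
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = round(len(ranked) * k / num_stages)  # linear growth schedule
        stages.append(ranked[:cutoff])
    return stages

# Toy example: four samples with hypothetical preprocessing quality scores.
samples = [
    {"id": "a", "quality": 0.95},
    {"id": "b", "quality": 0.40},
    {"id": "c", "quality": 0.80},
    {"id": "d", "quality": 0.60},
]
stages = curriculum_stages(samples, num_stages=2)
# stage 0 contains the two highest-quality samples; stage 1 contains all four
```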
Zhiming Ma
SmartFlowAI Research, Shanghai, China
Shiyu Gan
Tongji University, Shanghai, China
Junhao Zhao
SmartFlowAI Research, Shanghai, China
Xianming Li
PhD candidate@PolyU, Baking@Mixedbread, Ex Algorithm Engineer@Alipay
Qingyun Pan
BUPT, Beijing, China
Peidong Wang
Northeastern University, Shenyang, China
Mingjun Pan
Peking University, Beijing, China
Yuhao Mo
CMIC, Guangzhou, China
Jiajie Cheng
CMIC, Guangzhou, China
Chengxin Chen
CMIC, Guangzhou, China
Zhonglun Cao
CMIC, Guangzhou, China
Chonghan Liu
Qiyuan Tech, Beijing, China
Shi Cheng
SmartFlowAI Research, Shanghai, China