Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

📅 2024-09-05
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study addresses the clinical challenge of subjective, experience-dependent diagnosis of vocal fold paralysis (VFP), lacking objective quantitative metrics. We propose a Multimodal Laryngoscopic Video Analysis System (MLVAS) comprising four key innovations: (1) audio keyword–driven keyframe localization; (2) HSV-based temporal fluctuation analysis for stroboscopic frame identification; (3) a two-stage glottis segmentation framework integrating U-Net and diffusion models; and (4) a laterality-quantification model based on midline angle deviation and interhemispheric vibration variance ratio. On public benchmarks, MLVAS achieves significantly improved segmentation accuracy. Evaluated on real-world clinical laryngoscopic videos, it attains high classification accuracy for VFP detection and provides interpretable, visually grounded dynamic assessment metrics—such as spatiotemporal vibration asymmetry indices—enabling objective, quantitative diagnosis of unilateral VFP. This work establishes a novel paradigm for data-driven, multimodal functional assessment in laryngology.

📝 Abstract
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Additionally, MLVAS features an advanced strobing video extraction module that specifically identifies strobing frames from laryngeal videostroboscopy by analyzing hue, saturation, and value fluctuations. Beyond key segment extraction, MLVAS provides effective metrics for Vocal Fold Paralysis (VFP) detection. It employs a novel two-stage glottis segmentation process using a U-Net for initial segmentation, followed by diffusion-based refinement to reduce false positives, yielding better segmentation masks for downstream tasks. MLVAS estimates the vibration dynamics of both the left and right vocal folds from the segmented glottis masks and detects unilateral VFP by measuring the angle deviation from the estimated glottal midline. By comparing the variance between the left and right folds' dynamics, the system effectively distinguishes left from right VFP. We conducted several ablation studies to demonstrate the effectiveness of each module in the proposed MLVAS. Experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, VFP classification results on a real-world clinic dataset demonstrate MLVAS's ability to provide reliable and objective metrics, as well as visualizations, for assisted clinical diagnosis.
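The strobing-frame extraction described in the abstract can be illustrated with a minimal sketch: track per-frame mean hue/saturation/value and flag frames whose frame-to-frame fluctuation is abnormally large. The z-score threshold and synthetic data below are assumptions for illustration, not the paper's calibrated procedure.

```python
import numpy as np

def find_strobing_frames(hsv_frames, z_thresh=2.0):
    """Flag strobing frames via large frame-to-frame HSV fluctuations.

    hsv_frames: array of shape (T, H, W, 3) holding H, S, V channels.
    z_thresh: hypothetical z-score cutoff (an assumption, not the
    paper's value). Returns indices of frames after an abrupt change.
    """
    # Per-frame mean of each HSV channel -> (T, 3)
    means = hsv_frames.reshape(len(hsv_frames), -1, 3).mean(axis=1)
    # Magnitude of change between consecutive frames -> (T-1,)
    diffs = np.linalg.norm(np.diff(means, axis=0), axis=1)
    # Standardize fluctuations and threshold
    z = (diffs - diffs.mean()) / (diffs.std() + 1e-8)
    return np.where(z > z_thresh)[0] + 1  # frame index after the jump

# Synthetic example: stable frames with one abrupt brightness flash
rng = np.random.default_rng(0)
frames = rng.normal(0.5, 0.01, size=(20, 8, 8, 3))
frames[10, ..., 2] += 0.5  # sudden V (brightness) spike at frame 10
print(find_strobing_frames(frames))  # -> [10 11]
```

In a real pipeline the HSV frames would come from converting laryngoscopic video frames (e.g. with OpenCV's `cvtColor`); the sketch skips decoding and works on arrays directly.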
Problem

Research questions and friction points this paper is trying to address.

Automatically extract key segments from laryngeal videos
Detect Vocal Fold Paralysis using audio-visual features
Improve glottis segmentation accuracy with diffusion refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal audio-video analysis for VFP diagnosis
Diffusion-based refinement improves glottis segmentation
Angle deviation measures vocal fold movement
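The laterality idea behind the last two bullets can be sketched in a toy form: a paralyzed fold barely vibrates, so the variance of its angle deviation from the glottal midline is much smaller than the healthy side's. The variance-ratio cutoff and sinusoidal test signal are illustrative assumptions, not the paper's calibrated metric.

```python
import numpy as np

def classify_vfp_side(left_angles, right_angles, ratio_thresh=3.0):
    """Toy unilateral-VFP test from per-frame angle deviations.

    left_angles / right_angles: angle deviation (degrees) of each fold
    edge relative to the estimated glottal midline, one value per frame.
    ratio_thresh: hypothetical variance-ratio cutoff (an assumption).
    """
    var_l, var_r = np.var(left_angles), np.var(right_angles)
    if var_l * ratio_thresh < var_r:
        return "left VFP"    # left fold is nearly static
    if var_r * ratio_thresh < var_l:
        return "right VFP"   # right fold is nearly static
    return "inconclusive"

# Synthetic cycles: healthy right fold oscillates, paralyzed left is flat
t = np.linspace(0, 4 * np.pi, 200)
left = 0.3 * np.sin(t) + 1.0   # barely moving
right = 8.0 * np.sin(t) + 1.0  # normal vibration amplitude
print(classify_vfp_side(left, right))  # -> left VFP
```

In MLVAS the per-frame angles would be derived from the refined glottis segmentation masks; the sketch starts from those angle series directly.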
Authors
Yucong Zhang, Ph.D. Student in CS, Wuhan University
Xin Zou, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou 510000, China
Jinshan Yang, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou 510000, China
Wenjun Chen, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou 510000, China
Juan Liu, Wuhan University (Data Mining, Artificial Intelligence in Bioinformatics, Biomedicine)
Faya Liang, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou 510000, China
Ming Li, School of Computer Science, Wuhan University, Wuhan 430072, China; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Data Science Research Center, Duke Kunshan University, Suzhou 215316, China