🤖 AI Summary
This study addresses the clinical challenge that diagnosis of vocal fold paralysis (VFP) is subjective and experience-dependent, lacking objective quantitative metrics. We propose the Multimodal Laryngoscopic Video Analyzing System (MLVAS), comprising four key innovations: (1) audio keyword–driven keyframe localization; (2) HSV-based temporal fluctuation analysis for stroboscopic frame identification; (3) a two-stage glottis segmentation framework integrating U-Net and diffusion models; and (4) a laterality-quantification model based on midline angle deviation and the interhemispheric vibration variance ratio. On public benchmarks, MLVAS achieves significantly improved segmentation accuracy. Evaluated on real-world clinical laryngoscopic videos, it attains high classification accuracy for VFP detection and provides interpretable, visually grounded dynamic assessment metrics—such as spatiotemporal vibration asymmetry indices—enabling objective, quantitative diagnosis of unilateral VFP. This work establishes a novel paradigm for data-driven, multimodal functional assessment in laryngology.
📝 Abstract
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both modalities, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Additionally, MLVAS features an advanced strobing video extraction module that identifies strobing frames in laryngeal videostroboscopy by analyzing fluctuations in hue, saturation, and value. Beyond key segment extraction, MLVAS provides effective metrics for Vocal Fold Paralysis (VFP) detection. It employs a novel two-stage glottis segmentation process: a U-Net produces an initial segmentation, which a diffusion-based refinement then cleans of false positives, yielding better segmentation masks for downstream tasks. From the segmented glottis masks, MLVAS estimates the vibration dynamics of the left and right vocal folds to detect unilateral VFP, measuring angle deviation from the estimated glottal midline. By comparing the variance of the left and right folds' dynamics, the system effectively distinguishes between left and right VFP. We conducted several ablation studies to demonstrate the effectiveness of each module in the proposed MLVAS. Experimental results on a public segmentation dataset show the effectiveness of our segmentation module. In addition, VFP classification results on a real-world clinical dataset demonstrate MLVAS's ability to provide reliable, objective metrics as well as visualizations for assisted clinical diagnosis.
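The abstract describes the strobing-frame extraction only at a high level. A minimal sketch of the underlying idea, assuming per-frame mean HSV statistics have already been computed from the video (the z-score rule, threshold, and function name are illustrative, not from the paper):

```python
import numpy as np

def detect_strobing_frames(hsv_means: np.ndarray, z_thresh: float = 2.0) -> np.ndarray:
    """Flag frame transitions whose HSV fluctuation is anomalously large.

    hsv_means: (T, 3) array of per-frame mean hue, saturation, and value.
    Returns a boolean mask of length T-1 marking candidate strobing transitions,
    since strobing illumination produces abrupt frame-to-frame jumps.
    """
    # Frame-to-frame absolute change in each HSV channel.
    diffs = np.abs(np.diff(hsv_means, axis=0))   # shape (T-1, 3)
    # Collapse the three channels into one fluctuation score per transition.
    score = diffs.sum(axis=1)                    # shape (T-1,)
    # Standardize and threshold: strobing transitions stand out as outliers.
    z = (score - score.mean()) / (score.std() + 1e-8)
    return z > z_thresh
```

In practice the per-frame HSV means would come from converting each video frame to HSV (e.g. with OpenCV's `cv2.cvtColor`) and averaging over pixels; the sketch assumes that preprocessing is done.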
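The laterality decision rests on comparing the variance of the left and right folds' vibration dynamics: a paralyzed fold barely vibrates, so its displacement variance is far smaller than the healthy side's. A hypothetical sketch of that comparison, assuming per-frame displacement traces relative to the glottal midline are available (the variance-ratio rule, threshold, and names are illustrative):

```python
import numpy as np

def classify_vfp_side(left_disp: np.ndarray, right_disp: np.ndarray,
                      ratio_thresh: float = 2.0) -> str:
    """Classify unilateral VFP from per-frame vocal-fold displacement traces.

    left_disp, right_disp: 1-D arrays of each fold's lateral displacement
    relative to the estimated glottal midline, one value per frame.
    The side whose displacement variance is much smaller is flagged
    as the paralyzed (near-static) fold.
    """
    var_l, var_r = np.var(left_disp), np.var(right_disp)
    if var_l < var_r / ratio_thresh:
        return "left VFP"        # left fold nearly static
    if var_r < var_l / ratio_thresh:
        return "right VFP"       # right fold nearly static
    return "no unilateral VFP"   # comparable vibration on both sides
```

The displacement traces themselves would be derived from the segmented glottis masks, e.g. as the angle deviation of each fold edge from the fitted midline per frame.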