🤖 AI Summary
This study addresses the clinical challenge that diagnosis of vocal fold paralysis (VFP) is subjective and experience-dependent, lacking objective quantitative metrics. We propose the Multimodal Laryngoscopic Video Analyzing System (MLVAS), comprising four key innovations: (1) audio keyword–driven keyframe localization; (2) HSV-based temporal fluctuation analysis for stroboscopic frame identification; (3) a two-stage glottis segmentation framework integrating U-Net and diffusion models; and (4) a laterality-quantification model based on midline angle deviation and the interhemispheric vibration variance ratio. On public benchmarks, MLVAS achieves significantly improved segmentation accuracy. Evaluated on real-world clinical laryngoscopic videos, it attains high classification accuracy for VFP detection and provides interpretable, visually grounded dynamic assessment metrics—such as spatiotemporal vibration asymmetry indices—enabling objective, quantitative diagnosis of unilateral VFP. This work establishes a novel paradigm for data-driven, multimodal functional assessment in laryngology.
📝 Abstract
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both modalities, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Additionally, MLVAS features an advanced strobing video extraction module that identifies strobing frames in laryngeal videostroboscopy by analyzing fluctuations in hue, saturation, and value. Beyond key segment extraction, MLVAS provides effective metrics for Vocal Fold Paralysis (VFP) detection. It employs a novel two-stage glottis segmentation process: a U-Net produces an initial segmentation, which a diffusion-based refinement then cleans of false positives, yielding better segmentation masks for downstream tasks. From the segmented glottis masks, MLVAS estimates the vibration dynamics of the left and right vocal folds to detect unilateral VFP, measuring angle deviation from the estimated glottal midline. By comparing the variance of the left and right folds' dynamics, the system effectively distinguishes between left and right VFP. We conducted several ablation studies to demonstrate the effectiveness of each module in the proposed MLVAS. Experimental results on a public segmentation dataset show the effectiveness of our segmentation module. In addition, VFP classification results on a real-world clinical dataset demonstrate MLVAS's ability to provide reliable, objective metrics as well as visualizations for assisted clinical diagnosis.
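The abstract describes the strobing-frame extraction only at a high level. A minimal sketch of the underlying idea, assuming per-frame mean HSV statistics have already been computed from the video (the z-score rule, threshold, and function name are illustrative, not from the paper):

```python
import numpy as np

def detect_strobing_frames(hsv_means: np.ndarray, z_thresh: float = 2.0) -> np.ndarray:
    """Flag frame transitions whose HSV fluctuation is anomalously large.

    hsv_means: (T, 3) array of per-frame mean hue, saturation, and value.
    Returns a boolean mask of length T-1 marking candidate strobing transitions,
    since strobing illumination produces abrupt frame-to-frame jumps.
    """
    # Frame-to-frame absolute change in each HSV channel.
    diffs = np.abs(np.diff(hsv_means, axis=0))   # shape (T-1, 3)
    # Collapse the three channels into one fluctuation score per transition.
    score = diffs.sum(axis=1)                    # shape (T-1,)
    # Standardize and threshold: strobing transitions stand out as outliers.
    z = (score - score.mean()) / (score.std() + 1e-8)
    return z > z_thresh
```

In practice the per-frame HSV means would come from converting each video frame to HSV (e.g. with OpenCV's `cv2.cvtColor`) and averaging over pixels; the sketch assumes that preprocessing is done.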
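The laterality decision rests on comparing the variance of the left and right folds' vibration dynamics: a paralyzed fold barely vibrates, so its displacement variance is far smaller than the healthy side's. A hypothetical sketch of that comparison, assuming per-frame displacement traces relative to the glottal midline are available (the variance-ratio rule, threshold, and names are illustrative):

```python
import numpy as np

def classify_vfp_side(left_disp: np.ndarray, right_disp: np.ndarray,
                      ratio_thresh: float = 2.0) -> str:
    """Classify unilateral VFP from per-frame vocal-fold displacement traces.

    left_disp, right_disp: 1-D arrays of each fold's lateral displacement
    relative to the estimated glottal midline, one value per frame.
    The side whose displacement variance is much smaller is flagged
    as the paralyzed (near-static) fold.
    """
    var_l, var_r = np.var(left_disp), np.var(right_disp)
    if var_l < var_r / ratio_thresh:
        return "left VFP"        # left fold nearly static
    if var_r < var_l / ratio_thresh:
        return "right VFP"       # right fold nearly static
    return "no unilateral VFP"   # comparable vibration on both sides
```

The displacement traces themselves would be derived from the segmented glottis masks, e.g. as the angle deviation of each fold edge from the fitted midline per frame.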