CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LVEF estimation methods rely heavily on large-scale annotated echocardiographic video datasets and suffer from poor generalizability; meanwhile, prevailing vision-language models neglect temporal dynamics and fine-grained cardiac anatomical structures. To address these limitations, we propose the first video-level vision-language adaptation framework tailored for few-shot echocardiographic video analysis. Our approach comprises three key innovations: (1) a Multi-Frame Learning (MFL) attention mechanism that enables selective fusion of salient frames to explicitly model temporal dynamics; (2) EchoZoom, a multi-resolution input strategy that enhances local ventricular structural representation; and (3) the first successful adaptation of the CLIP architecture to few-shot LVEF prediction from echocardiographic videos. Evaluated on EchoNet-Dynamic, our method achieves a 2.07 reduction in mean absolute error (MAE) under the 1-shot setting, significantly improving diagnostic accuracy. The source code is publicly available.
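The paper does not publish the exact MFL formulation here, but its attention-based frame fusion can be sketched as a learned query scoring each frame embedding, a softmax over frames, and a weighted sum. The function and variable names below (`attention_frame_fusion`, `query`) are illustrative assumptions, not the authors' implementation; a minimal NumPy sketch:

```python
import numpy as np

def attention_frame_fusion(frame_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Fuse per-frame features into one video-level feature.

    frame_feats: (T, D) array of frame embeddings (e.g., from a CLIP image encoder).
    query: (D,) query vector that scores frame salience (learned in practice).
    Returns a (D,) attention-weighted video embedding.
    """
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])  # (T,) scaled scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                      # softmax over frames
    return weights @ frame_feats                                  # (D,) weighted sum

# Toy example: 8 frames of 512-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 512))
q = rng.normal(size=512)
video_feat = attention_frame_fusion(feats, q)
print(video_feat.shape)  # (512,)
```

With a zero query the weights reduce to a uniform average, so plain mean pooling is the degenerate case; the learned query is what lets the model up-weight diagnostically salient frames (e.g., end-systole and end-diastole).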

📝 Abstract
Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi-Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.
Problem

Research questions and friction points this paper is trying to address.

Predicting LVEF from echocardiogram videos with limited annotated data
Capturing temporal dynamics in cardiac videos for accurate diagnosis
Improving spatial representation of cardiac structures in video analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-based frame fusion mechanism for temporal dynamics
Multi-scale feature extraction refining cardiac spatial representations
Video-based CLIP adaptation for few-shot echocardiogram analysis
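The multi-scale EchoZoom idea can likewise be sketched as encoding both the full frame and zoomed-in central crops, then combining the embeddings. Everything below (`center_crop`, `echozoom_features`, the toy encoder, the crop fractions) is a hypothetical illustration under the assumption that EchoZoom averages features across input scales, not the paper's actual pipeline:

```python
import numpy as np

def center_crop(frame: np.ndarray, frac: float) -> np.ndarray:
    """Crop the central `frac` fraction of an H x W frame."""
    h, w = frame.shape
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ch) // 2, (w - cw) // 2
    return frame[top:top + ch, left:left + cw]

def echozoom_features(frame: np.ndarray, encoder, fracs=(1.0, 0.5)) -> np.ndarray:
    """Encode the full frame plus zoomed central crops, then average the features."""
    feats = [encoder(center_crop(frame, f)) for f in fracs]
    return np.mean(feats, axis=0)

# Stand-in encoder returning simple intensity statistics;
# a real model would use a CLIP vision tower here.
def toy_encoder(x):
    return np.array([x.mean(), x.std(), x.max(), x.min()])

frame = np.arange(112 * 112, dtype=float).reshape(112, 112)
print(echozoom_features(frame, toy_encoder).shape)  # (4,)
```

The design intuition: the zoomed crop concentrates resolution on the left ventricle near the image center, while the full-frame view preserves global context, and fusing both sharpens local structural representation.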