🤖 AI Summary
Existing LVEF estimation methods rely heavily on large-scale annotated echocardiographic video datasets and suffer from poor generalizability; meanwhile, prevailing vision-language models neglect temporal dynamics and fine-grained cardiac anatomical structures. To address these limitations, we propose the first video-level vision-language adaptation framework tailored for few-shot echocardiographic video analysis. Our approach comprises three key innovations: (1) a Multi-Frame Learning (MFL) attention mechanism that enables selective fusion of salient frames to explicitly model temporal dynamics; (2) EchoZoom, a multi-resolution input strategy that enhances local ventricular structural representation; and (3) the first successful adaptation of the CLIP architecture to few-shot LVEF prediction from echocardiographic videos. Evaluated on EchoNet-Dynamic, our method achieves a 2.07 reduction in mean absolute error (MAE) under the 1-shot setting, significantly improving diagnostic accuracy. The source code is publicly available.
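The MFL attention mechanism described above (selective fusion of salient frames into one video-level representation) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the single learned query vector, and the embedding sizes are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_frames(frame_feats, query):
    """Attention-weighted fusion of per-frame embeddings.

    frame_feats: (T, D) array of frame embeddings (e.g. from a CLIP image encoder)
    query: (D,) learned query vector scoring how salient each frame is
    Returns a single (D,) video-level embedding.
    """
    # Scaled dot-product salience score per frame, as in standard attention
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])  # (T,)
    weights = softmax(scores)          # non-negative, sums to 1
    return weights @ frame_feats       # convex combination of frames, (D,)

rng = np.random.default_rng(0)
T, D = 16, 512                         # assumed: 16 frames, 512-dim embeddings
feats = rng.standard_normal((T, D))
q = rng.standard_normal(D)
video_emb = fuse_frames(feats, q)
print(video_emb.shape)                 # (512,)
```

Because the weights form a convex combination, frames with higher salience scores dominate the fused embedding, which is how temporal dynamics (e.g. end-systolic vs. end-diastolic frames) can be emphasized over uniform averaging.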
📝 Abstract
Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly to build and limit adaptability across clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture the temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi-Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models to few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.
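The multi-resolution input idea behind EchoZoom can be illustrated with a simple sketch: encode each frame both at full field of view and as progressively tighter central crops, so the ventricular region is seen at higher effective resolution. Everything here is an assumption for illustration (crop fractions, output size, nearest-neighbor resizing); the paper's actual strategy may differ.

```python
import numpy as np

def center_crop(img, frac):
    """Crop the central frac x frac region of a 2-D grayscale frame."""
    h, w = img.shape
    ch, cw = int(h * frac), int(w * frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]

def resize_nn(img, size):
    """Nearest-neighbor resize to (size, size) via index sampling."""
    h, w = img.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[np.ix_(ys, xs)]

def echozoom_views(img, fracs=(1.0, 0.75, 0.5), size=112):
    """Return a stack of multi-resolution views of one frame: the full
    field plus tighter central crops, each resized to a common size so
    they can be fed through the same image encoder."""
    return np.stack([resize_nn(center_crop(img, f), size) for f in fracs])

rng = np.random.default_rng(0)
frame = rng.standard_normal((200, 300))   # one hypothetical echo frame
views = echozoom_views(frame)
print(views.shape)                        # (3, 112, 112)
```

Each view would then be encoded separately and the resulting features combined, so local ventricular structure captured by the tight crops complements the global context of the full-field view.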