Score: 1

CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner

Published: September 21, 2025 | arXiv ID: 2509.17065v1

By: Yao Du, Jiarong Guo, Xiaomeng Li

Potential Business Impact:

Helps clinicians estimate left ventricular ejection fraction, a key measure of heart function, more accurately from echocardiogram videos, even when only a few labeled examples are available.

Business Areas:
Image Recognition, Data and Analytics, Software

Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.
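
The abstract names two components: MFL, an attention mechanism that fuses the most informative frames, and EchoZoom, a multi-resolution input strategy. The PyTorch sketch below illustrates how such components could plausibly be structured; the module names, scoring head, zoom scales, and tensor shapes are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of attention-based frame aggregation and multi-scale
# inputs, in the spirit of the abstract's MFL and EchoZoom components.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFrameAggregator(nn.Module):
    """Fuse per-frame CLIP embeddings with learned attention weights
    (hypothetical stand-in for the paper's MFL module)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)  # assumed scoring head

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, embed_dim)
        weights = F.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)            # (B, D)


def multi_scale_views(frames: torch.Tensor, scales=(1.0, 1.5)) -> list:
    """EchoZoom-style multi-resolution inputs: center-crop at several
    zoom levels and resize back to the encoder's input resolution."""
    # frames: (batch * num_frames, 3, H, W)
    _, _, h, w = frames.shape
    views = []
    for s in scales:
        ch, cw = int(h / s), int(w / s)
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = frames[:, :, top:top + ch, left:left + cw]
        views.append(
            F.interpolate(crop, size=(h, w), mode="bilinear",
                          align_corners=False)
        )
    return views
```

One plausible way to combine these pieces, consistent with the abstract but not confirmed in detail: encode each zoomed view per frame with a frozen CLIP image encoder, average the per-view features, pool across frames with the attention aggregator, and compare the resulting video embedding against LVEF-related text prompts in the usual CLIP fashion.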

Country of Origin
🇭🇰 Hong Kong

Repos / Data Links
https://github.com/xmed-lab/CardiacCLIP

Page Count
11 pages

Category
Computer Science:
Computer Vision and Pattern Recognition