MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of automatic standard anatomical plane localization in fetal ultrasound videos—namely low accuracy, high manual annotation cost, and poor inter-observer consistency—this paper proposes a novel visual query-driven video clip localization paradigm. Methodologically, the authors introduce the Multi-Tier Class-Aware Token Transformer (MCAT), which jointly models temporal dynamics and anatomical semantics; incorporate a visual query mechanism for target-oriented retrieval; and substantially reduce computational overhead via token sparsification and joint training across domains (ultrasound and natural video). Experiments demonstrate substantial improvements: +10% and +13% mIoU on two ultrasound datasets and +5.35% mIoU on Ego4D, using only 4% of the original token count. The proposed method achieves a strong trade-off among accuracy, efficiency, and generalizability, showing potential for clinical deployment in resource-constrained settings, particularly low- and middle-income countries (LMICs).

📝 Abstract
Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies, ignoring the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce the Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method, to assist sonographers by enabling them to capture a quick US sweep. Given a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mIoU on the ultrasound datasets and by 5.35% mIoU on the Ego4D dataset, while using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening and diagnosis, and allowing sonographers to examine more patients.
Problem

Research questions and friction points this paper is trying to address.

Automates standard frame selection in fetal ultrasound videos
Reduces variability and time in manual frame selection
Improves accuracy in locating anatomical clips via visual queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual query-based video clip localization method
Multi-Tier Class-Aware Token Transformer (MCAT)
Efficient token usage with high accuracy
Divyanshu Mishra
DPhil Student at University of Oxford
Video Understanding · Video SSL · Medical Image Analysis · Multi-Modal Learning · Ultrasound
Pramit Saha
Department of Engineering Science, University of Oxford
Deep Learning · Federated Learning · Multimodal Learning · Computer Vision · Medical Image Analysis
He Zhao
Institute of Life Course and Medical Sciences, University of Liverpool
Netzahualcoyotl Hernandez-Cruz
Department of Engineering Science, University of Oxford
Olga Patey
Nuffield Department of Women’s and Reproductive Health, University of Oxford
Aris Papageorghiou
Nuffield Department of Women’s and Reproductive Health, University of Oxford
J. Alison Noble
Department of Engineering Science, University of Oxford