FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

* Equally contributing first authors
1 Mohamed Bin Zayed University of Artificial Intelligence, 2 SEHA Corniche Hospital

While foundation models are transforming many domains of medical imaging, fetal ultrasound analysis remains a significant bottleneck. The complex nature of these images and the lack of readily available multimodal data limit the performance of current approaches. We introduce FetalCLIP, a vision-language foundation model specifically designed and trained for fetal ultrasound image understanding. FetalCLIP leverages multimodal pre-training on a large and diverse dataset to generate universal representations of fetal ultrasound imagery. This innovative approach allows FetalCLIP to capture essential anatomical information, yielding robust representations that can be applied to a variety of clinically relevant downstream tasks.

Figure: Distribution of routine pregnancy ultrasound scan data, which constitutes the largest portion of the FetalCLIP pretraining data.



Abstract

Foundation models are becoming increasingly effective in the medical domain, offering models pre-trained on large datasets that can be readily adapted for downstream tasks. Despite this progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representations of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text, the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.

FetalCLIP is a novel visual-language foundation model explicitly engineered for fetal ultrasound images.

Main contributions:
  1. Foundation Model: We introduce FetalCLIP, the first-of-its-kind foundation model designed for fetal ultrasound image analysis, pretrained on a large and diverse dataset of paired fetal ultrasound images and text.
  2. Zero-Shot Performance: FetalCLIP achieves outstanding zero-shot performance in fetal plane classification and gestational age estimation, while also effectively clustering fetal anatomical structures, potentially improving workflow efficiency in clinical practice.
  3. Feature Extraction: Our extensive evaluation demonstrates that FetalCLIP serves as a strong feature extractor for fetal ultrasound analysis.

Dataset Collection, Pretraining and Evaluation Pipeline

Figure: Overview of FetalCLIP development and performance. a, Dataset curation of fetal ultrasound image-caption pairs used for the FetalCLIP pretraining. The pretraining data was curated from two sources: (1) routine pregnancy ultrasound scans, comprising 207,943 images with corresponding LLM-generated pseudocaptions, which incorporate clinicians' labels, gestational age, and pixel spacing; and (2) 2,092 image-caption pairs derived from a fetal ultrasound textbook. b, FetalCLIP pretraining step through contrastive learning, maximizing the similarity between paired images and captions while minimizing similarity to unrelated pairs. c, Radar plot demonstrating FetalCLIP's superior performance over existing vision-language foundation models across diverse fetal ultrasound tasks, including fetal plane classification, congenital heart disease detection, and fetal structure segmentation on different views.
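
To make the pretraining objective concrete, below is a minimal PyTorch sketch of a symmetric CLIP-style contrastive (InfoNCE) loss over a batch of image-caption pairs. It is illustrative only: the tensor names, temperature value, and training details are assumptions rather than FetalCLIP's exact implementation.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity between every image and every caption in the batch
    logits = image_emb @ text_emb.t() / temperature
    # Matched image-caption pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img_to_text = F.cross_entropy(logits, targets)
    loss_text_to_img = F.cross_entropy(logits.t(), targets)
    return (loss_img_to_text + loss_text_to_img) / 2

Each matched image-caption pair serves as a positive while every other pairing in the batch acts as a negative, which pulls paired image and caption embeddings together and pushes unrelated pairs apart.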

Figure: Examples of various image views from the Corniche Hospital dataset. a, Representative examples of standard views from the fetal ultrasound dataset, showcasing diverse anatomical planes such as 4CH, Femur, Kidney, and Cerebellum. b, Examples of mislabeled samples detected by Confident Learning. c, Ultrasound images containing multiple clinician labels.

Zero-shot capabilities

We conducted a study to evaluate FetalCLIP's zero-shot capabilities in fetal plane classification and gestational age estimation.

Figure: Zero-shot capabilities of FetalCLIP. a, Illustration of zero-shot fetal plane classification. We leveraged an LLM to generate prompts for a set of predefined candidate planes. The predicted plane was determined by identifying the highest similarity between the image embedding and the prompt embeddings. b, Zero-shot performance in distinguishing five standard fetal planes and three brain subplanes. FetalCLIP achieved the highest accuracy with an average F1 score of 87.1%, outperforming the specialist model SonoNet by 17.2%. c, Illustration of zero-shot gestational age (GA) estimation. A similarity map was computed between the image embeddings and prompt embeddings spanning 14 to 40 weeks of GA. The similarity map was then postprocessed to predict GA. d, GA estimation performance of visual-language foundation models. The blue points represent valid predictions, while the red points indicate invalid predictions. The black line represents the 50th percentile of the quantile regression population, and the orange lines represent the 2.5th and 97.5th percentiles of the population as provided by the World Health Organization. Unlike FetalCLIP, other models demonstrated no ability to infer GA from fetal ultrasound head images.
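
The zero-shot procedure in panel a can be sketched with a generic open_clip-style interface: each candidate plane is described by a text prompt, and the predicted plane is the prompt whose embedding is most similar to the image embedding. The checkpoint name, prompts, and file path below are placeholders, not the exact ones used for FetalCLIP.

import torch
import open_clip
from PIL import Image

# Stand-in checkpoint; released FetalCLIP weights would be loaded analogously.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Hypothetical LLM-generated prompts for candidate planes
candidate_prompts = [
    "An ultrasound image of the fetal abdomen standard plane.",
    "An ultrasound image of the fetal brain standard plane.",
    "An ultrasound image of the fetal femur standard plane.",
    "An ultrasound image of the fetal thorax standard plane.",
    "An ultrasound image of the maternal cervix.",
]

image = preprocess(Image.open("fetal_scan.png")).unsqueeze(0)
text = tokenizer(candidate_prompts)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.T).softmax(dim=-1)

predicted_plane = candidate_prompts[similarity.argmax().item()]

Zero-shot GA estimation follows the same pattern, except the candidate prompts describe gestational ages from 14 to 40 weeks and the resulting similarity profile is postprocessed into a single GA prediction.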

Linear Probing for classification tasks

Motivated by the growing need for efficient tuning to adapt large pre-trained models to diverse applications, we assessed the capability of the FetalCLIP image encoder to extract generalizable features for downstream fetal ultrasound tasks. In this setup, the image encoder was kept entirely frozen, while a lightweight network was trained on the extracted features for a specific downstream task (e.g., a linear layer for classification).
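
As a rough sketch of this setup, the frozen image encoder produces embeddings and only a single linear layer is trained on top of them. The embedding dimension, class count, and learning rate below are illustrative assumptions.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen image encoder followed by a single trainable linear layer."""
    def __init__(self, image_encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = image_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the foundation model frozen
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        # Assumes the encoder maps a batch of images to (batch, embed_dim) embeddings
        with torch.no_grad():
            feats = self.encoder(images)
        return self.classifier(feats)

# Only the linear head is optimized:
# probe = LinearProbe(frozen_image_encoder, embed_dim=768, num_classes=5)
# optimizer = torch.optim.AdamW(probe.classifier.parameters(), lr=1e-3)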

Figure: Linear probing for classification tasks. a, Schematic of linear probing for classifying different fetal planes. The image encoder of a visual-language foundation model was used to extract image embeddings, followed by a trainable linear layer for classification. b-c, F1 scores in the testing set for fetal plane and brain subplane classification, from 5-fold cross-validations with five different seeds. The bars represent the mean F1 scores, while the error bars indicate the standard deviation. d, Illustration of linear probing for CHD detection from an ultrasound clip. Embeddings were extracted from each image in the clip and concatenated. A trainable linear layer was then applied to leverage the combined embeddings for classification. e, AUROC comparisons for CHD detection across 5-fold cross-validations with 5 different seeds. f, ROC curve for CHD prediction showing the median performance of each model.
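
Panel d extends the same idea to ultrasound clips: per-frame embeddings are extracted, concatenated into a single clip-level vector, and fed to a trainable linear head. The frame count and embedding size below are chosen purely for illustration.

import torch
import torch.nn as nn

def clip_level_embedding(image_encoder, clip_frames):
    """Embed each frame of an ultrasound clip and concatenate the embeddings.

    clip_frames: tensor of shape (num_frames, channels, height, width).
    """
    with torch.no_grad():
        frame_embs = image_encoder(clip_frames)  # (num_frames, embed_dim)
    return frame_embs.flatten()                  # (num_frames * embed_dim,)

# A single linear layer maps the concatenated embedding to a CHD logit,
# e.g. assuming 16 frames per clip and 768-dimensional embeddings:
chd_head = nn.Linear(16 * 768, 1)
# logit = chd_head(clip_level_embedding(frozen_image_encoder, clip_frames))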

Segmentation

Accurate pixel-level classification is critical for precise fetal growth biometry calculations. We investigated the foundation models' ability to provide fine-grained intermediate image features essential for localizing fetal anatomical structures. We applied a lightweight decoder with few parameters (∼1.3 million for ViT-B encoders and ∼1.6 million for ViT-L encoders) to leverage the intermediate image features for accurate segmentation of fetal anatomical structures.
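
A minimal example of such a lightweight decoder is sketched below: patch tokens tapped from a few transformer blocks are reshaped into feature maps, projected to a small common width, fused, and upsampled into a segmentation map. The number of tapped blocks, hidden width, and patch grid are illustrative assumptions, not the exact head used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightSegDecoder(nn.Module):
    """Illustrative decoder over intermediate ViT features."""
    def __init__(self, embed_dim, num_taps=4, hidden_dim=128, num_classes=2, patch_grid=(14, 14)):
        super().__init__()
        self.patch_grid = patch_grid
        # Project each tapped block's tokens to a small common channel width
        self.proj = nn.ModuleList(
            [nn.Conv2d(embed_dim, hidden_dim, kernel_size=1) for _ in range(num_taps)]
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden_dim * num_taps, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, num_classes, kernel_size=1),
        )

    def forward(self, token_maps, out_size):
        # token_maps: list of (batch, num_patches, embed_dim) tokens from selected blocks
        h, w = self.patch_grid
        feats = []
        for tokens, proj in zip(token_maps, self.proj):
            fmap = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
            feats.append(proj(fmap))
        fused = self.fuse(torch.cat(feats, dim=1))
        # Upsample coarse patch-level logits to the input image resolution
        return F.interpolate(fused, size=out_size, mode="bilinear", align_corners=False)

With settings in this range, such a head stays around one million trainable parameters, keeping adaptation cheap relative to the frozen encoder.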

Figure: Segmentation of various fetal structures across different views. a, Illustration of the efficient adaptation of an image encoder for segmenting fetal structures in different views. A lightweight decoder was developed to leverage intermediate embeddings from the image encoder for segmentation. NL denotes the number of transformer blocks in the image encoder, which is 12 for ViT-B and 24 for ViT-L. b, Average segmentation performance across structures within each view (head, abdomen, and 4-chamber) evaluated over 5-fold cross-validations with five different seeds. c-d, Dice Similarity Coefficient (DSC) for individual structures in the abdomen view and 4-chamber view, respectively. LV, left ventricle; LA, left atrium; RA, right atrium; RV, right ventricle; IVS, interventricular septum.

Acknowledgements

We express our gratitude to Corniche Hospital in Abu Dhabi for providing prenatal scan data along with fetal heart scans, and to the Department of Health (DOH) Abu Dhabi for their support in approving the study, which facilitated access to the anonymized data for internal purposes. We thank Alfred Z. Abuhamad for allowing us to use his textbook for foundation model pretraining.


For additional details about the FetalCLIP dataset collection, pretraining, and evaluation pipeline, please refer to our main paper. Thank you!

BibTeX

@article{maani2025fetalclipvisuallanguagefoundationmodel,
  title={FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis},
  author={Maani, Fadillah and Saeed, Numan and Saleem, Tausifa and Farooq, Zaid and Alasmawi, Hussain and Diehl, Werner and Mohammad, Ameera and Waring, Gareth and Valappi, Saudabi and Bricker, Leanne and Yaqub, Mohammad},
  journal={arXiv preprint arXiv:2502.14807},
  year={2025}
}