VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

Gao Huang; Haojun Jiang; Shiji Song; Teng Wang; Yujiao Deng; Yuxuan Wang; Zhenguo Sun

arxiv: 2510.06809 · v3 · pith:BAASUH3Ynew · submitted 2025-10-08 · 💻 cs.CV

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

Pith reviewed 2026-05-21 21:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords: VA-Adapter, ultrasound foundation model, echocardiography probe guidance, vision-action adapter, parameter-efficient adaptation, 3D structure inference, medical imaging AI

The pith

A lightweight Vision-Action Adapter injects patient-specific 3D heart structure understanding into pre-trained ultrasound foundation models for echocardiography probe guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the operational difficulty of echocardiography, where probe placement must account for large differences in how each patient's heart appears in 2D images and how the underlying 3D anatomy is shaped. It starts from an ultrasound foundation model already trained on vast data to supply reliable 2D image features, then adds a small Vision-Action Adapter that processes sequences of past images together with probe movements. This lets the system infer the current patient's 3D layout on the fly and suggest the next probe adjustment, without retraining the full foundation model or supplying explicit 3D labels. Experiments across more than 1.31 million samples show the resulting system beats prior probe-guidance approaches while training roughly 33 times fewer parameters.

Core claim

Embedding the VA-Adapter inside the image encoder of an ultrasound foundation model enables the model to infer cardiac anatomy from historical vision-action sequences, thereby supplying the missing patient-specific 3D structure understanding needed for accurate probe guidance without explicit 3D supervision or full-model retraining.

What carries the argument

Vision-Action Adapter (VA-Adapter), a lightweight module inserted into the foundation model's image encoder that processes sequences of 2D images and corresponding probe actions to build patient-specific 3D navigation capability.
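
To make the adapter idea concrete, here is a minimal sketch of a generic residual bottleneck adapter that mixes the current image feature with a summary of past image-plus-action vectors. It is an illustration under stated assumptions, not the paper's architecture: the function names, the toy dimensions, and the mean-pooling of history are all hypothetical stand-ins, and the actual VA-Adapter design should be taken from the authors' code release.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def mean_history(history):
    """Average past [image_feature + action] vectors into one context vector.

    Mean pooling is a placeholder for whatever sequence model is really used.
    """
    return [sum(column) / len(history) for column in zip(*history)]


def adapter_forward(feature, history, w_down, w_up):
    """Residual bottleneck: feature + W_up @ relu(W_down @ [feature; context])."""
    context = mean_history(history)
    hidden = relu(matvec(w_down, feature + context))  # down-projection
    update = matvec(w_up, hidden)                      # up-projection
    return [f + u for f, u in zip(feature, update)]    # residual connection


# Hypothetical toy sizes: 4-d image feature, 2-d probe action, 2-d bottleneck.
FEAT_DIM, ACTION_DIM, BOTTLENECK = 4, 2, 2

rng = random.Random(0)
w_down = [[rng.uniform(-0.1, 0.1) for _ in range(2 * FEAT_DIM + ACTION_DIM)]
          for _ in range(BOTTLENECK)]
# Zero-initialised up-projection: the adapter starts as an identity mapping,
# so the frozen encoder's features pass through unchanged before training.
w_up = [[0.0] * BOTTLENECK for _ in range(FEAT_DIM)]

history = [
    [0.2, 0.1, 0.0, 0.5] + [1.0, 0.0],   # past frame feature + probe action
    [0.3, 0.0, 0.1, 0.4] + [0.0, -1.0],
]
current = [0.25, 0.05, 0.05, 0.45]
adapted = adapter_forward(current, history, w_down, w_up)
```

With the up-projection at zero, `adapted` equals `current`, which is why this kind of initialisation is a common choice when the backbone stays frozen: training only has to learn the correction.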

If this is right

Probe guidance can be achieved by freezing most of a large foundation model and training only a small adapter.
Adaptation to new patients becomes feasible with far lower compute and data requirements than full retraining.
The same adapter pattern supplies a route for adding 3D context to other 2D foundation models used in medical imaging.
Real-time probe suggestions become practical because the adapter runs on top of an already-trained encoder.
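
The parameter-efficiency point can be checked with simple arithmetic. The sketch below counts the trainable weights of a generic bottleneck adapter and computes a reduction factor against a baseline. The layer sizes are hypothetical and chosen only for illustration; they are not the paper's configuration, and the source does not report absolute parameter counts.

```python
def bottleneck_adapter_params(hidden_dim, bottleneck_dim, num_layers):
    """Trainable parameters of one down/up projection pair per layer, with biases."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # weights + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # weights + bias
    return num_layers * (down + up)


def reduction_factor(baseline_trainable, adapter_trainable):
    """How many times fewer parameters the adapter trains than the baseline."""
    return baseline_trainable / adapter_trainable


# Hypothetical sizes (assumption): 768-d hidden state, 64-d bottleneck, 12 layers.
adapter_count = bottleneck_adapter_params(hidden_dim=768, bottleneck_dim=64, num_layers=12)
print(adapter_count)  # 1189632
```

A claim such as "33 times fewer" is then `reduction_factor(baseline, adapter)` for the two models being compared, which is why the referee's request for absolute counts matters: the ratio alone does not say which baseline was used.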

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lightweight-adapter pattern could be tried on foundation models for other ultrasound procedures that also need 3D spatial reasoning.
If the adapter can be made still smaller, the whole guidance system might run on portable or bedside ultrasound hardware.
The use of historical sequences suggests the method could support continual adaptation within a single patient exam.

Load-bearing premise

The pre-trained ultrasound foundation model already holds sufficiently robust 2D image representations that a small adapter can add patient-specific 3D understanding without any direct 3D training data or full retraining.

What would settle it

A controlled test on a new set of echocardiography scans from patients with atypical heart geometries in which the VA-Adapter version shows no gain in probe-placement accuracy or navigation success rate over the unmodified foundation model.

Figures

Figures reproduced from arXiv: 2510.06809 by Gao Huang, Haojun Jiang, Shiji Song, Teng Wang, Yujiao Deng, Yuxuan Wang, Zhenguo Sun.

**Figure 1.** Illustration of the dataset. (a) Large-scale diagnostic foundation model dataset. (b) Our dataset statistic. (c) Standard ...

**Figure 2.** Illustration of the architecture of the VA-Adapter. The left side shows that we insert VA-Adapter into the deep layers ...

**Figure 3.** Performance comparison of different PEFT methods on USFM and BiomedCLIP.

**Figure 4.** Ablation study on vision-action interaction module of the EchoCLIP model.

**Figure 6.** Based on the action predicted by the model, we ...

**Figure 5.** Ablation study on adapter dimension of the EchoCLIP model.

**Figure 6.** Visualization of the model's prediction.

Original abstract

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicates image feature understanding, and differences in individual three-dimensional (3D) structures, which poses challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we meticulously design a Vision-Action Adapter (VA-Adapter) to online inject the capability of understanding individual 3D structures. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VA-Adapter adds a lightweight vision-action module to a frozen ultrasound foundation model for probe guidance, with clear efficiency gains on a large dataset but limited direct evidence for true 3D inference.

The paper introduces VA-Adapter, a small module embedded in the image encoder of a pre-trained ultrasound foundation model. It processes sequences of 2D images and probe actions to infer patient-specific 3D cardiac structure for guidance tasks. This is the main new piece: a targeted way to add 3D awareness without full retraining or explicit 3D labels, by mimicking how a sonographer builds up spatial understanding over time.

They test it on a dataset of over 1.31 million samples and report better performance than existing probe guidance models while training roughly 33 times fewer parameters. That efficiency result is the strongest part and could matter for real deployment where full model updates are costly. The design itself is straightforward and avoids obvious circularity in the claims. The experiments are scaled up, which helps, and the code release is a plus for checking the implementation.

The main soft spot is the validation of the 3D part. The stress-test note is on target here. There are no 3D reconstruction losses, explicit 3D labels, or direct probe-position error metrics to show the model has learned geometry rather than just sharpening 2D feature responses. On a dataset this large, the gains could come from better 2D calibration alone, so the central story about solving individual 3D variability rests on indirect evidence. More controls for patient variability and statistical tests would tighten that up.

This work is for researchers in medical imaging and efficient adaptation of foundation models. Someone focused on ultrasound applications or lightweight modules for clinical tools would get practical value from the design and scale. It deserves a serious referee because the problem is concrete, the efficiency claim is testable, and the adapter idea is worth closer scrutiny even if the 3D validation needs work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VA-Adapter, a lightweight vision-action adapter embedded into a frozen ultrasound foundation model's image encoder. The adapter enables online inference of patient-specific 3D cardiac anatomy from historical 2D vision-action sequences, addressing individual variability in echocardiography probe guidance. Extensive experiments on a dataset of over 1.31 million samples claim that VA-Adapter outperforms strong probe guidance baselines while training approximately 33 times fewer parameters.

Significance. If the performance gains and parameter efficiency hold under rigorous controls, the work would offer a practical route to deploy foundation models for probe guidance without full retraining, potentially lowering barriers to high-quality echocardiography. The design choice to mimic sonographer cognition via sequential vision-action modeling is conceptually appealing and could generalize to other ultrasound navigation tasks. The reported scale of the dataset is a clear strength.

major comments (2)

[Method and Experiments] The central claim that VA-Adapter specifically solves the 3D structural variability problem rests on the assumption that historical vision-action sequences supply sufficient geometric signal for 3D inference. However, the manuscript provides no 3D reconstruction loss, explicit 3D labels, or direct probe-position error metric that would distinguish true 3D anatomy understanding from improved 2D feature calibration alone.
[Experiments] Table reporting main results (presumably Table 1 or equivalent in §4): while outperformance and the 33× parameter reduction are stated, the text supplies no details on baseline implementations, statistical significance testing, or controls for inter-patient variability, leaving the empirical superiority only partially verifiable.
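
One way to supply the significance testing asked for above is a paired sign-flip permutation test on per-patient scores. The sketch below is an illustration only: the function name and its inputs are hypothetical, and it is a randomization test rather than the t-test or Wilcoxon test a revised manuscript might use.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, num_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    scores_a and scores_b are per-patient metrics for two models, aligned so
    that index i refers to the same patient in both lists.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equal-length score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (num_resamples + 1)
```

Pairing by patient matters here because inter-patient variability is exactly the nuisance factor the paper highlights; an unpaired test would mix that variability into the noise term.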

minor comments (2)

[Abstract] The abstract states 'approximately 33 times fewer trained parameters' without giving the absolute parameter counts for VA-Adapter versus the strongest baseline; adding these numbers would improve precision.
[Method] Notation for the vision-action sequence input and how actions are encoded alongside image features could be clarified with a diagram or explicit equations in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.

Referee: [Method and Experiments] The central claim that VA-Adapter specifically solves the 3D structural variability problem rests on the assumption that historical vision-action sequences supply sufficient geometric signal for 3D inference. However, the manuscript provides no 3D reconstruction loss, explicit 3D labels, or direct probe-position error metric that would distinguish true 3D anatomy understanding from improved 2D feature calibration alone.

Authors: We thank the referee for this precise observation. The training data consists of 2D ultrasound frames paired with probe actions and does not contain explicit 3D labels or reconstruction supervision; therefore no 3D reconstruction loss is used. The VA-Adapter instead learns an implicit representation of patient-specific 3D cardiac anatomy by conditioning the frozen foundation-model encoder on historical vision-action sequences, enabling the model to predict probe movements that account for individual geometry. This design choice mirrors how sonographers acquire 3D understanding through sequential observation and action rather than explicit volumetric reconstruction. Probe-guidance performance (success rate, image-quality scores) serves as the downstream metric that validates the utility of this implicit 3D modeling. We will add a dedicated paragraph in the revised Methods and Discussion sections clarifying the implicit versus explicit distinction, and will report probe-position statistics if the dataset contains them.
revision: partial

Referee: [Experiments] Table reporting main results (presumably Table 1 or equivalent in §4): while outperformance and the 33× parameter reduction are stated, the text supplies no details on baseline implementations, statistical significance testing, or controls for inter-patient variability, leaving the empirical superiority only partially verifiable.

Authors: We agree that these details are essential for rigorous verification. In the revised manuscript we will: (i) expand the experimental setup subsection with complete baseline implementation details and hyper-parameter choices, (ii) add statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing VA-Adapter against each baseline, and (iii) explicitly describe inter-patient controls, including patient-wise data splits that ensure no patient overlap between training and test sets together with patient-stratified performance metrics. These additions will be incorporated into the main results table and the accompanying text.
revision: yes
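
The patient-wise split mentioned in point (iii) can be sketched as follows. This is a minimal illustration assuming each sample is a dict carrying a `patient_id` field; that field name, the helper name, and the 20% test fraction are hypothetical and not taken from the paper or its code.

```python
import random


def patient_wise_split(samples, test_fraction=0.2, seed=0):
    """Split samples so every patient lands entirely in train or entirely in test."""
    patient_ids = sorted({s["patient_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    # Hold out at least one patient, even for very small cohorts.
    num_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:num_test])
    train = [s for s in samples if s["patient_id"] not in test_patients]
    test = [s for s in samples if s["patient_id"] in test_patients]
    return train, test


# Toy data (assumption): 10 patients with 3 frames each.
samples = [{"patient_id": p, "frame": f} for p in range(10) for f in range(3)]
train, test = patient_wise_split(samples)
```

Splitting on patient identifiers rather than on individual frames is what prevents frames from the same heart appearing on both sides of the split, which would otherwise inflate test performance.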

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on dataset comparisons

The paper proposes a VA-Adapter module to adapt a frozen ultrasound foundation model for probe guidance by injecting 3D structure understanding from vision-action sequences. All load-bearing claims (outperformance on 1.31M samples, 33x fewer parameters) are justified solely by empirical experiments against baselines. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. The derivation chain is absent; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the existence of a capable pre-trained ultrasound foundation model and introduces one new architectural component whose value is shown through empirical results rather than theoretical derivation.

axioms (1)

Domain assumption: ultrasound foundation models learn robust image representations from vast datasets that transfer to probe guidance tasks.
Invoked in the abstract as the basis for leveraging these models before adding the adapter.

invented entities (1)

VA-Adapter (no independent evidence)
purpose: To inject online understanding of individual 3D cardiac structures by processing vision-action sequences inside the foundation model's image encoder.
New module proposed and embedded in this work; no independent evidence outside the paper's experiments is provided.

[1] [1]

arXiv preprint arXiv:2405.01409 (2024)

Amadou, A.A., Singh, V ., Ghesu, F.C., Kim, Y .H., Stanciulescu, L., Sai, H.P., Sharma, P., Young, A., Rajani, R., Rhode, K.: Goal-conditioned re- inforcement learning for ultrasound navigation guidance. arXiv preprint arXiv:2405.01409 (2024)

work page arXiv 2024

[2] [2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y ., Ballas, N.: Self-supervised learning from images with a joint- embedding predictive architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15619– 15629 (2023)

work page 2023

[3] [3]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Bao, M., Wang, Y ., Wei, X., Jia, B., Fan, X., Lu, D., Gu, Y ., Cheng, J., Zhang, Y ., Wang, C., et al.: Real-world visual navigation for cardiac ultrasound view planning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 317–326. Springer (2024)

work page 2024

[4] [4]

Advances in neural information processing systems34, 15084–15097 (2021)

Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., Mordatch, I.: Decision transformer: Reinforce- ment learning via sequence modeling. Advances in neural information processing systems34, 15084–15097 (2021)

work page 2021

[5] [5]

In: Proceedings of the IEEE/CVF international conference on computer vision

Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9640–9649 (2021)

work page 2021

[6] [6]

Nature Medicine pp

Christensen, M., Vukadinovic, M., Yuan, N., Ouyang, D.: Vision– language foundation model for echocardiogram interpretation. Nature Medicine pp. 1–8 (2024)

work page 2024

[7] [7]

In: Med- ical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23

Droste, R., Drukker, L., Papageorghiou, A.T., Noble, J.A.: Automatic probe movement guidance for freehand obstetric ultrasound. In: Med- ical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. pp. 583–592. Springer (2020)

work page 2020

[8] [8]

NPJ digital medicine3(1), 10 (2020)

Ghorbani, A., Ouyang, D., Abid, A., He, B., Chen, J.H., Harrington, R.A., Liang, D.H., Ashley, E.A., Zou, J.Y .: Deep learning interpretation of echocardiograms. NPJ digital medicine3(1), 10 (2020)

work page 2020

[9] [9]

IEEE Transactions on Medical Robotics and Bionics7(2), 782–792 (2025)

Hao, M., Zhang, P., Hou, X., Gu, X., Zhou, X.H., Hou, Z.G., Chen, C., Wang, S.: Towards autonomous cardiac ultrasound scanning: Combining physician expertise and machine intelligence. IEEE Transactions on Medical Robotics and Bionics7(2), 782–792 (2025)

work page 2025

[10] [10]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He, K., Chen, X., Xie, S., Li, Y ., Doll´ar, P., Girshick, R.: Masked autoen- coders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000– 16009 (2022)

work page 2022

[11] [11]

In: Proceedings of the 36th International Conference on Machine Learning

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning. pp. 2790–2799 (2019)

work page 2019

[12] [12]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E.J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Chen, W.: Lora: Low-rank adaptation of large language models. CoRR abs/2106.09685(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

In: International Workshop on Advances in Simplifying Medical Ultrasound

Jiang, H., Li, M., Sun, Z., Jia, N., Sun, Y ., Luo, S., Song, S., Huang, G.: Structure-aware world model for probe guidance via large-scale self-supervised pre-train. In: International Workshop on Advances in Simplifying Medical Ultrasound. pp. 58–67. Springer (2024)

work page 2024

[14] [14]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Jiang, H., Sun, Z., Jia, N., Li, M., Sun, Y ., Luo, S., Song, S., Huang, G.: Cardiac copilot: Automatic probe guidance for echocardiography with world model. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 190–199. Springer (2024)

work page 2024

[15] [15]

arXiv preprint arXiv:2408.15026 (2024)

Jiang, H., Sun, Z., Sun, Y ., Jia, N., Li, M., Luo, S., Song, S., Huang, G.: Sequence-aware pre-training for echocardiography probe guidance. arXiv preprint arXiv:2408.15026 (2024)

work page arXiv 2024

[16] [16]

Nature Communications16(1), 7893 (2025)

Jiang, H., Zhao, A., Yang, Q., Yan, X., Wang, T., Wang, Y ., Jia, N., Wang, J., Wu, G., Yue, Y ., et al.: Towards expert-level autonomous carotid ultrasonography with large-scale learning-based robotic system. Nature Communications16(1), 7893 (2025)

[17]

Jiao, J., Zhou, J., Li, X., Xia, M., Huang, Y., Huang, L., Wang, N., Zhang, X., Zhou, S., Wang, Y., et al.: Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Medical Image Analysis 96, 103202 (2024)

[18]

Li, K., Li, A., Xu, Y., Xiong, H., Meng, M.Q.H.: Rl-tee: Autonomous probe guidance for transesophageal echocardiography based on attention-augmented deep reinforcement learning. IEEE Transactions on Automation Science and Engineering 21(2), 1526–1538 (2023)

[19]

Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)

[20]

MH Nguyen, D., Nguyen, H., Diep, N., Pham, T.N., Cao, T., Nguyen, B., Swoboda, P., Ho, N., Albarqouni, S., Xie, P., et al.: Lvm-med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching. Advances in Neural Information Processing Systems 36 (2024)

[21]

Mitchell, C., Rahko, P.S., Blauwet, L.A., Canaday, B., Finstuen, J.A., Foster, M.C., Horton, K., Ogunyankin, K.O., Palma, R.A., Velazquez, E.J.: Guidelines for performing a comprehensive transthoracic echocardiographic examination in adults: recommendations from the american society of echocardiography. Journal of the American Society of Echocardiography 32(1), 1–64 (2019)

[22]

Narang, A., Bae, R., Hong, H., Thomas, Y., Surette, S., Cadieu, C., Chaudhry, A., Martin, R.P., McCarthy, P.M., Rubenson, D.S., et al.: Utility of a deep-learning algorithm to guide novices to acquire echocardiograms for limited diagnostic use. JAMA cardiology 6(6), 624–632 (2021)

[23]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

[24]

Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C.P., Heidenreich, P.A., Harrington, R.A., Liang, D.H., Ashley, E.A., et al.: Video-based ai for beat-to-beat assessment of cardiac function. Nature 580(7802), 252–256 (2020)

[25]

Roth, G.A., Johnson, C., Abajobir, A., Abd-Allah, F., Abera, S.F., Abyu, G., Ahmed, M., Aksut, B., Alam, T., Alam, K., et al.: Global, regional, and national burden of cardiovascular diseases for 10 causes, 1990 to

[26]

Journal of the American college of cardiology 70(1), 1–25 (2017)

[27]

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. pp. 10347–10357. PMLR (2021)

[28]

Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: European conference on computer vision. pp. 20–36. Springer (2016)

[29]

Wang, T., Jiang, H., Wang, Y., Sun, Z., Yan, X., Li, X., Huang, G.: Ultrahit: A hierarchical transformer architecture for generalizable internal carotid artery robotic ultrasonography. arXiv preprint arXiv:2509.13832 (2025)

[30]

Yang, L., Zhang, R.Y., Wang, Y., Xie, X.: Mma: Multi-modal adapter for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23826–23837 (2024)

[31]

Yue, Y., Wang, Y., Jiang, H., Liu, P., Song, S., Huang, G.: Echoworld: Learning motion-aware world models for echocardiography probe guidance. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 25993–26003 (2025)

[32]

Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)
