arxiv: 2510.06809 · v3 · pith:BAASUH3Ynew · submitted 2025-10-08 · 💻 cs.CV

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

Pith reviewed 2026-05-21 21:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords: VA-Adapter, ultrasound foundation model, echocardiography probe guidance, vision-action adapter, parameter-efficient adaptation, 3D structure inference, medical imaging AI

The pith

A lightweight Vision-Action Adapter injects patient-specific 3D heart structure understanding into pre-trained ultrasound foundation models for echocardiography probe guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the operational difficulty of echocardiography, where probe placement must account for large differences in how each patient's heart appears in 2D images and how the underlying 3D anatomy is shaped. It starts from an ultrasound foundation model already trained on vast data to supply reliable 2D image features, then adds a small Vision-Action Adapter that processes sequences of past images together with probe movements. This lets the system infer the current patient's 3D layout on the fly and suggest the next probe adjustment, without retraining the full foundation model or supplying explicit 3D labels. Experiments across more than 1.31 million samples show the resulting system beats prior probe-guidance approaches while training roughly 33 times fewer parameters.

Core claim

Embedding the VA-Adapter inside the image encoder of an ultrasound foundation model enables the model to infer cardiac anatomy from historical vision-action sequences, thereby supplying the missing patient-specific 3D structure understanding needed for accurate probe guidance without explicit 3D supervision or full-model retraining.

What carries the argument

Vision-Action Adapter (VA-Adapter), a lightweight module inserted into the foundation model's image encoder that processes sequences of 2D images and corresponding probe actions to build patient-specific 3D navigation capability.
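
As a rough illustration of the idea described above, the following Python sketch shows one way a residual bottleneck adapter could condition the current frame's feature on a history of (image feature, probe action) pairs. It is not the authors' architecture. The class name `VisionActionAdapterSketch`, the additive fusion of action and image features, the single-query attention, the zero-initialised up-projection, and all dimensions are assumptions made purely for illustration.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class VisionActionAdapterSketch:
    """Hypothetical adapter: current feature attends over vision-action history."""

    def __init__(self, dim, action_dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Projects a probe action into the image-feature space (assumed fusion).
        self.W_action = random_matrix(dim, action_dim, rng)
        # Bottleneck down-projection.
        self.W_down = random_matrix(bottleneck, dim, rng)
        # Zero-initialised up-projection: the adapter starts as an identity map.
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def forward(self, current_feat, history):
        """history: list of (past_image_feature, past_action) pairs."""
        if not history:
            return list(current_feat)
        # Fuse each past frame feature with the action taken at that step.
        fused = []
        for feat, action in history:
            projected = matvec(self.W_action, action)
            fused.append([f + p for f, p in zip(feat, projected)])
        # The current feature acts as a query over the fused history.
        scale = math.sqrt(self.dim)
        scores = [sum(q * k for q, k in zip(current_feat, key)) / scale for key in fused]
        weights = softmax(scores)
        context = [sum(w * key[i] for w, key in zip(weights, fused)) for i in range(self.dim)]
        # Bottleneck with ReLU, then a residual update of the frozen feature.
        hidden = [max(0.0, h) for h in matvec(self.W_down, context)]
        delta = matvec(self.W_up, hidden)
        return [x + d for x, d in zip(current_feat, delta)]


# Usage with made-up 4-dimensional features and 6-dimensional actions.
adapter = VisionActionAdapterSketch(dim=4, action_dim=6, bottleneck=2)
current = [0.1, 0.2, 0.3, 0.4]
past = [([0.0] * 4, [1.0] * 6), ([0.5] * 4, [0.0] * 6)]
updated = adapter.forward(current, past)
```

Zero-initialising the up-projection means the adapter initially leaves the frozen encoder's feature unchanged, a choice also used in adapter-style methods such as LoRA. Whether the paper does the same is not stated in the text above.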

If this is right

  • Probe guidance can be achieved by freezing most of a large foundation model and training only a small adapter.
  • Adaptation to new patients becomes feasible with far lower compute and data requirements than full retraining.
  • The same adapter pattern supplies a route for adding 3D context to other 2D foundation models used in medical imaging.
  • Real-time probe suggestions become practical because the adapter runs on top of an already-trained encoder.
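
To make the "train only a small adapter" point concrete, here is a minimal parameter-counting sketch. The hidden size, bottleneck size, and number of adapted layers are hypothetical values chosen for illustration and are not taken from the paper. The plain two-layer bottleneck is likewise an assumed stand-in for whatever the VA-Adapter actually contains.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def bottleneck_adapter_params(hidden_dim, bottleneck_dim):
    """Down-projection plus up-projection of a generic bottleneck adapter."""
    down = linear_params(hidden_dim, bottleneck_dim)
    up = linear_params(bottleneck_dim, hidden_dim)
    return down + up


def trainable_ratio(baseline_trainable, adapter_trainable):
    """How many times fewer parameters the adapter setup trains."""
    return baseline_trainable / adapter_trainable


# Hypothetical configuration, NOT the paper's numbers.
HIDDEN_DIM = 768
BOTTLENECK_DIM = 64
NUM_ADAPTED_LAYERS = 4

per_layer = bottleneck_adapter_params(HIDDEN_DIM, BOTTLENECK_DIM)
total_adapter = per_layer * NUM_ADAPTED_LAYERS
print(f"adapter parameters per layer: {per_layer}")
print(f"adapter parameters in total:  {total_adapter}")
```

Reporting absolute counts like these alongside the ratio, as the referee's first minor comment asks, would let readers check a claim such as "approximately 33 times fewer" with `trainable_ratio`.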

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight-adapter pattern could be tried on foundation models for other ultrasound procedures that also need 3D spatial reasoning.
  • If the adapter can be made still smaller, the whole guidance system might run on portable or bedside ultrasound hardware.
  • The use of historical sequences suggests the method could support continual adaptation within a single patient exam.

Load-bearing premise

The pre-trained ultrasound foundation model already holds sufficiently robust 2D image representations that a small adapter can add patient-specific 3D understanding without any direct 3D training data or full retraining.

What would settle it

A controlled test on a new set of echocardiography scans from patients with atypical heart geometries in which the VA-Adapter version shows no gain in probe-placement accuracy or navigation success rate over the unmodified foundation model.

Figures

Figures reproduced from arXiv: 2510.06809 by Gao Huang, Haojun Jiang, Shiji Song, Teng Wang, Yujiao Deng, Yuxuan Wang, Zhenguo Sun.

Figure 1: Illustration of the dataset. (a) Large-scale diagnostic foundation model dataset. (b) Our dataset statistic. (c) Standard …
Figure 2: Illustration of the architecture of the VA-Adapter. The left side shows that we insert VA-Adapter into the deep layers …
Figure 3: Performance comparison of different PEFT methods on USFM and BiomedCLIP.
Figure 4: Ablation study on vision-action interaction module of the EchoCLIP model.
Figure 6: Based on the action predicted by the model, we …
Figure 5: Ablation study on adapter dimension of the EchoCLIP model.
Figure 6: Visualization of the model’s prediction.

Original abstract

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicates image feature understanding, and differences in individual three-dimensional (3D) structures, which poses challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we meticulously design a Vision-Action Adapter (VA-Adapter) to online inject the capability of understanding individual 3D structures. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VA-Adapter, a lightweight vision-action adapter embedded into a frozen ultrasound foundation model's image encoder. The adapter enables online inference of patient-specific 3D cardiac anatomy from historical 2D vision-action sequences, addressing individual variability in echocardiography probe guidance. Extensive experiments on a dataset of over 1.31 million samples claim that VA-Adapter outperforms strong probe guidance baselines while training approximately 33 times fewer parameters.

Significance. If the performance gains and parameter efficiency hold under rigorous controls, the work would offer a practical route to deploy foundation models for probe guidance without full retraining, potentially lowering barriers to high-quality echocardiography. The design choice to mimic sonographer cognition via sequential vision-action modeling is conceptually appealing and could generalize to other ultrasound navigation tasks. The reported scale of the dataset is a clear strength.

major comments (2)
  1. [Method and Experiments] The central claim that VA-Adapter specifically solves the 3D structural variability problem rests on the assumption that historical vision-action sequences supply sufficient geometric signal for 3D inference. However, the manuscript provides no 3D reconstruction loss, explicit 3D labels, or direct probe-position error metric that would distinguish true 3D anatomy understanding from improved 2D feature calibration alone.
  2. [Experiments] Table reporting main results (presumably Table 1 or equivalent in §4): while outperformance and the 33× parameter reduction are stated, the text supplies no details on baseline implementations, statistical significance testing, or controls for inter-patient variability, leaving the empirical superiority only partially verifiable.
minor comments (2)
  1. [Abstract] The abstract states 'approximately 33 times fewer trained parameters' without giving the absolute parameter counts for VA-Adapter versus the strongest baseline; adding these numbers would improve precision.
  2. [Method] Notation for the vision-action sequence input and how actions are encoded alongside image features could be clarified with a diagram or explicit equations in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.

  1. Referee: [Method and Experiments] The central claim that VA-Adapter specifically solves the 3D structural variability problem rests on the assumption that historical vision-action sequences supply sufficient geometric signal for 3D inference. However, the manuscript provides no 3D reconstruction loss, explicit 3D labels, or direct probe-position error metric that would distinguish true 3D anatomy understanding from improved 2D feature calibration alone.

    Authors: We thank the referee for this precise observation. The training data consists of 2D ultrasound frames paired with probe actions and contains no explicit 3D labels or reconstruction supervision, so no 3D reconstruction loss is used. The VA-Adapter instead learns an implicit representation of patient-specific 3D cardiac anatomy by conditioning the frozen foundation-model encoder on historical vision-action sequences, enabling the model to predict probe movements that account for individual geometry. This design choice mirrors how sonographers acquire 3D understanding through sequential observation and action rather than explicit volumetric reconstruction. Probe-guidance performance (success rate, image-quality scores) serves as the downstream metric that validates the utility of this implicit 3D modeling. We will add a dedicated paragraph to the revised Methods and Discussion sections clarifying the distinction between implicit and explicit 3D modeling, and we will report probe-position statistics if the dataset contains them. revision: partial

  2. Referee: [Experiments] Table reporting main results (presumably Table 1 or equivalent in §4): while outperformance and the 33× parameter reduction are stated, the text supplies no details on baseline implementations, statistical significance testing, or controls for inter-patient variability, leaving the empirical superiority only partially verifiable.

    Authors: We agree that these details are essential for rigorous verification. In the revised manuscript we will: (i) expand the experimental setup subsection with complete baseline implementation details and hyper-parameter choices, (ii) add statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing VA-Adapter against each baseline, and (iii) explicitly describe inter-patient controls, including patient-wise data splits that ensure no patient overlap between training and test sets together with patient-stratified performance metrics. These additions will be incorporated into the main results table and the accompanying text. revision: yes
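
The patient-wise split mentioned in item (iii) could look roughly like the sketch below. It groups samples by patient before splitting so that no patient contributes to both sets. The function name `patient_wise_split`, the `(patient_id, sample)` tuple layout, and the default test fraction are assumptions for illustration and say nothing about how the paper's dataset is actually organised.

```python
import random
from collections import defaultdict


def patient_wise_split(samples, test_fraction=0.2, seed=0):
    """Split (patient_id, sample) pairs so train and test share no patient."""
    by_patient = defaultdict(list)
    for patient_id, sample in samples:
        by_patient[patient_id].append(sample)

    # Shuffle patients, not samples, so every patient lands in exactly one set.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = patient_ids[:n_test]
    train_patients = patient_ids[n_test:]

    train = [(p, s) for p in train_patients for s in by_patient[p]]
    test = [(p, s) for p in test_patients for s in by_patient[p]]
    return train, test


# Usage with made-up data: 20 samples spread over 5 hypothetical patients.
samples = [(f"p{i % 5}", i) for i in range(20)]
train, test = patient_wise_split(samples)
train_ids = {p for p, _ in train}
test_ids = {p for p, _ in test}
```

Splitting at the sample level instead would let frames from one patient appear on both sides, which would inflate test scores on a task where inter-patient variability is the stated difficulty.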

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on dataset comparisons

The paper proposes a VA-Adapter module to adapt a frozen ultrasound foundation model for probe guidance by injecting 3D structure understanding from vision-action sequences. All load-bearing claims (outperformance on 1.31M samples, 33x fewer parameters) are justified solely by empirical experiments against baselines. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. The derivation chain is absent; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the existence of a capable pre-trained ultrasound foundation model and introduces one new architectural component whose value is shown through empirical results rather than theoretical derivation.

axioms (1)
  • domain assumption: Ultrasound foundation models learn robust image representations from vast datasets that transfer to probe guidance tasks.
    Invoked in the abstract as the basis for leveraging these models before adding the adapter.
invented entities (1)
  • VA-Adapter (no independent evidence)
    purpose: To inject online understanding of individual 3D cardiac structures by processing vision-action sequences inside the foundation model's image encoder.
    New module proposed and embedded in this work; no independent evidence outside the paper's experiments is provided.
