HyperVLP: Enhancing Hierarchical Surgical Video-Language Pre-training in Hyperbolic Space

Haochao Ying; Jian Wu; Kun Yuan; Nassir Navab; Nicolas Padoy; Yaojun Hu

arxiv: 2606.31245 · v1 · pith:XM43CDK3new · submitted 2026-06-30 · 💻 cs.CV

HyperVLP: Enhancing Hierarchical Surgical Video-Language Pre-training in Hyperbolic Space

Yaojun Hu , Kun Yuan , Nassir Navab , Haochao Ying , Jian Wu , Nicolas Padoy This is my paper

Pith reviewed 2026-07-01 05:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords hyperbolic embeddingssurgical video-language pre-traininghierarchical structurephase recognitionzero-shot learningfew-shot learningcomputer visionmedical imaging

0 comments

The pith

Hyperbolic embeddings for surgical video and language keep procedure hierarchies intact and improve zero-shot phase recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Surgical materials encode layered knowledge where fine actions sit inside mid-level steps that compose global phases. Most video-language models collapse these signals into a single flat space and treat items that share only procedural context as unrelated. The paper instead pre-trains in hyperbolic space, adds losses that stop procedural context from creating false negatives, and adds a consistency term that aligns parent-phase descriptions with their child steps. The resulting representations produce measurable lifts in zero- and few-shot phase recognition on benchmarks that span different procedures and hospitals.

Core claim

We propose HyperVLP, a hyperbolic surgical video-language pre-training framework that explicitly preserves the hierarchical structure by mitigating structural false negatives induced by procedural context and enforcing semantic consistency between parent phases and their constituent child steps. Extensive experiments on multiple surgical benchmarks show consistent gains in zero- and few-shot phase recognition across procedures and institutions.

What carries the argument

Hyperbolic video-language pre-training that mitigates structural false negatives from shared procedural context and enforces semantic consistency between phases and steps.

If this is right

Gains appear in zero-shot phase recognition on multiple surgical benchmarks.
Improvements hold when the test videos come from different procedures and institutions.
Long-range dependency modeling improves because cross-level containment is no longer collapsed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchy-preserving losses could be tested on instructional video domains outside surgery, such as cooking or assembly tasks.
If the gains vanish once the language hierarchy is removed from the training data, that would isolate the contribution of the geometric choice.
The framework suggests that flat embedding spaces may systematically under-model containment relations in any domain with natural parent-child structure.

Load-bearing premise

The multi-level containment relations in surgical narration, headings, and abstracts can be faithfully turned into geometric containment in hyperbolic space by the two proposed consistency mechanisms without creating offsetting mismatches.

What would settle it

An ablation that trains the identical architecture and losses in Euclidean space instead of hyperbolic space and finds no accuracy drop on zero-shot phase recognition across the same benchmarks.

Figures

Figures reproduced from arXiv: 2606.31245 by Haochao Ying, Jian Wu, Kun Yuan, Nassir Navab, Nicolas Padoy, Yaojun Hu.

**Figure 2.** Figure 2: Distribution of hyperbolic distances to the origin for video and text rep [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Surgical vision-language foundation models typically adopt educational materials, such as surgical lecture videos, to transfer surgical knowledge encoded in language into visual representations. These knowledge are multi-dimensional and hierarchical: fine-grained action cues appear in narration, mid-level key steps are summarized in subsection headings, and global procedural context, such as patient history and surgical strategy, is described in abstract texts. Prior work largely collapses these heterogeneous signals into a single flat embedding space, implicitly assuming independence across hierarchy levels. However, this is suboptimal because it ignores cross-level semantic containment, e.g., actions belong to steps, steps compose phases, weakens long-range dependency modeling. To this end, we propose a hyperbolic surgical video-language pre-training framework that explicitly preserves the hierarchical structure by mitigating structural false negatives induced by procedural context and enforcing semantic consistency between parent phases and their constituent child steps. Extensive experiments on multiple surgical benchmarks show consistent gains in zero- and few-shot phase recognition across procedures and institutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hyperbolic embeddings for hierarchy in surgical VLP is a targeted fix worth checking, but the abstract supplies no equations or numbers so the gains cannot be assessed yet.

read the letter

The main takeaway is that this paper moves surgical video-language pretraining into hyperbolic space to keep the containment relations between phases, steps, and actions intact instead of flattening everything into one Euclidean space. They add two mechanisms, one to cut structural false negatives from procedural context and one to enforce parent-child consistency.

What is new is the explicit application of hyperbolic geometry to this surgical hierarchy problem, paired with those consistency terms. Prior surgical VLP work treated the levels as independent, so the geometric choice directly targets that gap.

The paper does a clean job spelling out why flat embeddings lose long-range structure when narration, headings, and abstracts sit at different granularities. Hyperbolic space is a standard tool for hierarchies, so the move is logical for this data.

The soft spots are straightforward. The abstract states the claim and says experiments show gains, but it contains no loss functions, no embedding construction details, no dataset sizes, and no tables. Without those pieces the size of the improvement and whether the mechanisms actually deliver it stay unknown. The assumption that the two consistency terms will improve results without creating new mismatches is plausible but untested in the visible text.

This is for people already working on video-language models in surgery or other domains with clear procedural hierarchies. A reader focused on geometric embeddings for multimodal data could extract something useful if the full methods and results hold up.

It deserves a serious referee because the idea is narrow, addresses a stated limitation, and uses a known geometric property in a relevant setting. Send it to review.

Referee Report

1 major / 0 minor

Summary. The paper proposes HyperVLP, a hyperbolic surgical video-language pre-training framework that explicitly preserves hierarchical structure in surgical videos by mitigating structural false negatives induced by procedural context and enforcing semantic consistency between parent phases and their constituent child steps. It reports consistent gains in zero- and few-shot phase recognition across multiple surgical benchmarks, procedures, and institutions.

Significance. If the central mechanisms can be shown to deliver measurable improvements without introducing new mismatches, the work would provide a concrete demonstration of hyperbolic geometry's utility for modeling multi-level semantic containment (actions within steps within phases) in procedural video-language models, addressing a limitation of flat Euclidean embeddings.

major comments (1)

[Abstract] Abstract: The abstract states the claim and reports gains but supplies no equations, training details, dataset descriptions, or quantitative tables; therefore the data and derivations that would support the claim cannot be examined.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. We address the single major comment below, noting that the abstract follows standard conventions for conciseness while the full manuscript supplies all requested details.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states the claim and reports gains but supplies no equations, training details, dataset descriptions, or quantitative tables; therefore the data and derivations that would support the claim cannot be examined.

Authors: Abstracts are intentionally limited to high-level claims and results to respect length constraints (typically 150-250 words). All supporting elements are provided in the manuscript: the hyperbolic loss and hierarchy-preserving objectives appear in Section 3 (Equations 1-4), training procedure and hyperparameters in Section 4, dataset descriptions and splits in Section 5, and quantitative tables (zero-shot and few-shot results across benchmarks) in Section 6 (Tables 1-4). The abstract's reported gains are therefore directly traceable to these sections. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and provided context describe a proposed hyperbolic video-language pre-training framework that mitigates structural false negatives and enforces parent-child semantic consistency to preserve hierarchy. No equations, loss functions, derivations, fitted parameters, or self-citations appear in the text that could be inspected for reductions by construction. The central claim introduces new mechanisms without any quoted step that equates a prediction or result to its own inputs, a fitted quantity renamed as output, or a load-bearing self-citation chain. This matches the reader's assessment of no detectable circularity in the derivation chain, making the finding self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the entire assessment rests on the high-level textual description alone.

pith-pipeline@v0.9.1-grok · 5706 in / 1178 out tokens · 41948 ms · 2026-07-01T05:56:14.891248+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 7 canonical work pages · 3 internal anchors

[1]

Flavors of geometry31(59-115), 2 (1997)

Cannon, J.W., Floyd, W.J., Kenyon, R., Parry, W.R., et al.: Hyperbolic geometry. Flavors of geometry31(59-115), 2 (1997)

1997
[2]

In: Proceedings of the 40th International Conference on Machine Learning

Desai, K., Nickel, M., Rajpurohit, T., Johnson, J., Vedantam, R.: Hyperbolic image-text representations. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)

2023
[3]

In: International conference on machine learning

Ganea, O., Bécigneul, G., Hofmann, T.: Hyperbolic entailment cones for learn- ing hierarchical embeddings. In: International conference on machine learning. pp. 1646–1655. PMLR (2018)

2018
[4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

2020
[5]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

2016
[6]

In: International Conference on Medical Image Computing and Computer- Assisted Intervention

He, Y., Zhu, Y., Fu, P., Yang, R., Chen, T., Wang, Z., Li, Q., Zhou, P., Yang, X., Wang, S.: Endo-clip: Progressive self-supervised pre-training on raw colonoscopy records. In: International Conference on Medical Image Computing and Computer- Assisted Intervention. pp. 106–116. Springer (2025)

2025
[7]

ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

Huang, K., Altosaar, J., Ranganath, R.: Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904
[8]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

International journal of computer assisted radiology and surgery pp

Lavanchy, J.L., Ramesh, S., Dall’Alba, D., Gonzalez, C., Fiorini, P., Müller-Stich, B.P., Nett, P.C., Marescaux, J., Mutter, D., Padoy, N.: Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery. International journal of computer assisted radiology and surgery pp. 1–9 (2024) 10 Authors Suppressed Due to ...

2024
[10]

In: Korhonen, A., Traum, D., Màrquez, L

Le, M., Roller, S., Papaxanthos, L., Kiela, D., Nickel, M.: Inferring concept hier- archies from text corpora via hyperbolic embeddings. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Associa- tion for Computational Linguistics. pp. 3231–3241. Association for Computational Linguistics, Florence, Italy (Jul 20...

work page doi:10.18653/v1/p19-1313 2019
[11]

In: International confer- ence on machine learning

Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International confer- ence on machine learning. pp. 12888–12900. PMLR (2022)

2022
[12]

In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R

Lin,W.,Zhao,Z.,Zhang,X.,Wu,C.,Zhang,Y.,Wang,Y.,Xie,W.:Pmc-clip:Con- trastive language-image pre-training using biomedical documents. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. pp. 525–536. Springer Nature Switzerla...

2023
[13]

arXiv preprint arXiv:2501.07468 (2025)

Liu, Y., Cao, X., Chen, T., Jiang, Y., You, J., Wu, M., Wang, X., Feng, M., Jin, Y., Chen, J.: A survey of embodied ai in healthcare: Techniques, applications, and opportunities. arXiv preprint arXiv:2501.07468 (2025)

work page arXiv 2025
[14]

Maier-Hein, L., Eisenmann, M., Sarikaya, D., März, K., Collins, T., Malpani, A., Fallert, J., Feussner, H., Giannarou, S., Mascagni, P., et al.: Surgical data science– fromconceptstowardclinicaltranslation.Medicalimageanalysis76,102306(2022)

2022
[15]

In: Proceedings of the IEEE/CVF international conference on computer vision

Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: Howto100m: Learning a text-video embedding by watching hundred million nar- rated video clips. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2630–2640 (2019)

2019
[16]

Representation Learning with Contrastive Predictive Coding

Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predic- tive coding. arXiv preprint arXiv:1807.03748 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

Compositional entailment learning for hyperbolic vision-language models.arXiv preprint arXiv:2410.06912,

Pal, A., Van Spengler, M., di Melendugno, G.M.D., Flaborea, A., Galasso, F., Mettes, P.: Compositional entailment learning for hyperbolic vision-language mod- els. arXiv preprint arXiv:2410.06912 (2024)

work page arXiv 2024
[18]

103982 (2026)

Perez, A., Nwoye, C., Kermani, R.R., Mohareri, O., Jamal, M.A.: Surglavi: Large- scalehierarchicaldatasetforsurgicalvision–languagerepresentationlearning.Med- ical Image Analysis p. 103982 (2026)

2026
[19]

In: proceed- ings of Medical Image Computing and Computer Assisted Intervention – MICCAI

Pérez, A., Rodríguez, S., Ayobi, N., Aparicio, N., Dessevres, E., Arbeláez, P.: MuST: Multi-Scale Transformers for Surgical Phase Recognition . In: proceed- ings of Medical Image Computing and Computer Assisted Intervention – MICCAI
[20]

LNCS 15006

vol. LNCS 15006. Springer Nature Switzerland (October 2024)

2024
[21]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

2021
[22]

arXiv preprint arXiv:2403.05949 (2024)

Schmidgall, S., Kim, J.W., Jopling, J., Krieger, A.: General surgery vision trans- former: A video pre-trained foundation model for general surgery. arXiv preprint arXiv:2403.05949 (2024)

work page arXiv 2024
[23]

IEEE transactions on medical imaging36(1), 86–97 (2016)

Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging36(1), 86–97 (2016)

2016
[24]

Wang, Z., Liu, C., Zhang, S., Dou, Q.: Foundation model for endoscopy video anal- ysisvialarge-scaleself-supervisedpre-train.In:Internationalconferenceonmedical image computing and computer-assisted intervention. pp. 101–111. Springer (2023) Title Suppressed Due to Excessive Length 11

2023
[25]

In: Proceedings of the 2022 Conference on Empir- ical Methods in Natural Language Processing

Wang, Z., Wu, Z., Agarwal, D., Sun, J.: Medclip: Contrastive learning from un- paired medical images and text. In: Proceedings of the 2022 Conference on Empir- ical Methods in Natural Language Processing. pp. 3876–3887 (2022)

2022
[26]

In: International Conference on Medical Image Com- puting and Computer-Assisted Intervention

Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y.: Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: International Conference on Medical Image Com- puting and Computer-Assisted Intervention. pp. 486–496. Springer (2022)

2022
[27]

In: proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024

Yang, S., Luo, L., Wang, Q., Chen, H.: Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition . In: proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. vol. LNCS 15006. Springer Nature Switzerland (October 2024)

2024
[28]

Advances in Neural Infor- mation Processing Systems37, 122952–122983 (2024)

Yuan, K., Navab, N., Padoy, N., et al.: Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. Advances in Neural Infor- mation Processing Systems37, 122952–122983 (2024)

2024
[29]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Yuan, K., Srivastav, V., Navab, N., Padoy, N.: Hecvl: Hierarchical video-language pretraining for zero-shot surgical phase recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 306–316. Springer (2024)

2024
[30]

Medical Image Analysis105, 103644 (2025)

Yuan, K., Srivastav, V., Yu, T., Lavanchy, J.L., Marescaux, J., Mascagni, P., Navab, N., Padoy, N.: Learning multi-modal representations by watching hundreds of surgical video lectures. Medical Image Analysis105, 103644 (2025)

2025

[1] [1]

Flavors of geometry31(59-115), 2 (1997)

Cannon, J.W., Floyd, W.J., Kenyon, R., Parry, W.R., et al.: Hyperbolic geometry. Flavors of geometry31(59-115), 2 (1997)

1997

[2] [2]

In: Proceedings of the 40th International Conference on Machine Learning

Desai, K., Nickel, M., Rajpurohit, T., Johnson, J., Vedantam, R.: Hyperbolic image-text representations. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)

2023

[3] [3]

In: International conference on machine learning

Ganea, O., Bécigneul, G., Hofmann, T.: Hyperbolic entailment cones for learn- ing hierarchical embeddings. In: International conference on machine learning. pp. 1646–1655. PMLR (2018)

2018

[4] [4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

2020

[5] [5]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

2016

[6] [6]

In: International Conference on Medical Image Computing and Computer- Assisted Intervention

He, Y., Zhu, Y., Fu, P., Yang, R., Chen, T., Wang, Z., Li, Q., Zhou, P., Yang, X., Wang, S.: Endo-clip: Progressive self-supervised pre-training on raw colonoscopy records. In: International Conference on Medical Image Computing and Computer- Assisted Intervention. pp. 106–116. Springer (2025)

2025

[7] [7]

ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

Huang, K., Altosaar, J., Ranganath, R.: Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904

[8] [8]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

International journal of computer assisted radiology and surgery pp

Lavanchy, J.L., Ramesh, S., Dall’Alba, D., Gonzalez, C., Fiorini, P., Müller-Stich, B.P., Nett, P.C., Marescaux, J., Mutter, D., Padoy, N.: Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery. International journal of computer assisted radiology and surgery pp. 1–9 (2024) 10 Authors Suppressed Due to ...

2024

[10] [10]

In: Korhonen, A., Traum, D., Màrquez, L

Le, M., Roller, S., Papaxanthos, L., Kiela, D., Nickel, M.: Inferring concept hier- archies from text corpora via hyperbolic embeddings. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Associa- tion for Computational Linguistics. pp. 3231–3241. Association for Computational Linguistics, Florence, Italy (Jul 20...

work page doi:10.18653/v1/p19-1313 2019

[11] [11]

In: International confer- ence on machine learning

Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International confer- ence on machine learning. pp. 12888–12900. PMLR (2022)

2022

[12] [12]

In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R

Lin,W.,Zhao,Z.,Zhang,X.,Wu,C.,Zhang,Y.,Wang,Y.,Xie,W.:Pmc-clip:Con- trastive language-image pre-training using biomedical documents. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. pp. 525–536. Springer Nature Switzerla...

2023

[13] [13]

arXiv preprint arXiv:2501.07468 (2025)

Liu, Y., Cao, X., Chen, T., Jiang, Y., You, J., Wu, M., Wang, X., Feng, M., Jin, Y., Chen, J.: A survey of embodied ai in healthcare: Techniques, applications, and opportunities. arXiv preprint arXiv:2501.07468 (2025)

work page arXiv 2025

[14] [14]

Maier-Hein, L., Eisenmann, M., Sarikaya, D., März, K., Collins, T., Malpani, A., Fallert, J., Feussner, H., Giannarou, S., Mascagni, P., et al.: Surgical data science– fromconceptstowardclinicaltranslation.Medicalimageanalysis76,102306(2022)

2022

[15] [15]

In: Proceedings of the IEEE/CVF international conference on computer vision

Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: Howto100m: Learning a text-video embedding by watching hundred million nar- rated video clips. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2630–2640 (2019)

2019

[16] [16]

Representation Learning with Contrastive Predictive Coding

Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predic- tive coding. arXiv preprint arXiv:1807.03748 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[17] [17]

Compositional entailment learning for hyperbolic vision-language models.arXiv preprint arXiv:2410.06912,

Pal, A., Van Spengler, M., di Melendugno, G.M.D., Flaborea, A., Galasso, F., Mettes, P.: Compositional entailment learning for hyperbolic vision-language mod- els. arXiv preprint arXiv:2410.06912 (2024)

work page arXiv 2024

[18] [18]

103982 (2026)

Perez, A., Nwoye, C., Kermani, R.R., Mohareri, O., Jamal, M.A.: Surglavi: Large- scalehierarchicaldatasetforsurgicalvision–languagerepresentationlearning.Med- ical Image Analysis p. 103982 (2026)

2026

[19] [19]

In: proceed- ings of Medical Image Computing and Computer Assisted Intervention – MICCAI

Pérez, A., Rodríguez, S., Ayobi, N., Aparicio, N., Dessevres, E., Arbeláez, P.: MuST: Multi-Scale Transformers for Surgical Phase Recognition . In: proceed- ings of Medical Image Computing and Computer Assisted Intervention – MICCAI

[20] [20]

LNCS 15006

vol. LNCS 15006. Springer Nature Switzerland (October 2024)

2024

[21] [21]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

2021

[22] [22]

arXiv preprint arXiv:2403.05949 (2024)

Schmidgall, S., Kim, J.W., Jopling, J., Krieger, A.: General surgery vision trans- former: A video pre-trained foundation model for general surgery. arXiv preprint arXiv:2403.05949 (2024)

work page arXiv 2024

[23] [23]

IEEE transactions on medical imaging36(1), 86–97 (2016)

Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging36(1), 86–97 (2016)

2016

[24] [24]

Wang, Z., Liu, C., Zhang, S., Dou, Q.: Foundation model for endoscopy video anal- ysisvialarge-scaleself-supervisedpre-train.In:Internationalconferenceonmedical image computing and computer-assisted intervention. pp. 101–111. Springer (2023) Title Suppressed Due to Excessive Length 11

2023

[25] [25]

In: Proceedings of the 2022 Conference on Empir- ical Methods in Natural Language Processing

Wang, Z., Wu, Z., Agarwal, D., Sun, J.: Medclip: Contrastive learning from un- paired medical images and text. In: Proceedings of the 2022 Conference on Empir- ical Methods in Natural Language Processing. pp. 3876–3887 (2022)

2022

[26] [26]

In: International Conference on Medical Image Com- puting and Computer-Assisted Intervention

Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y.: Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: International Conference on Medical Image Com- puting and Computer-Assisted Intervention. pp. 486–496. Springer (2022)

2022

[27] [27]

In: proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024

Yang, S., Luo, L., Wang, Q., Chen, H.: Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition . In: proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. vol. LNCS 15006. Springer Nature Switzerland (October 2024)

2024

[28] [28]

Advances in Neural Infor- mation Processing Systems37, 122952–122983 (2024)

Yuan, K., Navab, N., Padoy, N., et al.: Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. Advances in Neural Infor- mation Processing Systems37, 122952–122983 (2024)

2024

[29] [29]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Yuan, K., Srivastav, V., Navab, N., Padoy, N.: Hecvl: Hierarchical video-language pretraining for zero-shot surgical phase recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 306–316. Springer (2024)

2024

[30] [30]

Medical Image Analysis105, 103644 (2025)

Yuan, K., Srivastav, V., Yu, T., Lavanchy, J.L., Marescaux, J., Mascagni, P., Navab, N., Padoy, N.: Learning multi-modal representations by watching hundreds of surgical video lectures. Medical Image Analysis105, 103644 (2025)

2025