Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

Hao Ding; Mathias Unberath; Yiqing Shen

arXiv: 2606.17298 · v1 · pith:Y3WGGDHTnew · submitted 2026-06-15 · 💻 cs.CV

Pith reviewed 2026-06-27 03:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords: text-to-video retrieval · operating room · action-driven digital twins · implicit queries · temporal reasoning · robotic surgery video · LLM-based retrieval

The pith

Action-driven digital twins let text-to-video retrieval handle implicit queries in operating room clips by reasoning over temporal action sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OR3, a retrieval method that turns operating room video clips into structured action-driven digital twins to support queries that require step-by-step reasoning rather than direct visual matching. It groups subject-action-object triplets into non-overlapping time intervals, uses an LLM to imagine matching twins from the query text, matches them inside one encoder with tailored negatives, and refines the imagined twins against real candidates. Tested on 276 implicit queries spanning four reasoning types across 386 clips from knee procedures, the approach reaches 57.6 R@1 and 77.3 R@5 while beating prior methods. This matters because safety-critical events in the OR often lack standard structure, so retrieval must distinguish visually similar clips by their action order and context.

Core claim

OR3 converts clips into action-driven digital twins by grouping concurrent subject-action-object triplets under non-overlapping temporal intervals, generates hypothetical ActDTs from text queries via LLM, performs intra-modal matching with a single encoder trained on ActDT-specific hard negatives, and applies evidence-grounded refinement that revises the imagined twins based on discrepancies with top video candidates, yielding 57.6 R@1 and 77.3 R@5 on a benchmark of 276 implicit queries over 386 robotic knee procedure clips.

What carries the argument

action-driven digital twins (ActDTs), formed by grouping concurrent subject-action-object triplets under non-overlapping temporal intervals, which carry the temporal structure needed for LLM imagination and intra-modal matching.
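
The interval grouping described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the `Triplet` and `Interval` classes, the `build_actdt` function, and the rule of cutting the timeline at every triplet boundary are all hypothetical choices made here to show how concurrent triplets could end up under non-overlapping intervals.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One detected subject-action-object event (hypothetical schema)."""
    subject: str
    action: str
    obj: str
    start: float  # seconds from clip start
    end: float    # seconds from clip start, exclusive


@dataclass
class Interval:
    """A time span together with every triplet active throughout it."""
    start: float
    end: float
    triplets: List[Tuple[str, str, str]]


def build_actdt(triplets: List[Triplet]) -> List[Interval]:
    """Cut the timeline at every triplet boundary, then list active triplets per piece.

    Assumption: this boundary-splitting rule is one simple way to obtain
    non-overlapping intervals; the paper may use a different rule.
    """
    cuts = sorted({t.start for t in triplets} | {t.end for t in triplets})
    intervals: List[Interval] = []
    for lo, hi in zip(cuts, cuts[1:]):
        active = sorted(
            (t.subject, t.action, t.obj)
            for t in triplets
            if t.start < hi and t.end > lo  # triplet overlaps [lo, hi)
        )
        if active:  # skip gaps where nothing happens
            intervals.append(Interval(lo, hi, active))
    return intervals


demo = build_actdt([
    Triplet("nurse", "hands", "drill", 0.0, 4.0),
    Triplet("surgeon", "drills", "femur", 2.0, 6.0),
])
for iv in demo:
    print(f"[{iv.start:.1f}, {iv.end:.1f}) {iv.triplets}")
```

In this toy input the two events overlap between 2 s and 4 s, so the sketch yields three intervals, with the middle one holding both triplets. Interval order is kept in the output list, which is the property the retrieval step would need for queries about sequence.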

If this is right

Retrieval becomes possible for safety-critical events that deviate from typical procedure flow.
Fine-grained discrimination is achieved between clips that look similar but differ in action sequence or timing.
A single encoder suffices for matching without separate text and video towers.
Procedure-specific patterns can be captured through iterative refinement against real video evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same interval-based triplet representation could be applied to other procedural domains such as industrial assembly or sports coaching.
Real-time ActDT extraction during live procedures might support predictive alerts before an unsafe step occurs.
Iterative refinement opens a path to conversational retrieval where a user can correct or extend the imagined twin.

Load-bearing premise

Grouping concurrent subject-action-object triplets into non-overlapping temporal intervals to form ActDTs, combined with LLM-generated hypothetical ActDTs and evidence-grounded refinement, sufficiently captures the reasoning needed to identify the correct clip for implicit queries.

What would settle it

A controlled test set of query-clip pairs where two clips differ only in the temporal ordering of two overlapping actions; if OR3 retrieval accuracy drops to baseline levels on these pairs, the ActDT interval grouping does not capture the required reasoning.
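
A simplified version of such an order-swapped pair can be built mechanically. The sketch below is an assumption-laden illustration: it swaps two adjacent intervals rather than two overlapping actions, and the `serialize`, `bag`, and `swap_adjacent` helpers and the example steps are hypothetical. It shows why an order-blind comparison cannot separate the pair while an order-preserving text form can.

```python
from collections import Counter
from typing import List, Tuple

Step = Tuple[str, str, str]  # (subject, action, object), hypothetical schema


def serialize(intervals: List[List[Step]]) -> str:
    """Render intervals as ordered text, one line per interval."""
    lines = []
    for i, group in enumerate(intervals):
        body = "; ".join(f"{s} {a} {o}" for s, a, o in sorted(group))
        lines.append(f"[t{i}] {body}")
    return "\n".join(lines)


def bag(intervals: List[List[Step]]) -> Counter:
    """Order-blind summary: how often each triplet occurs anywhere in the clip."""
    return Counter(t for group in intervals for t in group)


def swap_adjacent(intervals: List[List[Step]], i: int) -> List[List[Step]]:
    """Return a copy with intervals i and i+1 exchanged (an order-only negative)."""
    out = list(intervals)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


clip_a = [
    [("surgeon", "positions", "robot")],
    [("surgeon", "cuts", "bone")],
]
clip_b = swap_adjacent(clip_a, 0)

print(bag(clip_a) == bag(clip_b))              # same triplets overall
print(serialize(clip_a) == serialize(clip_b))  # different once order is kept
```

Pairs produced this way could serve both as the controlled test described above and as candidates for order-based hard negatives, although the source does not state how its hard negatives are constructed.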

Figures

Figures reproduced from arXiv: 2606.17298 by Hao Ding, Mathias Unberath, Yiqing Shen.

**Figure 1.** Overview of OR3. Each OR video clip C_i is converted into an actual ActDT through vision foundation models (Qwen3-VL, SAM-3, DepthAnythingV3), encoding subject-action-object triplets within non-overlapping temporal intervals. Given an implicit query q and OR metadata K, an LLM imagines a hypothetical ActDT for q that predicts the action primitives the target clip would contain. A shared text encoder …

**Figure 2.** Qualitative comparison on the OR reasoning retrieval benchmark across four query categories. Red borders indicate incorrect retrievals where methods return clips that do not match the query, while green borders denote correct retrievals by OR3.

**Figure 3.** Additional analyses of OR3. (a) R@1 per query category as a function of evidence-grounded refinement round t, where t=0 corresponds to the initial imagined ActDT before any refinement. (b) Sensitivity to the number of top-K candidates forwarded from imagination-based retrieval to refinement, with the star marking the selected operating point (K=10). (c) R@1 of all compared methods as the retrieval corpus…

Original abstract

Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, to unlock its full potential text-to-video retrieval must be able to handle implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). However, existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OR3 shifts text-to-video retrieval to intra-modal matching via ActDTs and LLM imagination, delivering clear gains on implicit OR queries but resting on untested assumptions about temporal grouping and LLM reliability.

The main takeaway is that this paper moves away from standard cross-modal embeddings toward converting OR clips into structured action-driven digital twins and letting an LLM imagine matching representations for retrieval. That change lets it handle implicit queries like "the step right before clipping" where global methods fall short.

What stands out as new is the ActDT construction from subject-action-object triplets grouped into non-overlapping intervals, the imagination step that generates hypothetical ActDTs from text, and the evidence-grounded refinement that adjusts those guesses against top candidates. They also train with ActDT-specific hard negatives. The benchmark of 276 queries across four reasoning categories pulled from MM-OR is a concrete addition, and the reported 57.6 R@1 and 77.3 R@5 beat the strongest baseline.

The numbers look useful for the targeted OR safety use case. The pipeline is straightforward to describe and the results show fine-grained discrimination on visually similar clips.

The soft spots are around the core assumptions. Grouping concurrent triplets into non-overlapping intervals could drop ordering or overlap details that matter for queries hinging on sequence. The abstract gives no numbers on triplet extraction accuracy or how often the refinement step actually corrects LLM outputs versus introducing new errors. Because the benchmark is built for this paper, it is hard to judge whether the query set or category splits favor the method.

This is worth sending to peer review. The idea is distinct enough and the task is practical, so referees can check the implementation details, baseline fairness, and whether the temporal grouping holds up on the actual data.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OR3 for text-to-video retrieval on operating room clips, converting videos to action-driven digital twins (ActDTs) by grouping concurrent subject-action-object triplets into non-overlapping temporal intervals. It performs imagination-based retrieval by having an LLM generate hypothetical ActDTs from implicit queries (e.g., 'the step right before clipping'), followed by intra-modal matching via a single encoder trained with ActDT-tailored hard negatives and evidence-grounded refinement of the imagined ActDTs. A new benchmark is constructed from MM-OR comprising 276 implicit queries across four reasoning categories on 386 robotic knee procedure clips; OR3 reports 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline and enabling fine-grained discrimination via temporal action reasoning.

Significance. If the central results hold, the work offers a concrete advance in reasoning-capable retrieval for safety-critical procedural video, moving beyond global embeddings to handle implicit queries that standard methods cannot address. The ActDT representation, LLM imagination pipeline, and new benchmark provide a reusable framework with direct applicability to OR safety inspection; the reported margins on a non-trivial query set constitute a falsifiable starting point for the community.

major comments (2)

[Abstract and ActDT construction description] The central claim (57.6 R@1 on implicit queries) rests on the assertion that grouping concurrent triplets into non-overlapping temporal intervals plus LLM-generated/refined ActDTs suffices to model the required reasoning. The manuscript provides no quantitative assessment of triplet extraction accuracy, information loss from discarding overlaps/ordering, or how often evidence-grounded refinement corrects hallucinations; without these, the performance margin cannot be attributed to the proposed mechanism rather than benchmark artifacts.
[Evaluation / benchmark section] Benchmark construction (276 queries, four reasoning categories, 386 clips from MM-OR): the outperformance claim is load-bearing on the queries being genuinely implicit and free of selection effects, yet no details are given on query generation process, inter-annotator agreement, or per-category breakdown of results, making it impossible to rule out that the reported gains are driven by easier subsets or construction biases.

minor comments (2)

The four reasoning categories are referenced but not explicitly defined or exemplified; adding a table or appendix listing representative queries per category would improve reproducibility.
Clarify the exact form of the ActDT representation (e.g., how intervals are encoded for the single encoder) and whether any temporal ordering information is explicitly preserved beyond the non-overlapping grouping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for additional validation of the ActDT pipeline and benchmark construction. We address each major comment below and will incorporate revisions to strengthen the attribution of results to the proposed mechanisms.

Referee: [Abstract and ActDT construction description] The central claim (57.6 R@1 on implicit queries) rests on the assertion that grouping concurrent triplets into non-overlapping temporal intervals plus LLM-generated/refined ActDTs suffices to model the required reasoning. The manuscript provides no quantitative assessment of triplet extraction accuracy, information loss from discarding overlaps/ordering, or how often evidence-grounded refinement corrects hallucinations; without these, the performance margin cannot be attributed to the proposed mechanism rather than benchmark artifacts.

Authors: We agree that quantitative assessments of triplet extraction accuracy, the effects of discarding overlaps/ordering, and the correction rate of evidence-grounded refinement would strengthen the paper and help attribute gains to the ActDT representation. The current manuscript focuses on end-to-end retrieval performance and does not include these component-level analyses. In revision we will add: (1) triplet extraction accuracy measured against available annotations in MM-OR, (2) an analysis of information loss by comparing ActDTs with and without overlap handling on a subset of clips, and (3) statistics on the frequency and nature of changes introduced by the refinement step across the 276 queries.
Revision: yes
Referee: [Evaluation / benchmark section] Benchmark construction (276 queries, four reasoning categories, 386 clips from MM-OR): the outperformance claim is load-bearing on the queries being genuinely implicit and free of selection effects, yet no details are given on query generation process, inter-annotator agreement, or per-category breakdown of results, making it impossible to rule out that the reported gains are driven by easier subsets or construction biases.

Authors: We agree that explicit documentation of the query generation process, inter-annotator agreement, and per-category results is necessary to substantiate that the queries are implicit and that gains are not driven by construction artifacts. The manuscript states only the final counts and categories. In the revision we will add: (1) a detailed description of the expert-driven query generation protocol used to create the 276 implicit queries across the four reasoning categories, (2) inter-annotator agreement statistics computed during query validation, and (3) a per-category breakdown of R@1 and R@5 for OR3 and baselines.
Revision: yes
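
The per-category breakdown requested here is simple to compute once each query has a ranked list of clip IDs. The following is a minimal sketch, assuming a hypothetical result format (a dict with `category`, `target`, and `ranked` keys) and placeholder category names, since the source does not name the four reasoning categories. R@K is taken in its usual sense: the percentage of queries whose target clip appears in the top K results.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the target clip is among the top-k ranked clips, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def per_category_recall(
    results: List[dict], ks: Sequence[int] = (1, 5)
) -> Dict[str, Dict[int, float]]:
    """Average R@k (in percent) separately for each query category."""
    hits = defaultdict(lambda: defaultdict(list))
    for r in results:
        for k in ks:
            hits[r["category"]][k].append(recall_at_k(r["ranked"], r["target"], k))
    return {
        cat: {k: 100.0 * sum(v) / len(v) for k, v in by_k.items()}
        for cat, by_k in hits.items()
    }


# Toy results; category names and clip IDs are placeholders, not benchmark data.
toy_results = [
    {"category": "cat_A", "target": "c1", "ranked": ["c1", "c2", "c3"]},
    {"category": "cat_A", "target": "c2", "ranked": ["c3", "c2", "c1"]},
    {"category": "cat_B", "target": "c9", "ranked": ["c1", "c2", "c3", "c4", "c5", "c9"]},
]
table = per_category_recall(toy_results)
for cat, scores in table.items():
    print(cat, scores)
```

Reporting the same table for every baseline would show whether the overall margin comes from all categories or from one easier subset, which is the concern the referee raises.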

Circularity Check

0 steps flagged

No circularity; new components and benchmark are self-contained

The paper defines ActDTs, the non-overlapping triplet grouping procedure, LLM imagination step, hard-negative training, and evidence-grounded refinement entirely within the present work; the benchmark is newly built from MM-OR rather than reusing prior fitted quantities. No equations, uniqueness claims, or performance predictions reduce by construction to self-citations or to the inputs themselves. The reported R@1/R@5 numbers are empirical outcomes on the new implicit-query set and do not collapse to tautological renaming or fitted-input predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the new ActDT representation and LLM imagination step being sufficient for implicit reasoning queries; these are introduced without external independent evidence in the abstract.

axioms (1)

Domain assumption: Action-driven digital twins formed by grouping concurrent subject-action-object triplets under non-overlapping temporal intervals capture the temporal and semantic structure needed for reasoning over implicit OR queries.
This representation is foundational to the intra-modal matching and refinement steps described.

invented entities (1)

Action-driven digital twins (ActDTs) · no independent evidence
purpose: Structured representation of video clips for temporal action reasoning in retrieval.
New entity introduced to enable the proposed imagination-based matching; no independent evidence outside this work is mentioned.

pith-pipeline@v0.9.1-grok · 5781 in / 1396 out tokens · 55969 ms · 2026-06-27T03:23:38.947931+00:00 · methodology
