Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Han Zhang; Mathias Unberath; Yiqing Shen

arxiv: 2606.17279 · v1 · pith:WAFVJ4L3new · submitted 2026-06-15 · 💻 cs.CV

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Yiqing Shen , Han Zhang , Mathias Unberath This is my paper

Pith reviewed 2026-06-27 03:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords surgical video question answeringreinforcement learningdigital twin representationslarge language modelsmulti-step reasoningcolonoscopyvideoqa benchmarkshierarchical representations

0 comments

The pith

Training LLMs with reinforcement learning over digital twin representations decouples perception from reasoning and improves performance on surgical video question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing VideoQA methods compress videos into discrete tokens, which breaks continuous spatial-temporal relationships and limits multi-step reasoning. It introduces an RL training framework where LLMs operate directly on digital twin representations built from surgical foundation models, using hierarchical structures at frame, window, and procedure levels that include uncertainty estimates. A new reward function validates output format while scoring accuracy through clinical plausibility checks and uncertainty-aware calibration. The authors also release REAL-Colon-Reason, a benchmark of 2000 questions at three complexity levels. Experiments show state-of-the-art results on this new set and on REAL-Colon-VQA and EndoVis18-VQA.

Core claim

An RL framework trains LLMs to operate over digital twin representations from surgical foundation models, thereby decoupling perception from reasoning; the framework adds hierarchical probabilistic representations across frame, temporal-window, and procedure levels together with a reward that combines format validation, accuracy, clinical plausibility evaluation, and uncertainty-aware calibration, yielding state-of-the-art results on REAL-Colon-Reason and two prior surgical VideoQA benchmarks.

What carries the argument

Reinforcement learning over hierarchical digital twin representations (frame, temporal window, procedure levels) with uncertainty estimates and a reward that merges format validation, clinical plausibility, and uncertainty-aware calibration.

If this is right

Multi-step reasoning across semantic, spatial, and temporal dimensions becomes feasible without token-level compression artifacts.
Hierarchical representations with uncertainty estimates allow the model to handle varying levels of procedural complexity.
The clinical-plausibility component of the reward improves alignment with medical validity beyond simple accuracy.
State-of-the-art results transfer to existing surgical VideoQA benchmarks without task-specific architectural changes.
The new REAL-Colon-Reason dataset provides graded evaluation of reasoning depth in colonoscopy videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling strategy could be tested on non-surgical temporal reasoning tasks such as sports video analysis or robotic manipulation sequences.
Uncertainty estimates produced at each hierarchy level might be used directly for human-in-the-loop clinical review.
If digital twins already encode the needed geometry, future work could explore whether the perception module itself can be frozen rather than jointly trained.
The reward design combining plausibility and calibration offers a template for other domains where correctness is not purely factual.

Load-bearing premise

Digital twin representations constructed from surgical foundation models preserve continuous spatial-temporal relationships well enough for LLMs to perform effective decoupled reasoning.

What would settle it

An experiment that measures whether the reported performance gain disappears when the same RL training is applied to video inputs whose spatial-temporal continuity has been deliberately broken while keeping semantic content intact.

Figures

Figures reproduced from arXiv: 2606.17279 by Han Zhang, Mathias Unberath, Yiqing Shen.

**Figure 1.** Figure 1: Overview of the proposed method. LLM generates structured rollout sequences to plan, construct, and reason over digital twin representations of surgical videos. require multi-step semantic, spatial, and temporal reasoning across varying complexity levels. 2 Methods Structure of Rollout Sequence. Given a surgical video V = {I (1), . . . , I (T)} with T frames and an implicit query Q, we train LLM to genera… view at source ↗

read the original abstract

Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper tries to fix token compression in surgical VideoQA by training LLMs via RL on digital twin reps from foundation models, but the continuity preservation claim is the part that needs checking.

read the letter

The core move is an RL setup that lets the LLM work over hierarchical digital twin representations (frame, window, procedure levels with uncertainty) instead of direct video tokens, plus a reward that mixes format, accuracy, clinical plausibility, and uncertainty calibration. They also release REAL-Colon-Reason, a 2000-pair colonoscopic benchmark with three complexity tiers, and report SOTA on that plus two prior surgical VideoQA sets.

What stands out is the explicit attempt to decouple perception from reasoning through this intermediate representation and the new benchmark. The reward design and the multi-level hierarchy with probabilistic estimates are concrete additions that target a documented weakness in existing token-based pipelines.

The soft spot is exactly the one the stress-test flags. The whole argument rests on digital twins built from surgical foundation models actually keeping enough continuous spatial-temporal structure so that RL can do the decoupling work. If those twins inherit the same discretization or compression the paper criticizes, then the reported gains could just come from the reward function or dataset-specific tuning rather than the claimed separation. The abstract gives no derivations, no ablation on the twin construction, and no error bars, so it is not possible to tell whether the assumption holds.

This is for people working on medical video reasoning or RL for vision-language models in constrained domains. A reader who wants the new benchmark and a worked example of uncertainty-aware rewards in surgery will find usable pieces. The paper shows clear thinking about the token limitation and honest engagement with the clinical constraints, so it is coherent on its own terms.

I would send it to peer review so the methods and twin construction details can be examined properly.

Referee Report

3 major / 1 minor

Summary. The paper proposes an RL framework to train LLMs for surgical VideoQA by operating over digital twin representations (constructed from surgical foundation models with hierarchical frame/temporal-window/procedure levels and probabilistic uncertainty estimates) to decouple perception from reasoning. It introduces a novel reward combining format validation, accuracy via clinical plausibility, and uncertainty-aware calibration, plus the new REAL-Colon-Reason benchmark with 2000 QA pairs, claiming SOTA results on REAL-Colon-Reason, REAL-Colon-VQA, and EndoVis18-VQA.

Significance. If the central claim holds, the work could advance multi-step reasoning in medical video understanding by mitigating fragmentation of spatial-temporal structure. The new benchmark and uncertainty-aware reward are useful contributions; however, the significance hinges on whether gains are attributable to the digital-twin decoupling rather than reward engineering or benchmark tuning.

major comments (3)

[Experiments] The manuscript provides no ablation or analysis (e.g., in the Experiments or Results sections) isolating the contribution of digital twin representations from the novel reward function; without this, SOTA claims on the three benchmarks cannot be attributed to decoupling rather than reward design or post-hoc tuning.
[Method] The core assumption that digital twin representations preserve continuous spatial-temporal relationships (unlike token-based compression) is load-bearing for the decoupling claim but receives no direct validation, such as continuity metrics or comparison against foundation-model token outputs at the scales needed for multi-step reasoning.
[Results] Dataset construction details, error bars, and statistical tests for the reported SOTA on REAL-Colon-Reason (and the two existing benchmarks) are absent, undermining verification of the performance claims.

minor comments (1)

[Method] Notation for the hierarchical levels and uncertainty estimates should be formalized with equations for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major comment below, providing clarifications and committing to revisions where the manuscript can be strengthened.

read point-by-point responses

Referee: [Experiments] The manuscript provides no ablation or analysis (e.g., in the Experiments or Results sections) isolating the contribution of digital twin representations from the novel reward function; without this, SOTA claims on the three benchmarks cannot be attributed to decoupling rather than reward design or post-hoc tuning.

Authors: We agree that an ablation study is essential to isolate the effects. The current manuscript focuses on the overall framework performance, but we will add a dedicated ablation section in the revised manuscript comparing the full model against ablated versions (e.g., without hierarchical digital twins, without uncertainty estimates, and using standard rewards) to attribute the contributions appropriately. revision: yes
Referee: [Method] The core assumption that digital twin representations preserve continuous spatial-temporal relationships (unlike token-based compression) is load-bearing for the decoupling claim but receives no direct validation, such as continuity metrics or comparison against foundation-model token outputs at the scales needed for multi-step reasoning.

Authors: The assumption is indeed central, and while the design of hierarchical probabilistic representations is intended to address this, we recognize the value of direct validation. In the revision, we will include quantitative comparisons, such as measuring spatial-temporal continuity metrics between digital twin representations and direct token outputs from the foundation models. revision: yes
Referee: [Results] Dataset construction details, error bars, and statistical tests for the reported SOTA on REAL-Colon-Reason (and the two existing benchmarks) are absent, undermining verification of the performance claims.

Authors: We acknowledge these omissions in the presentation. The revised manuscript will expand the dataset section with full construction details for REAL-Colon-Reason, include error bars from repeated experiments, and report p-values from appropriate statistical tests to substantiate the SOTA claims. revision: yes

Circularity Check

0 steps flagged

No derivation chain or self-referential reductions present

full rationale

The abstract and available text describe an empirical RL framework using digital twin representations from foundation models, hierarchical levels, uncertainty estimates, and a composite reward, plus SOTA results on new and existing benchmarks. No equations, derivations, fitted parameters presented as predictions, or self-citations appear. No load-bearing step reduces by construction to its inputs, satisfying the requirement to quote specific reductions before flagging circularity. This matches the most common honest finding for papers without mathematical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Abstract introduces several new constructs without upstream evidence or parameter counts; full paper would be needed to enumerate fitted values or background axioms.

invented entities (3)

digital twin representations no independent evidence
purpose: Decouple perception from reasoning by providing structured input to LLMs instead of compressed video tokens
Constructed from surgical foundation models; no independent evidence supplied in abstract
hierarchical representations with probabilistic uncertainty estimates no independent evidence
purpose: Capture information at frame, temporal window, and procedure levels
Proposed as part of the framework; no external validation mentioned
novel reward combining format validation, accuracy, clinical plausibility, and uncertainty-aware calibration no independent evidence
purpose: Guide RL training of the LLM
Introduced specifically for this training setup; no prior reference or external test given

pith-pipeline@v0.9.1-grok · 5707 in / 1309 out tokens · 41113 ms · 2026-06-27T03:37:16.038765+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 9 linked inside Pith

[1]

Surgical- vqla: Transformer with gated vision-language embedding for visual question localized-answering in robotic surgery.arXiv preprint arXiv:2305.11692, 2023

Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, and Hongliang Ren. Surgical- vqla: Transformer with gated vision-language embedding for visual question localized-answering in robotic surgery.arXiv preprint arXiv:2305.11692, 2023

arXiv 2023
[2]

Shuai Bai et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[3]

Qwen3-vl technical report, 2025

Shuai Bai et al. Qwen3-vl technical report, 2025

2025
[4]

Real-colon: A dataset for developing real-world ai applications in colonoscopy.Scientific Data, 11(1):539, 2024

Carlo Biffi et al. Real-colon: A dataset for developing real-world ai applications in colonoscopy.Scientific Data, 11(1):539, 2024

2024
[5]

Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery

Kexin Chen et al. Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10772–10778. IEEE, 2024

2024
[6]

Visual question answering in robotic surgery: A comprehensive review.IEEE Access, 2025

Di Ding, Tianliang Yao, Rong Luo, and Xusen Sun. Visual question answering in robotic surgery: A comprehensive review.IEEE Access, 2025

2025
[7]

Surgvivqa: Temporally-grounded video question an- swering for surgical scene understanding.arXiv preprint arXiv:2511.03325, 2025

Mauro Orazio Drago et al. Surgvivqa: Temporally-grounded video question an- swering for surgical scene understanding.arXiv preprint arXiv:2511.03325, 2025

Pith/arXiv arXiv 2025
[8]

Automated capture of intraoperative adverse events using artificial intelligence: a systematic review and meta-analysis.Journal of Clinical Medicine, 12(4):1687, 2023

Michael B Eppler et al. Automated capture of intraoperative adverse events using artificial intelligence: a systematic review and meta-analysis.Journal of Clinical Medicine, 12(4):1687, 2023

2023
[9]

Pitvqa: Image-grounded text embedding llm for visual question answering in pituitary surgery

Runlong He et al. Pitvqa: Image-grounded text embedding llm for visual question answering in pituitary surgery. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 488–498. Springer, 2024

2024
[10]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[11]

Smile: A composite lexical-semantic metric for question- answering evaluation.arXiv preprint arXiv:2511.17432, 2025

Shrikant Kendre et al. Smile: A composite lexical-semantic metric for question- answering evaluation.arXiv preprint arXiv:2511.17432, 2025

arXiv 2025
[12]

Recognize any surgical object: Unleashing the power of weakly- supervised data.arXiv preprint arXiv:2501.15326, 2025

Jiajie Li et al. Recognize any surgical object: Unleashing the power of weakly- supervised data.arXiv preprint arXiv:2501.15326, 2025

arXiv 2025
[13]

Neural-symbolic videoqa: Learning compositional spatio-temporal reasoning for real-world video question answering.arXiv preprint arXiv:2404.04007, 2024

Lili Liang, Guanglu Sun, Jin Qiu, and Lizhong Zhang. Neural-symbolic videoqa: Learning compositional spatio-temporal reasoning for real-world video question answering.arXiv preprint arXiv:2404.04007, 2024

arXiv 2024
[14]

Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning.arXiv preprint arXiv:2408.07931, 2024

Haofeng Liu et al. Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning.arXiv preprint arXiv:2408.07931, 2024. 10 Y. Shen et al

arXiv 2024
[15]

Sam2s: Segment anything in surgical videos via semantic long- term tracking.arXiv preprint arXiv:2511.16618, 2025

Haofeng Liu et al. Sam2s: Segment anything in surgical videos via semantic long- term tracking.arXiv preprint arXiv:2511.16618, 2025

arXiv 2025
[16]

Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems, 36, 2024

Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems, 36, 2024

2024
[17]

When to trust the answer: Question-aligned semantic nearest neighbor entropy for safer surgical vqa.arXiv preprint arXiv:2511.01458, 2025

Dennis Pierantozzi, Luca Carlini, Mauro Orazio Drago, Chiara Lena, Cesare Has- san, Elena De Momi, Danail Stoyanov, Sophia Bano, and Mobarak I Hoque. When to trust the answer: Question-aligned semantic nearest neighbor entropy for safer surgical vqa.arXiv preprint arXiv:2511.01458, 2025

Pith/arXiv arXiv 2025
[18]

Surgicalgpt: end-to-end language-vision gpt for vi- sual question answering in surgery

Lalithkumar Seenivasan et al. Surgicalgpt: end-to-end language-vision gpt for vi- sual question answering in surgery. InInternational conference on medical image computing and computer-assisted intervention, pages 281–290. Springer, 2023

2023
[19]

Surgical-vqa: Visual question answering in surgical scenes using transformer

Lalithkumar Seenivasan, Mobarakol Islam, Adithya K Krishna, and Hongliang Ren. Surgical-vqa: Visual question answering in surgical scenes using transformer. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 33–43. Springer, 2022

2022
[20]

Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Andrew Sellergren et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025
[21]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[22]

Position: Foundation models need digital twin representations.arXiv preprint arXiv:2505.03798, 2025

Yiqing Shen, Hao Ding, Lalithkumar Seenivasan, Tianmin Shu, and Mathias Un- berath. Position: Foundation models need digital twin representations.arXiv preprint arXiv:2505.03798, 2025

arXiv 2025
[23]

Reasoning text- to-video retrieval via digital twin video representations and large language models

Yiqing Shen, Chenxiao Fan, Chenjia Li, and Mathias Unberath. Reasoning text- to-video retrieval via digital twin video representations and large language models. arXiv preprint arXiv:2511.12371, 2025

arXiv 2025
[24]

Text-driven reasoning video editing via reinforcement learning on digital twin representations.arXiv preprint arXiv:2511.14100, 2025

Yiqing Shen, Chenjia Li, and Mathias Unberath. Text-driven reasoning video editing via reinforcement learning on digital twin representations.arXiv preprint arXiv:2511.14100, 2025

arXiv 2025
[25]

Online reasoning video segmentation with just-in-time digital twins.arXiv preprint arXiv:2503.21056, 2025

Yiqing Shen, Bohan Liu, Chenjia Li, Lalithkumar Seenivasan, and Mathias Un- berath. Online reasoning video segmentation with just-in-time digital twins.arXiv preprint arXiv:2503.21056, 2025

arXiv 2025
[26]

Constructing and interpreting digital twin representations for visual reasoning via reinforcement learning.arXiv preprint arXiv:2511.12365, 2025

Yiqing Shen and Mathias Unberath. Constructing and interpreting digital twin representations for visual reasoning via reinforcement learning.arXiv preprint arXiv:2511.12365, 2025

arXiv 2025
[27]

Explore multi-step rea- soning in video question answering

Xiaomeng Song, Yucheng Shi, Xin Chen, and Yahong Han. Explore multi-step rea- soning in video question answering. InProceedings of the 26th ACM international conference on Multimedia, pages 239–247, 2018

2018
[28]

Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery.arXiv preprint arXiv:2405.10948, 2024

Guankun Wang et al. Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery.arXiv preprint arXiv:2405.10948, 2024

arXiv 2024
[29]

Grounded-videollm: Sharpening fine-grained temporal ground- ing in video large language models.arXiv preprint arXiv:2410.03290, 2024

Haibo Wang et al. Grounded-videollm: Sharpening fine-grained temporal ground- ing in video large language models.arXiv preprint arXiv:2410.03290, 2024

arXiv 2024
[30]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022
[31]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[32]

Depth anything v2.arXiv preprint arXiv:2406.09414, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, et al. Depth anything v2.arXiv preprint arXiv:2406.09414, 2024

Pith/arXiv arXiv 2024
[33]

Advancing surgical vqa with scene graph knowledge.International journal of computer assisted radiology and surgery, 19(7):1409–1417, 2024

Kun Yuan et al. Advancing surgical vqa with scene graph knowledge.International journal of computer assisted radiology and surgery, 19(7):1409–1417, 2024. Reasoning-Intensive Surgical VideoQA 11

2024
[34]

Cascade multi-level transformer network for surgical workflow analysis.IEEE transactions on medical imaging, 42(10):2817–2831, 2023

Wenxi Yue et al. Cascade multi-level transformer network for surgical workflow analysis.IEEE transactions on medical imaging, 42(10):2817–2831, 2023

2023
[35]

Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

Boqiang Zhang et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

Pith/arXiv arXiv 2025
[36]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

2023
[37]

Proreason: Multi-modal proactive reasoning with decoupled eyesight and wisdom

Jingqi Zhou, Sheng Wang, Jingwei Dong, Kai Liu, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, and Chuan Wu. Proreason: Multi-modal proactive reasoning with decoupled eyesight and wisdom. InProceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing, pages 31650–31679, 2025

2025
[38]

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

Pith/arXiv arXiv 2025

[1] [1]

Surgical- vqla: Transformer with gated vision-language embedding for visual question localized-answering in robotic surgery.arXiv preprint arXiv:2305.11692, 2023

Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, and Hongliang Ren. Surgical- vqla: Transformer with gated vision-language embedding for visual question localized-answering in robotic surgery.arXiv preprint arXiv:2305.11692, 2023

arXiv 2023

[2] [2]

Shuai Bai et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025

[3] [3]

Qwen3-vl technical report, 2025

Shuai Bai et al. Qwen3-vl technical report, 2025

2025

[4] [4]

Real-colon: A dataset for developing real-world ai applications in colonoscopy.Scientific Data, 11(1):539, 2024

Carlo Biffi et al. Real-colon: A dataset for developing real-world ai applications in colonoscopy.Scientific Data, 11(1):539, 2024

2024

[5] [5]

Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery

Kexin Chen et al. Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10772–10778. IEEE, 2024

2024

[6] [6]

Visual question answering in robotic surgery: A comprehensive review.IEEE Access, 2025

Di Ding, Tianliang Yao, Rong Luo, and Xusen Sun. Visual question answering in robotic surgery: A comprehensive review.IEEE Access, 2025

2025

[7] [7]

Surgvivqa: Temporally-grounded video question an- swering for surgical scene understanding.arXiv preprint arXiv:2511.03325, 2025

Mauro Orazio Drago et al. Surgvivqa: Temporally-grounded video question an- swering for surgical scene understanding.arXiv preprint arXiv:2511.03325, 2025

Pith/arXiv arXiv 2025

[8] [8]

Automated capture of intraoperative adverse events using artificial intelligence: a systematic review and meta-analysis.Journal of Clinical Medicine, 12(4):1687, 2023

Michael B Eppler et al. Automated capture of intraoperative adverse events using artificial intelligence: a systematic review and meta-analysis.Journal of Clinical Medicine, 12(4):1687, 2023

2023

[9] [9]

Pitvqa: Image-grounded text embedding llm for visual question answering in pituitary surgery

Runlong He et al. Pitvqa: Image-grounded text embedding llm for visual question answering in pituitary surgery. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 488–498. Springer, 2024

2024

[10] [10]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022

[11] [11]

Smile: A composite lexical-semantic metric for question- answering evaluation.arXiv preprint arXiv:2511.17432, 2025

Shrikant Kendre et al. Smile: A composite lexical-semantic metric for question- answering evaluation.arXiv preprint arXiv:2511.17432, 2025

arXiv 2025

[12] [12]

Recognize any surgical object: Unleashing the power of weakly- supervised data.arXiv preprint arXiv:2501.15326, 2025

Jiajie Li et al. Recognize any surgical object: Unleashing the power of weakly- supervised data.arXiv preprint arXiv:2501.15326, 2025

arXiv 2025

[13] [13]

Neural-symbolic videoqa: Learning compositional spatio-temporal reasoning for real-world video question answering.arXiv preprint arXiv:2404.04007, 2024

Lili Liang, Guanglu Sun, Jin Qiu, and Lizhong Zhang. Neural-symbolic videoqa: Learning compositional spatio-temporal reasoning for real-world video question answering.arXiv preprint arXiv:2404.04007, 2024

arXiv 2024

[14] [14]

Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning.arXiv preprint arXiv:2408.07931, 2024

Haofeng Liu et al. Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning.arXiv preprint arXiv:2408.07931, 2024. 10 Y. Shen et al

arXiv 2024

[15] [15]

Sam2s: Segment anything in surgical videos via semantic long- term tracking.arXiv preprint arXiv:2511.16618, 2025

Haofeng Liu et al. Sam2s: Segment anything in surgical videos via semantic long- term tracking.arXiv preprint arXiv:2511.16618, 2025

arXiv 2025

[16] [16]

Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems, 36, 2024

Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems, 36, 2024

2024

[17] [17]

When to trust the answer: Question-aligned semantic nearest neighbor entropy for safer surgical vqa.arXiv preprint arXiv:2511.01458, 2025

Dennis Pierantozzi, Luca Carlini, Mauro Orazio Drago, Chiara Lena, Cesare Has- san, Elena De Momi, Danail Stoyanov, Sophia Bano, and Mobarak I Hoque. When to trust the answer: Question-aligned semantic nearest neighbor entropy for safer surgical vqa.arXiv preprint arXiv:2511.01458, 2025

Pith/arXiv arXiv 2025

[18] [18]

Surgicalgpt: end-to-end language-vision gpt for vi- sual question answering in surgery

Lalithkumar Seenivasan et al. Surgicalgpt: end-to-end language-vision gpt for vi- sual question answering in surgery. InInternational conference on medical image computing and computer-assisted intervention, pages 281–290. Springer, 2023

2023

[19] [19]

Surgical-vqa: Visual question answering in surgical scenes using transformer

Lalithkumar Seenivasan, Mobarakol Islam, Adithya K Krishna, and Hongliang Ren. Surgical-vqa: Visual question answering in surgical scenes using transformer. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 33–43. Springer, 2022

2022

[20] [20]

Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Andrew Sellergren et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025

[21] [21]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[22] [22]

Position: Foundation models need digital twin representations.arXiv preprint arXiv:2505.03798, 2025

Yiqing Shen, Hao Ding, Lalithkumar Seenivasan, Tianmin Shu, and Mathias Un- berath. Position: Foundation models need digital twin representations.arXiv preprint arXiv:2505.03798, 2025

arXiv 2025

[23] [23]

Reasoning text- to-video retrieval via digital twin video representations and large language models

Yiqing Shen, Chenxiao Fan, Chenjia Li, and Mathias Unberath. Reasoning text- to-video retrieval via digital twin video representations and large language models. arXiv preprint arXiv:2511.12371, 2025

arXiv 2025

[24] [24]

Text-driven reasoning video editing via reinforcement learning on digital twin representations.arXiv preprint arXiv:2511.14100, 2025

Yiqing Shen, Chenjia Li, and Mathias Unberath. Text-driven reasoning video editing via reinforcement learning on digital twin representations.arXiv preprint arXiv:2511.14100, 2025

arXiv 2025

[25] [25]

Online reasoning video segmentation with just-in-time digital twins.arXiv preprint arXiv:2503.21056, 2025

Yiqing Shen, Bohan Liu, Chenjia Li, Lalithkumar Seenivasan, and Mathias Un- berath. Online reasoning video segmentation with just-in-time digital twins.arXiv preprint arXiv:2503.21056, 2025

arXiv 2025

[26] [26]

Constructing and interpreting digital twin representations for visual reasoning via reinforcement learning.arXiv preprint arXiv:2511.12365, 2025

Yiqing Shen and Mathias Unberath. Constructing and interpreting digital twin representations for visual reasoning via reinforcement learning.arXiv preprint arXiv:2511.12365, 2025

arXiv 2025

[27] [27]

Explore multi-step rea- soning in video question answering

Xiaomeng Song, Yucheng Shi, Xin Chen, and Yahong Han. Explore multi-step rea- soning in video question answering. InProceedings of the 26th ACM international conference on Multimedia, pages 239–247, 2018

2018

[28] [28]

Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery.arXiv preprint arXiv:2405.10948, 2024

Guankun Wang et al. Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery.arXiv preprint arXiv:2405.10948, 2024

arXiv 2024

[29] [29]

Grounded-videollm: Sharpening fine-grained temporal ground- ing in video large language models.arXiv preprint arXiv:2410.03290, 2024

Haibo Wang et al. Grounded-videollm: Sharpening fine-grained temporal ground- ing in video large language models.arXiv preprint arXiv:2410.03290, 2024

arXiv 2024

[30] [30]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022

[31] [31]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[32] [32]

Depth anything v2.arXiv preprint arXiv:2406.09414, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, et al. Depth anything v2.arXiv preprint arXiv:2406.09414, 2024

Pith/arXiv arXiv 2024

[33] [33]

Advancing surgical vqa with scene graph knowledge.International journal of computer assisted radiology and surgery, 19(7):1409–1417, 2024

Kun Yuan et al. Advancing surgical vqa with scene graph knowledge.International journal of computer assisted radiology and surgery, 19(7):1409–1417, 2024. Reasoning-Intensive Surgical VideoQA 11

2024

[34] [34]

Cascade multi-level transformer network for surgical workflow analysis.IEEE transactions on medical imaging, 42(10):2817–2831, 2023

Wenxi Yue et al. Cascade multi-level transformer network for surgical workflow analysis.IEEE transactions on medical imaging, 42(10):2817–2831, 2023

2023

[35] [35]

Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

Boqiang Zhang et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

Pith/arXiv arXiv 2025

[36] [36]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

2023

[37] [37]

Proreason: Multi-modal proactive reasoning with decoupled eyesight and wisdom

Jingqi Zhou, Sheng Wang, Jingwei Dong, Kai Liu, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, and Chuan Wu. Proreason: Multi-modal proactive reasoning with decoupled eyesight and wisdom. InProceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing, pages 31650–31679, 2025

2025

[38] [38]

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

Pith/arXiv arXiv 2025