Recognition: 1 Lean theorem link
Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery
Pith reviewed 2026-05-15 09:46 UTC · model grok-4.3
The pith
Using speech commands and live video alone, surgeons can segment and track instruments and receive skull-base surgical guidance with 2.32 mm tool-tip accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a speech-guided embodied agent framework that integrates natural language interaction with real-time visual perception on live intraoperative video. The system performs interactive segmentation and labeling of the surgical instrument, uses the segmented instrument as a spatial anchor that is autonomously tracked, and thereby enables anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based tool-pose estimation, and image guidance through real-time anatomical overlays. In three experimental trials the hybrid vision-based method produced a mean absolute tool-tip position error of 2.32 ± 1.10 mm in the camera frame, together with inter-frame yaw and pitch propagation discrepancies of 0.18 ± 0.25° and 0.21 ± 0.30°, respectively.
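As a deliberately simplified illustration of the control flow described above, the sketch below routes a transcribed surgeon query to a perception task while the instrument anchor is re-tracked on every call. This is not the authors' implementation: the keyword routing stands in for their natural-language interface, and every function name is a hypothetical placeholder.

```python
from typing import Callable, Dict

# Hypothetical task stubs standing in for the paper's perception modules.
def segment_instrument(frame):           # interactive instrument segmentation
    return {"mask": None, "label": "drill"}

def track_anchor(frame, anchor):         # propagate the instrument anchor
    return anchor

def register_model(frame, anchor):       # align a preoperative 3D model to the video
    return {"pose_model_to_camera": None}

def estimate_tool_pose(frame, anchor):   # monocular tool-pose estimation
    return {"tip_xyz_mm": (0.0, 0.0, 0.0), "yaw_deg": 0.0, "pitch_deg": 0.0}

def render_overlay(frame, registration): # anatomical overlay for image guidance
    return frame

# Keyword routing is an assumption for illustration only; the paper describes
# natural-language interaction, which would need a real speech/intent model.
ROUTES: Dict[str, Callable] = {
    "segment": lambda state, frame: state.update(anchor=segment_instrument(frame)),
    "register": lambda state, frame: state.update(registration=register_model(frame, state.get("anchor"))),
    "pose": lambda state, frame: state.update(pose=estimate_tool_pose(frame, state.get("anchor"))),
    "overlay": lambda state, frame: render_overlay(frame, state.get("registration")),
}

def handle_query(query: str, frame, state: dict) -> dict:
    """Dispatch a surgeon query to the first matching task; re-track the anchor each call."""
    if state.get("anchor") is not None:
        state["anchor"] = track_anchor(frame, state["anchor"])
    for keyword, task in ROUTES.items():
        if keyword in query.lower():
            task(state, frame)
            break
    return state

state = {}
handle_query("please segment the drill", frame=None, state=state)
handle_query("show the overlay", frame=None, state=state)
```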
What carries the argument
Interactive segmentation of the surgical instrument used as a spatial anchor for autonomous tracking and monocular pose estimation on live video.
If this is right
- Tool-tip position can be recovered to a mean absolute error of 2.32 mm in the camera frame.
- Inter-frame yaw and pitch propagation discrepancies remain at 0.18° and 0.21° on average (one way to compute both error metrics is sketched after this list).
- Tool segmentation and anatomy registration finish in approximately two minutes.
- Real-time anatomical overlays become available for image guidance without external trackers.
- Setup complexity is reduced relative to conventional optical-tracking workflows.
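A minimal sketch of how the two headline metrics can be computed, using numpy on hypothetical estimate/reference arrays rather than the paper's data. The inter-frame discrepancy is read here as the per-frame-pair difference between the propagated and reference angle changes, which is one plausible interpretation of the reported quantity.

```python
import numpy as np

# Hypothetical data: N frames of estimated vs. reference (optical-tracker) values.
rng = np.random.default_rng(0)
tip_est = rng.normal(size=(100, 3))                        # estimated tool-tip positions, mm
tip_ref = tip_est + rng.normal(scale=2.0, size=(100, 3))   # reference positions, mm
yaw_est, yaw_ref = rng.normal(size=100), rng.normal(size=100)      # degrees
pitch_est, pitch_ref = rng.normal(size=100), rng.normal(size=100)  # degrees

# Mean absolute tool-tip position error: per-frame Euclidean distance, then mean/std.
tip_err = np.linalg.norm(tip_est - tip_ref, axis=1)
print(f"tool-tip error: {tip_err.mean():.2f} ± {tip_err.std():.2f} mm")

# Inter-frame propagation discrepancy: difference between the frame-to-frame change
# of the estimated angle and that of the reference angle.
def interframe_discrepancy(est, ref):
    return np.abs(np.diff(est) - np.diff(ref))

for name, est, ref in [("yaw", yaw_est, yaw_ref), ("pitch", pitch_est, pitch_ref)]:
    d = interframe_discrepancy(est, ref)
    print(f"{name} propagation discrepancy: {d.mean():.2f} ± {d.std():.2f} deg")
```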
Where Pith is reading between the lines
- The same instrument-anchor approach could be applied to other endoscopic procedures that already rely on video as the primary view.
- Voice-driven task triggering may lower the need for assistants to operate separate navigation consoles.
- Performance under prolonged occlusion or rapid camera motion remains an open test that would directly affect clinical adoption.
- If the anchor remains stable, downstream monocular pose estimates could be chained to produce continuous 3-D guidance without recalibration (a pose-composition sketch follows this list).
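The chaining idea in the last bullet amounts to composing per-frame relative poses into a cumulative tool pose. A minimal sketch with 4×4 homogeneous transforms; the relative poses are synthetic placeholders, not outputs of the paper's estimator.

```python
import numpy as np

def rel_pose(yaw_rad, t_xyz):
    """Homogeneous transform for a small yaw rotation plus a translation (illustrative)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t_xyz
    return T

# Chain frame-to-frame estimates into a cumulative tool pose in the camera frame.
pose = np.eye(4)
for _ in range(5):  # five hypothetical inter-frame estimates
    pose = pose @ rel_pose(yaw_rad=0.003, t_xyz=(0.1, 0.0, 0.05))

print(np.round(pose, 3))  # accumulated pose after five frames
```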
Load-bearing premise
Interactive segmentation of the instrument remains reliable and subsequent tracking stays robust under real intraoperative conditions including variable lighting, blood, smoke, and tissue motion.
What would settle it
A recorded surgery sequence in which blood or smoke appears and the measured tool-tip position error exceeds 5 mm or tracking is lost for more than a few frames.
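That criterion is straightforward to make operational. A minimal sketch, assuming per-frame tip-error measurements against a reference tracker and a per-frame tracking flag; the 5 mm threshold comes from the criterion above, while "more than a few frames" is fixed here at three consecutive frames as an illustrative choice.

```python
import numpy as np

def check_failure(tip_err_mm, tracked, err_thresh_mm=5.0, max_lost_frames=3):
    """Return True if the sequence violates the stated criterion: any frame's
    tool-tip error exceeds the threshold, or tracking is lost for more than
    `max_lost_frames` consecutive frames."""
    tip_err_mm = np.asarray(tip_err_mm, dtype=float)
    tracked = np.asarray(tracked, dtype=bool)
    if np.any(tip_err_mm > err_thresh_mm):
        return True
    lost_run = 0
    for ok in tracked:
        lost_run = 0 if ok else lost_run + 1
        if lost_run > max_lost_frames:
            return True
    return False

# Hypothetical sequence: errors stay under 5 mm, but tracking drops for four frames.
errors = [2.1, 2.4, 3.0, 2.8, 2.5, 2.2, 2.9, 3.1]
flags  = [True, True, False, False, False, False, True, True]
print(check_failure(errors, flags))  # True: lost for 4 > 3 consecutive frames
```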
Figures
read the original abstract
We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and image guidance through real-time anatomical overlays. We evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Across three experimental trials, the hybrid vision-based method achieved a mean absolute tool-tip position error of 2.32 Plus/Minus 1.10 mm in the camera frame, with inter-frame yaw and pitch propagation discrepancies of 0.18 Plus/Minus 0.25{\deg} and 0.21 Plus/Minus 0.30{\deg}, respectively. The system completes tool segmentation and anatomy registration within approximately two minutes, substantially reducing setup complexity relative to conventional tracking workflows. These results demonstrate that speech-guided embodied agents can provide accurate spatial guidance while improving workflow integration and enabling rapid deployment of video-guided surgical systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a speech-guided embodied agent framework for video-guided skull-base surgery that integrates natural language interaction with real-time visual perception on intraoperative video streams. The system performs interactive segmentation and labeling of the surgical instrument, uses it as a spatial anchor for autonomous tracking, anatomical segmentation, preoperative 3D model registration, monocular tool pose estimation, and real-time anatomical overlays. It reports a mean absolute tool-tip position error of 2.32 ± 1.10 mm and inter-frame yaw/pitch discrepancies of 0.18 ± 0.25° and 0.21 ± 0.30° from three experimental trials benchmarked against a commercial optical tracker, with setup completed in approximately two minutes.
Significance. If the reported accuracy holds under realistic intraoperative variability, the approach could meaningfully reduce hardware dependencies and setup time compared to conventional optical tracking systems by operating purely on video and speech. The direct empirical comparison to an external optical tracker provides a concrete, falsifiable baseline. However, the preliminary nature of the evaluation limits the strength of claims about workflow integration and reliability in live surgery.
major comments (2)
- [Evaluation] Evaluation section: Performance metrics are derived from only three experimental trials with no reported details on trial conditions, exclusion criteria, statistical analysis, success/failure rates, or robustness under variable lighting, blood, smoke, or tissue motion. This directly undermines extrapolation of the 2.32 ± 1.10 mm tool-tip error and sub-degree angular discrepancies to intraoperative use, as noted in the weakest assumption.
- [Methods] Methods section: The pipeline treats interactive segmentation as a reliable spatial anchor for all downstream tasks (tracking, registration, pose estimation), yet no quantitative metrics (e.g., Dice scores, failure rates, or inter-operator variability) are provided for the segmentation step itself.
minor comments (2)
- [Abstract] Abstract: Replace 'Plus/Minus' with the standard ± symbol and ensure consistent formatting for the angular units (e.g., °).
- [Abstract] Abstract: Clarify whether the reported 'inter-frame yaw and pitch propagation discrepancies' represent cumulative drift, per-frame error, or another quantity, and specify the reference frame.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important limitations in our preliminary evaluation. We agree that the current results are based on a small number of trials and will revise the manuscript to provide greater transparency on experimental details, limitations, and the segmentation component while maintaining the focus on the video-only, speech-guided approach.
read point-by-point responses
- Referee: [Evaluation] Evaluation section: Performance metrics are derived from only three experimental trials with no reported details on trial conditions, exclusion criteria, statistical analysis, success/failure rates, or robustness under variable lighting, blood, smoke, or tissue motion. This directly undermines extrapolation of the 2.32 ± 1.10 mm tool-tip error and sub-degree angular discrepancies to intraoperative use, as noted in the weakest assumption.
  Authors: We acknowledge that the evaluation is preliminary and limited to three controlled experimental trials. In the revised manuscript we will expand the Evaluation section to describe the trial conditions in detail, any exclusion criteria used, the statistical methods applied to the reported means and standard deviations, and observed success/failure rates. We will also add an explicit limitations paragraph stating that robustness to variable lighting, blood, smoke, or tissue motion was not tested and that the 2.32 mm accuracy applies only to the specific scenarios evaluated. These changes will prevent over-extrapolation while preserving the direct comparison to the optical tracker. Revision: yes.
- Referee: [Methods] Methods section: The pipeline treats interactive segmentation as a reliable spatial anchor for all downstream tasks (tracking, registration, pose estimation), yet no quantitative metrics (e.g., Dice scores, failure rates, or inter-operator variability) are provided for the segmentation step itself.
  Authors: We agree that quantitative characterization of the interactive segmentation step would strengthen the paper. In revision we will expand the Methods section with a dedicated description of the speech-guided segmentation process, including how the surgeon verifies and corrects the output in real time. Because Dice scores and inter-operator variability were not collected in the original experiments, we will instead report any observed segmentation failures or corrections from the three trials and discuss the interactive verification as a built-in safeguard. This will be framed as a limitation with a note that future work will include formal segmentation metrics. Revision: partial.
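Both the referee's second comment and the corresponding response invoke Dice scores as the standard overlap metric for segmentation. For reference, a minimal sketch of that computation on binary masks, using hypothetical arrays rather than data from the paper.

```python
import numpy as np

def dice_score(pred, ref, eps=1e-8):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    inter = np.logical_and(pred, ref).sum()
    return 2.0 * inter / (pred.sum() + ref.sum() + eps)

# Hypothetical instrument masks on a small frame.
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
ref  = np.zeros((8, 8), dtype=bool); ref[3:7, 3:7] = True
print(f"Dice = {dice_score(pred, ref):.2f}")  # overlap of two offset 4x4 squares
```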
Circularity Check
No circularity: purely empirical performance reporting with external benchmark
full rationale
The paper presents a system description and evaluates it through direct experimental trials benchmarked against a commercial optical tracker. No equations, parameter fitting, self-citations, or derivations are present that reduce any claimed result to its own inputs by construction. The reported tool-tip error and angular discrepancies are measured outcomes from three trials, not predictions derived from fitted models or prior self-referential claims. This is a standard empirical systems paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard computer-vision assumptions for segmentation and tracking hold under intraoperative video conditions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Across three experimental trials, the hybrid vision-based method achieved a mean absolute tool-tip position error of 2.32 ± 1.10 mm in the camera frame, with inter-frame yaw and pitch propagation discrepancies of 0.18 ± 0.25° and 0.21 ± 0.30°, respectively.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.