arxiv: 2605.14253 · v1 · pith:CONITIDRnew · submitted 2026-05-14 · 💻 cs.CV · cs.LG

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Harry Robertshaw, Yanghe Hao, Weiyuan Deng, Benjamin Jackson, S.M.Hadi Sadati, Nikola Fischer, Tom Vercauteren, Alejandro Granados, Thomas C. Booth

Pith reviewed 2026-05-15 02:54 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords catheter tip tracking · fluoroscopy segmentation · SegFormer · autonomous navigation · mechanical thrombectomy · real-time pipeline · deep learning · stroke intervention

The pith

A two-class SegFormer model tracks catheter tips in fluoroscopy at 4.44 mm mean error, supporting real-time robotic navigation for stroke treatment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a multi-threaded pipeline that reads fluoroscopy frames, runs segmentation inference, and applies post-processing to locate the catheter tip in real time. It tests U-Net, U-Net plus transformer, and SegFormer models in both two-class and three-class setups on manually labeled moderate-complexity videos. The two-class SegFormer reaches the lowest tip localization error at 4.44 mm while also beating prior segmentation benchmarks by up to five percent Dice. This accuracy supplies the live coordinates that reinforcement-learning controllers need for autonomous mechanical thrombectomy. The work positions the pipeline as a stable base rather than a finished clinical system.
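
As an illustration of the multi-threaded layout described above, here is a minimal sketch using only the Python standard library. It assumes three stages (frame reading, segmentation, tip extraction) joined by queues; the names `run_pipeline`, `stage`, `segment`, and `locate_tip` are hypothetical and are not taken from the paper, whose actual threading design is not given here.

```python
import queue
import threading

SENTINEL = object()  # unique end-of-stream marker


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


def run_pipeline(frames, segment, locate_tip, maxsize=4):
    """Read, segment, and post-process frames on three separate threads."""
    q_frames = queue.Queue(maxsize)  # bounded: a slow stage throttles the reader
    q_masks = queue.Queue(maxsize)
    q_tips = queue.Queue()           # unbounded: drained after the threads finish

    def reader():
        for frame in frames:
            q_frames.put(frame)
        q_frames.put(SENTINEL)

    workers = [
        threading.Thread(target=reader),
        threading.Thread(target=stage, args=(segment, q_frames, q_masks)),
        threading.Thread(target=stage, args=(locate_tip, q_masks, q_tips)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    tips = []
    while True:
        tip = q_tips.get()
        if tip is SENTINEL:
            return tips
        tips.append(tip)
```

Each stage runs on a single thread and the queues are first-in, first-out, so results come back in frame order. A live system would consume `q_tips` continuously rather than after joining; that is left out to keep the sketch short.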

Core claim

The central claim is that the two-class SegFormer segmentation model, placed inside a multi-threaded pipeline and followed by two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path tracking with contour fallback, delivers 4.44 mm mean absolute tip error on moderate-complexity fluoroscopic video and exceeds earlier CathAction segmentation scores.
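
The rules behind the two-step component filtering are not spelled out in this summary. The sketch below assumes one plausible reading: step one discards connected components smaller than a minimum area, and step two keeps only the largest survivor. The function names and the `min_area` threshold are illustrative assumptions, not the authors' values.

```python
from collections import deque


def connected_components(mask):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            todo, comp = deque([(r, c)]), []
            while todo:  # breadth-first flood fill from the seed pixel
                y, x = todo.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            todo.append((ny, nx))
            components.append(comp)
    return components


def filter_components(mask, min_area=3):
    """Assumed step 1: drop blobs below min_area. Assumed step 2: keep the largest."""
    survivors = [c for c in connected_components(mask) if len(c) >= min_area]
    cleaned = [[0] * len(mask[0]) for _ in mask]
    if survivors:
        for y, x in max(survivors, key=len):
            cleaned[y][x] = 1
    return cleaned
```

The input is a non-empty binary mask given as a list of rows. An all-background mask comes back unchanged, which a caller could treat as "no device detected".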

What carries the argument

The SegFormer transformer segmentation network combined with the described two-step filtering and greedy path-following post-processing that converts pixel masks into a stable tip coordinate.
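
To make the mask-to-coordinate step concrete, here is a hedged sketch of greedy path following on a one-pixel-wide skeleton. It assumes the skeleton is already available as a set of `(row, col)` pixels and that a start pixel is known, for example where the device enters the image. Neither the starting rule nor the contour fallback is described in this summary, so the fallback is not implemented and `follow_skeleton` is a hypothetical name.

```python
import math


def follow_skeleton(skeleton, start):
    """Greedily walk the skeleton from start; return (path, arc_length_px)."""
    pixels = set(skeleton)
    if start not in pixels:
        raise ValueError("start must lie on the skeleton")
    path, length = [start], 0.0
    visited = {start}
    y, x = start
    while True:
        # Unvisited 8-neighbours of the current pixel, paired with step length.
        candidates = [
            (math.hypot(dy, dx), (y + dy, x + dx))
            for dy in (-1, 0, 1)
            for dx in (-1, 0, 1)
            if (dy or dx)
            and (y + dy, x + dx) in pixels
            and (y + dy, x + dx) not in visited
        ]
        if not candidates:
            return path, length  # dead end: the last pixel is the tip estimate
        step, (y, x) = min(candidates)  # greedy choice: shortest step first
        visited.add((y, x))
        path.append((y, x))
        length += step
```

The last element of `path` serves as the tip estimate and `length` is the accumulated arc length in pixels. Preferring the shortest step means the walk takes an axis-aligned neighbour before a diagonal one, so it does not cut corners on L-shaped bends.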

If this is right

Live tip coordinates become available at video rate for reinforcement-learning controllers that steer catheters autonomously.
Segmentation Dice scores improve by as much as five percent over the previous CathAction benchmark under the same three-class task.
The multi-threaded design keeps the full pipeline fast enough for closed-loop robotic control.
Stable tracking under noise and partial occlusion removes one barrier to wider deployment of robotic thrombectomy systems.
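
For readers unfamiliar with the Dice score cited above, the following sketch computes it per class from two label maps given as lists of rows. The function name and the convention of returning 1.0 when a class is absent from both maps are assumptions made for illustration. Whether the reported five percent is an absolute or a relative gain is not stated here.

```python
def dice_score(pred, target, cls):
    """Dice = 2|P ∩ T| / (|P| + |T|) for one class label over two label maps."""
    inter = pred_count = target_count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_count += p == cls
            target_count += t == cls
            inter += (p == cls) and (t == cls)
    if pred_count + target_count == 0:
        return 1.0  # class absent from both maps: treated as agreement (a convention)
    return 2.0 * inter / (pred_count + target_count)
```

In a three-class setup the score would typically be reported per class or averaged across classes; which of these the benchmark comparison uses is not specified in this summary.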

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline structure could be reused for tracking other endovascular tools once new labeled data for those devices is collected.
Pairing the tracker with simulated training environments would let reinforcement-learning policies be tested before any patient use.
If tip error stays low across more device sizes, the method could support navigation in additional minimally invasive procedures beyond stroke.

Load-bearing premise

The moderate-complexity labeled videos plus the fixed post-processing rules will continue to work when imaging contrast drops or device types and occlusion patterns change in real procedures.

What would settle it

Running the same pipeline on a set of heavily occluded or low-contrast clinical fluoroscopy sequences and measuring tip error above 10 mm would show the accuracy claim does not hold under broader conditions.
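
A check like this needs pixel errors converted to millimetres. The sketch below assumes tip positions are `(x, y)` pixel coordinates and that a single isotropic pixel spacing `mm_per_px` is known from the imaging setup. It returns the mean absolute error along x, along y, and as Euclidean distance. The function name and the spacing value in the example are hypothetical.

```python
import math


def tip_mae_mm(pred_px, true_px, mm_per_px):
    """Mean absolute tip error in mm: along x, along y, and Euclidean (x, y)."""
    if len(pred_px) != len(true_px) or not pred_px:
        raise ValueError("need equally sized, non-empty coordinate lists")
    n = len(pred_px)
    err_x = sum(abs(px - tx) for (px, _), (tx, _) in zip(pred_px, true_px))
    err_y = sum(abs(py - ty) for (_, py), (_, ty) in zip(pred_px, true_px))
    err_xy = sum(math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(pred_px, true_px))
    return tuple(mm_per_px * e / n for e in (err_x, err_y, err_xy))


# Example with made-up coordinates and an assumed spacing of 0.5 mm per pixel.
mae_x, mae_y, mae_xy = tip_mae_mm([(10, 10), (20, 24)], [(13, 14), (20, 20)], 0.5)
exceeds_threshold = mae_xy > 10.0  # the 10 mm criterion proposed above
```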

Figures

Figures reproduced from arXiv: 2605.14253 by Alejandro Granados, Benjamin Jackson, Harry Robertshaw, Nikola Fischer, S.M.Hadi Sadati, Thomas C. Booth, Tom Vercauteren, Weiyuan Deng, Yanghe Hao.

**Figure 2.** In vitro offline RGB dataset results for binary and multi-class segmentation. Instrument tip annotations in the first row have been enhanced for visualization purposes. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]

**Figure 3.** In vitro live RGB dataset results. Instrument tip annotations in the first row have been enhanced for visualization purposes. Misclassifications are circled in red. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** In vivo fluoroscopic dataset for G1 (high complexity) example. Instrument tip annotations in the first row have been enhanced for visualization purposes. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

Original abstract

Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion.

Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back.

Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-class segmentation.

Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SegFormer gets 4.44 mm MAE on moderate fluoroscopy data and beats the cited baseline slightly, but the robustness claims rest on untested conditions.

The key takeaway is that a SegFormer model tracks catheter tips at 4.44 mm mean absolute error on moderate-complexity fluoroscopy data and edges out the cited prior work. The paper applies the SegFormer architecture to segment fluoroscopic frames, then uses component filtering, medial skeletonization, and greedy path following to extract the tip position. It compares SegFormer with U-Net and a U-Net+Transformer hybrid: the two-class SegFormer has the lowest tip error on their test set, and the three-class segmentation results exceed CathAction by up to 5% Dice.

This is solid applied work. The multi-threaded setup targets real-time use, which fits the goal of supporting robotic MT systems. The numbers are concrete and the post-processing steps are described clearly enough to implement.

The main limitation is the test data. Results are reported only for moderate-complexity videos. The abstract claims the system handles challenging conditions, but no MAE or Dice figures are given for heavy occlusion, low contrast, or different devices. Without those, the generalization claim rests on the moderate-complexity data alone. Also, no training details or ablations of the post-processing are included, so it is unclear how much each part contributes.

This kind of paper suits readers working on image-guided robotics or interventional imaging who need a baseline tracker. It gives something concrete to cite or extend. I would recommend sending it to peer review. The core benchmark is new and measurable, and referees can push for more diverse testing.

Referee Report

3 major / 1 minor

Summary. The paper presents a multi-threaded real-time pipeline for catheter tip tracking in fluoroscopic images to support RL-based autonomous navigation in mechanical thrombectomy. It benchmarks U-Net, U-Net+Transformer, and SegFormer segmentation models under two-class and three-class formulations, applies post-processing (two-step component filtering, one-pixel medial skeletonization, greedy arc-length path following with contour fallback), and reports a mean absolute error of 4.44 mm for two-class SegFormer on manually labeled moderate-complexity video data, outperforming the other models and prior CathAction segmentation benchmarks by up to +5% Dice.

Significance. If the reported accuracy holds and generalizes, the work supplies a concrete, deployable tracking module that could enable closed-loop RL control for robotic thrombectomy systems, addressing geographic disparities in stroke care. The direct empirical comparison on held-out labeled frames and the explicit multi-threaded timing considerations are positive elements; however, the absence of quantitative results on heavy occlusion or device variation substantially reduces the immediate translational value.

major comments (3)

[Abstract / Conclusion] The statement that the pipeline 'maintains stable performance under challenging imaging conditions' is unsupported; all MAE (4.44 mm) and Dice figures are reported exclusively on moderate-complexity manually labeled data, with no quantitative results (MAE, Dice, or failure rate) supplied for heavy occlusion, low-contrast frames, or alternate catheter geometries.
[Results] No training details, data-split statistics, number of frames or patients, cross-validation scheme, or error bars accompany the reported 4.44 mm MAE and Dice scores, so the statistical reliability of the claim that two-class SegFormer outperforms U-Net (4.60 mm) and U-Net+Transformer (6.20 mm) cannot be assessed.
[Methods / Results] The contribution of the heuristic post-processing steps (two-step component filtering, medial skeletonization, greedy arc-length tracking) is not isolated by ablation; it is therefore unclear whether the 4.44 mm MAE is driven by the SegFormer backbone or by the tuned post-processing pipeline.
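
One way to supply the error bars requested in the second comment is a percentile bootstrap over per-frame tip errors. The sketch below is illustrative only: `bootstrap_mean_ci` is a hypothetical helper, and the error values in the usage lines are made up. Resampling frames independently ignores correlation between neighbouring frames of the same video, so resampling whole videos or patients may be more appropriate.

```python
import random
import statistics


def bootstrap_mean_ci(errors, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper): a percentile bootstrap CI for the mean error."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(errors)
    means = sorted(
        statistics.fmean(rng.choices(errors, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(errors), lower, upper


# Usage with made-up per-frame errors in mm (not data from the paper).
mean_err, ci_low, ci_high = bootstrap_mean_ci([3.0, 4.0, 5.0, 6.0, 4.5, 3.5])
```

Two models whose intervals overlap heavily, such as a hypothetical 4.44 mm versus 4.60 mm pair on a small test set, would not be clearly separated by this measure.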

minor comments (1)

[Figures] Figure captions and axis labels should explicitly state the number of frames and the definition of 'moderate complexity' used for the reported metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting key limitations in our evaluation and reporting. We will revise the manuscript to qualify our claims, add the requested experimental details, and include an ablation study. These changes will strengthen the paper without altering its core contributions.

Referee: [Abstract / Conclusion] The statement that the pipeline 'maintains stable performance under challenging imaging conditions' is unsupported; all MAE (4.44 mm) and Dice figures are reported exclusively on moderate-complexity manually labeled data, with no quantitative results (MAE, Dice, or failure rate) supplied for heavy occlusion, low-contrast frames, or alternate catheter geometries.

Authors: We agree that the quantitative results are reported only on moderate-complexity data and do not support the broad claim of stable performance under challenging conditions. We will revise the abstract and conclusion to accurately describe the 4.44 mm MAE on moderate-complexity data and add a limitations paragraph in the discussion acknowledging the lack of results on heavy occlusion, low-contrast frames, and alternate geometries.

revision: yes

Referee: [Results] No training details, data-split statistics, number of frames or patients, cross-validation scheme, or error bars accompany the reported 4.44 mm MAE and Dice scores, so the statistical reliability of the claim that two-class SegFormer outperforms U-Net (4.60 mm) and U-Net+Transformer (6.20 mm) cannot be assessed.

Authors: We will expand the results section to include full training hyperparameters, dataset statistics (number of frames and patients), the train/validation/test split details, any cross-validation procedure, and error bars or standard deviations for all reported MAE and Dice scores to enable proper assessment of statistical reliability.

revision: yes

Referee: [Methods / Results] The contribution of the heuristic post-processing steps (two-step component filtering, medial skeletonization, greedy arc-length tracking) is not isolated by ablation; it is therefore unclear whether the 4.44 mm MAE is driven by the SegFormer backbone or by the tuned post-processing pipeline.

Authors: We agree that an ablation study is needed to isolate the post-processing contribution. We will add results comparing the SegFormer model with and without the post-processing pipeline (two-step filtering, skeletonization, and arc-length tracking) to clarify the relative impact of each component on the final 4.44 mm MAE.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmarking on held-out labels

The paper reports straightforward training and evaluation of segmentation models (U-Net, SegFormer, etc.) on manually-labeled moderate-complexity fluoroscopic video, with post-processing steps applied to produce tip coordinates whose error is measured directly against ground truth. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations. The central result (MAE 4.44 mm) is an independent measurement on test data; generalization statements to heavy occlusion are unsupported but do not create circularity in the reported chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the manually labeled moderate-complexity dataset is representative and that the heuristic post-processing reliably converts segmentation masks into tip coordinates; no new physical entities or unstated mathematical axioms are introduced.

free parameters (1)

neural network weights
All segmentation models are trained on the provided labeled data; the fitted weights are not reported and constitute the primary learned parameters.

axioms (1)

domain assumption: Manually labeled tip positions in moderate-complexity fluoroscopy accurately reflect true device locations under clinical conditions
Used both for training supervision and for computing the reported MAE.

pith-pipeline@v0.9.0 · 5607 in / 1400 out tokens · 39395 ms · 2026-05-15T02:54:28.055785+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1] Ashutosh P. Jadhav, Shashvat M. Desai, and Tudor G. Jovin. Indications for mechanical thrombectomy for acute ischemic stroke: Current guidelines and beyond. Neurology, 97, 2021.

[2] Jeffrey L Saver, Mayank Goyal, Aad van der Lugt, Charles B L M Majoie, Bijoy K Menon, Diederik W Dippel, Bruce C Campbell, Raul G Nogueira, Andrew M Demchuk, Alejandro Tomasello, Pere Cardona, Thomas G Devlin, Donald F Frei, Richard du Mesnil de Rochemont, Olvert A Berkhemer, Tudor G Jovin, Adnan H Siddiqui, Wim H van Zwam, Stephen M Davis, Carlos Casta˜... Time to treatment with endovascular thrombectomy and outcomes from ischemic stroke: A meta-analysis. JAMA - Journal of the American Medical Association, 316:1279–1288, 9 2016.

[3] Sentinel Stroke National Audit Programme. SSNAP annual report 2024. Healthcare Quality Improvement Partnership, 11 2024. Accessed 5 March 2025.

[4] O. A. Berkhemer, P. S. S. Fransen, D. Beumer, L. A. van den Berg, H. F. Lingsma, A. J. Yoo, W. J. Schonewille, J. A. Vos, P. J. Nederkoorn, M. J. H. Wermer, M. A. A. van Walderveen, J. Staals, J. Hofmeijer, J. A. van Oostayen, G. J. Lycklama a Nijeholt, J. Boiten, P. A. Brouwer, B. J. Emmer, S. F. de Bruijn, L. C. van Dijk, L. J. Kappelle, R. H. Lo, E. J... 2015.

[5] Lloyd W Klein, Donald L Miller, Stephen Balter, Warren Laskey, David Haines, Alexander Norbash, Matthew A. Mauro, and James A. Goldstein. Occupational health hazards in the interventional laboratory: Time for a safer environment. Society of Interventional Radiology, 250:538–544, 2 2009.

[6] Ryan D. Madder, Stacie VanOosterhout, Abbey Mulder, Matthew Elmore, Jessica Campbell, Andrew Borgman, Jessica Parker, and David Wohns. Impact of robotics and a suspended lead suit on physician radiation exposure during percutaneous coronary intervention. Cardiovascular Revascularization Medicine, 18:190–196, 4 2017.

[7] William Crinnion, Ben Jackson, Avnish Sood, Jeremy Lynch, Christos Bergeles, Hongbin Liu, Kawal Rhode, Vitor Mendes Pereira, and Thomas C Booth. Robotics in neurointerventional surgery: a systematic review of the literature. Journal of neurointerventional surgery, 14:539–545, 6 2022.

[8] Mohammad Mofatteh. Neurosurgery and artificial intelligence. AIMS Neuroscience, 8:477–495, 2021.

[9] Harry Robertshaw, Lennart Karstensen, Benjamin Jackson, Alejandro Granados, and Thomas C. Booth. Autonomous navigation of catheters and guidewires in mechanical thrombectomy using inverse reinforcement learning. Int J CARS, 6 2024.

[10] Harry Robertshaw, Benjamin Jackson, Jiaheng Wang, Hadi Sadati, Lennart Karstensen, Alejandro Granados, and Thomas C. Booth. Reinforcement learning for safe autonomous two-device navigation of cerebral vessels in mechanical thrombectomy. Int J CARS, 2025.

[11] Harry Robertshaw, Lennart Karstensen, Benjamin Jackson, Hadi Sadati, Kawal Rhode, Sebastien Ourselin, Alejandro Granados, and Thomas C. Booth. Artificial intelligence in the autonomous navigation of endovascular interventions: a systematic review. Frontiers in Human Neuroscience, 2023.

[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention 2015, pages 234–241, 2015.

[13] Pierre Ambrosini, D Ruijters, Wiro Niessen, Adriaan Moelker, and Theo van Walsum. Fully automatic and real-time catheter segmentation in x-ray fluoroscopy. In Medical Image Computing and Computer-Assisted Intervention 2017, 2017.

[14] Anh Nguyen, Dennis Kundrat, Giulio Dagnino, Wenqiang Chi, Mohamed E. M. K. Abdelaziz, Yao Guo, YingLiang Ma, Trevor M. Y. Kwok, Celia Riga, and Guang-Zhong Yang. End-to-end real-time catheter segmentation with optical flow-guided warping during endovascular intervention. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 99...

[15] Athanasios Vlontzos and Krystian Mikolajczyk. Deep segmentation and registration in x-ray angiography video. In British Machine Vision Conference 2018, page 267, 2018.

[16] Marc Demoustier, Yue Zhang, Venkatesh Murthy, Florin Ghesu, and Dorin Comaniciu. Contrack: Contextual transformer for device tracking in x-ray. In Medical Image Computing and Computer-Assisted Intervention 2023, pages 679–688, 2023.

[17] Baoru Huang, Tuan Vo, Chayun Kongtongvattana, Giulio Dagnino, Dennis Kundrat, Wenqiang Chi, Mohamed Abdelaziz, Trevor Kwok, Tudor Jianu, Tuong Do, Hieu Le, Minh Nguyen, Hoan Nguyen, Erman Tjiputra, Quang Tran, Jianyang Xie, Yanda Meng, Binod Bhattarai, Zhaorui Tan, Hongbin Liu, Hong Seng Gan, Wei Wang, Xi Yang, Qiufeng Wang, Jionglong Su, Kaizhu Huang, A... Cathaction: A benchmark for endovascular intervention understanding. 2024.

[18] Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, Matthew P. Lungren, Shaoting Zhang, Lei Xing, Le Lu, Alan Yuille, and Yuyin Zhou. Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers. Medical Image Analysis, 97, 10 2024.

[19] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In Computer Vision – ECCV 2022 Workshops, pages 205–218, 2023.

[20] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, and Yifan Liu. Segvit: Semantic segmentation with plain vision transformers. In 36th Conference on Neural Information Processing Systems, 10 2022.

[21] Tudor Jianu, Baoru Huang, Minh Nhat Vu, Mohamed E. M. K. Abdelaziz, Sebastiano Fichera, Chun-Yi Lee, Pierre Berthet-Rayne, Ferdinando Rodriguez y Baena, and Anh Nguyen. Cathsim: An open-source simulator for endovascular intervention. IEEE Transactions on Medical Robotics and Bionics, 2024.

[22] Marta Gherardini, Evangelos Mazomenos, Arianna Menciassi, and Danail Stoyanov. Catheter segmentation in x-ray fluoroscopy using synthetic data and transfer learning with light u-nets. Computer Methods and Programs in Biomedicine, 192:105420, 2020.

[23] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.