Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy
Pith reviewed 2026-05-15 02:54 UTC · model grok-4.3
The pith
A two-class SegFormer model tracks catheter tips in fluoroscopy with a mean absolute error of 4.44 mm on moderate-complexity video, a step toward real-time robotic navigation for stroke treatment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a two-class SegFormer segmentation model, placed inside a multi-threaded pipeline and followed by two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path tracking with contour fallback, delivers a 4.44 mm mean absolute tip error on moderate-complexity fluoroscopic video. A second claim is that the system's segmentation results exceed earlier CathAction Dice scores.
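The summary names "two-step component filtering" without saying what the two steps are. The sketch below is one plausible reading, offered as an assumption and not as the authors' rule: step one discards connected components below a minimum area, and step two keeps the largest survivor. The function names, the 8-connectivity choice, and the `min_area` threshold are all hypothetical.

```python
from collections import deque

def connected_components(mask):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            seen.add((r, c))
            todo, comp = deque([(r, c)]), []
            while todo:
                y, x = todo.popleft()
                comp.append((y, x))
                # Visit the 8 neighbours that are foreground and unseen.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            todo.append((ny, nx))
            components.append(comp)
    return components

def filter_components(mask, min_area=3):
    """Assumed two-step filter: drop small blobs, then keep the largest one."""
    large = [c for c in connected_components(mask) if len(c) >= min_area]
    return max(large, key=len) if large else []

mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
]
device = filter_components(mask)  # the 4-pixel vertical segment in column 3
```

A real implementation would more likely use a library routine for labelling; the pure-Python version here only keeps the example self-contained.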
What carries the argument
The SegFormer transformer segmentation network combined with the described two-step filtering and greedy path-following post-processing that converts pixel masks into a stable tip coordinate.
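To make the mask-to-coordinate step concrete, here is a minimal sketch of greedy path following along a one-pixel skeleton. It assumes the skeleton is already available as a list of `(row, col)` pixels and that a start pixel (for example, where the device enters the frame) is known. Both assumptions, the function name `greedy_tip`, and the nearest-neighbour tie-breaking are illustrative; the paper's contour fallback is not modelled.

```python
import math

def greedy_tip(skeleton, start):
    """Walk a one-pixel-wide skeleton from `start`; return (tip, arc_length_px)."""
    pixels = set(skeleton)
    current, visited, length = start, {start}, 0.0
    while True:
        r, c = current
        # Unvisited 8-connected neighbours that lie on the skeleton.
        candidates = [
            (r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and (r + dr, c + dc) in pixels
            and (r + dr, c + dc) not in visited
        ]
        if not candidates:
            # Nowhere left to go: the current pixel is treated as the tip.
            return current, length
        # Greedy choice: step to the closest neighbour (axial before diagonal).
        nxt = min(candidates, key=lambda p: math.dist(p, current))
        length += math.dist(nxt, current)
        visited.add(nxt)
        current = nxt

skeleton = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2)]
tip, arc_length = greedy_tip(skeleton, start=(0, 0))
```

Because the walk never backtracks, a branch in the skeleton can send it down the wrong arm, which is presumably one reason a fallback is needed in practice.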
If this is right
- Live tip coordinates become available at video rate for reinforcement-learning controllers that steer catheters autonomously.
- Segmentation Dice scores improve by as much as five percent over the previous CathAction benchmark under the same three-class task.
- The multi-threaded design keeps the full pipeline fast enough for closed-loop robotic control.
- Stable tracking under noise and partial occlusion removes one barrier to wider deployment of robotic thrombectomy systems.
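For readers unfamiliar with the Dice score used in the segmentation comparison, a minimal sketch for binary masks follows. It uses the standard definition 2|A∩B| / (|A| + |B|); returning 1.0 when both masks are empty is a convention chosen here, not something the source specifies.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as nested lists."""
    intersection = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            pred_sum += p
            target_sum += t
    denominator = pred_sum + target_sum
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / denominator

score = dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # 2*1 / (2+1)
```

For a multi-class task such as the three-class formulation, the same function would typically be applied per class and the results averaged.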
Where Pith is reading between the lines
- The same pipeline structure could be reused for tracking other endovascular tools once new labeled data for those devices is collected.
- Pairing the tracker with simulated training environments would let reinforcement-learning policies be tested before any patient use.
- If tip error stays low across more device sizes, the method could support navigation in additional minimally invasive procedures beyond stroke.
Load-bearing premise
A model trained and evaluated on moderate-complexity labeled videos, together with the fixed post-processing rules, will continue to work when imaging contrast drops or when device types and occlusion patterns change in real procedures.
What would settle it
Running the same pipeline on a set of heavily occluded or low-contrast clinical fluoroscopy sequences and measuring tip error above 10 mm would show the accuracy claim does not hold under broader conditions.
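The check described above could be scripted roughly as follows. This sketch assumes that tip error is the Euclidean distance between predicted and labeled tip pixels, converted to millimetres with a known pixel spacing; the paper's exact error definition and spacing are not given here, so `mm_per_pixel` and the function names are hypothetical.

```python
import math

def tip_errors_mm(predicted, ground_truth, mm_per_pixel):
    """Per-frame Euclidean tip error in millimetres."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    return [math.dist(p, g) * mm_per_pixel for p, g in zip(predicted, ground_truth)]

def mean_tip_error_mm(predicted, ground_truth, mm_per_pixel):
    """Mean of the per-frame tip errors."""
    errors = tip_errors_mm(predicted, ground_truth, mm_per_pixel)
    return sum(errors) / len(errors)

predicted = [(0, 0), (3, 4)]
labeled = [(0, 0), (0, 0)]
mae = mean_tip_error_mm(predicted, labeled, mm_per_pixel=0.5)
exceeds_threshold = mae > 10.0  # the 10 mm bar proposed above
```

Reporting the per-frame list alongside the mean would also expose outright tracking failures that an average can hide.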
Original abstract
Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion.
Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fallback.
Results: On manually labeled moderate-complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-class segmentation.
Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.
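The four stages named in the Methods (frame reading, preprocessing, inference, post-processing) suggest a producer-consumer layout. The sketch below shows one common way to build that with Python's standard `threading` and `queue` modules: one thread per stage, joined by bounded queues. It is an assumption about the general shape, not the authors' implementation, and the stage functions are stand-ins.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down

def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(fn(item))

def run_pipeline(frames, stages, maxsize=4):
    """Run each stage in its own thread; bounded queues apply back-pressure."""
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(stages)
    ]

    def feed():
        # Stands in for the frame-reading stage.
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_STOP)

    workers.append(threading.Thread(target=feed, daemon=True))
    for worker in workers:
        worker.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results

# Placeholder stages: preprocess, inference, post-process.
tips = run_pipeline(range(20), [lambda f: f + 1, lambda f: f * 2, lambda f: (f, f)])
```

Each stage is a single thread reading from a FIFO queue, so frame order is preserved. In CPython, threads mainly help when stages release the interpreter lock, as video decoding and GPU inference libraries commonly do.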
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a multi-threaded real-time pipeline for catheter tip tracking in fluoroscopic images to support RL-based autonomous navigation in mechanical thrombectomy. It benchmarks U-Net, U-Net+Transformer, and SegFormer segmentation models under two-class and three-class formulations, applies post-processing (two-step component filtering, one-pixel medial skeletonization, greedy arc-length path following with contour fallback), and reports a mean absolute error of 4.44 mm for two-class SegFormer on manually labeled moderate-complexity video data, outperforming the other models and prior CathAction segmentation benchmarks by up to +5% Dice.
Significance. If the reported accuracy holds and generalizes, the work supplies a concrete, deployable tracking module that could enable closed-loop RL control for robotic thrombectomy systems, addressing geographic disparities in stroke care. The direct empirical comparison on held-out labeled frames and the explicit multi-threaded timing considerations are positive elements; however, the absence of quantitative results on heavy occlusion or device variation substantially reduces the immediate translational value.
major comments (3)
- [Abstract / Conclusion] Abstract and Conclusion: the statement that the pipeline 'maintains stable performance under challenging imaging conditions' is unsupported; all MAE (4.44 mm) and Dice figures are reported exclusively on moderate-complexity manually labeled data, with no quantitative results (MAE, Dice, or failure rate) supplied for heavy occlusion, low-contrast frames, or alternate catheter geometries.
- [Results] Results section: no training details, data-split statistics, number of frames or patients, cross-validation scheme, or error bars accompany the reported 4.44 mm MAE and Dice scores, so the statistical reliability of the claim that two-class SegFormer outperforms U-Net (4.60 mm) and U-Net+Transformer (6.20 mm) cannot be assessed.
- [Methods / Results] Methods / Results: the contribution of the heuristic post-processing steps (two-step component filtering, medial skeletonization, greedy arc-length tracking) is not isolated by ablation; it is therefore unclear whether the 4.44 mm MAE is driven by the SegFormer backbone or by the tuned post-processing pipeline.
minor comments (1)
- [Figures] Figure captions and axis labels should explicitly state the number of frames and the definition of 'moderate complexity' used for the reported metrics.
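The error bars requested in the Results comment could be produced with a percentile bootstrap over per-frame tip errors. The following is a minimal sketch under that assumption; the function name, resample count, and seed are illustrative. Frames from the same video are correlated, so resampling whole videos or patients would likely give more honest intervals than the frame-level resampling shown here.

```python
import random

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(values) / n, lower, upper

# Hypothetical per-frame errors in mm, for illustration only.
errors_mm = [3.1, 4.8, 5.2, 2.9, 6.0, 4.4, 3.7, 5.5]
mean, lower, upper = bootstrap_mean_ci(errors_mm)
```

Non-overlapping intervals for two models would give some support to a ranking such as 4.44 mm versus 4.60 mm; heavily overlapping ones would suggest the difference is within noise.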
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting key limitations in our evaluation and reporting. We will revise the manuscript to qualify our claims, add the requested experimental details, and include an ablation study. These changes will strengthen the paper without altering its core contributions.
- Referee: [Abstract / Conclusion] Abstract and Conclusion: the statement that the pipeline 'maintains stable performance under challenging imaging conditions' is unsupported; all MAE (4.44 mm) and Dice figures are reported exclusively on moderate-complexity manually labeled data, with no quantitative results (MAE, Dice, or failure rate) supplied for heavy occlusion, low-contrast frames, or alternate catheter geometries.
  Authors: We agree that the quantitative results are reported only on moderate-complexity data and do not support the broad claim of stable performance under challenging conditions. We will revise the abstract and conclusion to accurately describe the 4.44 mm MAE on moderate-complexity data and add a limitations paragraph in the discussion acknowledging the lack of results on heavy occlusion, low-contrast frames, and alternate geometries.
  Revision: yes
- Referee: [Results] Results section: no training details, data-split statistics, number of frames or patients, cross-validation scheme, or error bars accompany the reported 4.44 mm MAE and Dice scores, so the statistical reliability of the claim that two-class SegFormer outperforms U-Net (4.60 mm) and U-Net+Transformer (6.20 mm) cannot be assessed.
  Authors: We will expand the results section to include full training hyperparameters, dataset statistics (number of frames and patients), the train/validation/test split details, any cross-validation procedure, and error bars or standard deviations for all reported MAE and Dice scores to enable proper assessment of statistical reliability.
  Revision: yes
- Referee: [Methods / Results] Methods / Results: the contribution of the heuristic post-processing steps (two-step component filtering, medial skeletonization, greedy arc-length tracking) is not isolated by ablation; it is therefore unclear whether the 4.44 mm MAE is driven by the SegFormer backbone or by the tuned post-processing pipeline.
  Authors: We agree that an ablation study is needed to isolate the post-processing contribution. We will add results comparing the SegFormer model with and without the post-processing pipeline (two-step filtering, skeletonization, and arc-length tracking) to clarify the relative impact of each component on the final 4.44 mm MAE.
  Revision: yes
Circularity Check
No circularity: direct empirical benchmarking on held-out labels
The paper reports straightforward training and evaluation of segmentation models (U-Net, SegFormer, etc.) on manually-labeled moderate-complexity fluoroscopic video, with post-processing steps applied to produce tip coordinates whose error is measured directly against ground truth. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations. The central result (MAE 4.44 mm) is an independent measurement on test data; generalization statements to heavy occlusion are unsupported but do not create circularity in the reported chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights
axioms (1)
- domain assumption: Manually labeled tip positions in moderate-complexity fluoroscopy accurately reflect true device locations under clinical conditions