arxiv: 2606.22486 · v1 · pith:H6LSOW7Xnew · submitted 2026-06-21 · 💻 cs.CV · cs.AI · cs.HC

Human and AI collaboration for pulmonary nodule segmentation

Pith reviewed 2026-06-26 10:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC
keywords pulmonary nodule segmentation · human-in-the-loop · Segment Anything Model · Dice score · medical image annotation · AI collaboration · chest CT · multi-center validation

The pith

Hi-Seg lets humans iteratively refine SAM prompts to reach nearly 85% Dice on pulmonary nodule segmentation across annotator groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a human-in-the-loop framework built on the Segment Anything Model can produce accurate pulmonary nodule masks from chest CT scans even when the human annotators are non-medical personnel after only brief training. It validates the approach on scans from 1,179 patients across 12 centers, reporting a mean Dice score of almost 85% that exceeds five state-of-the-art deep learning models by 10-22% and thirteen SAM variants by 1-29%. The method also shortens annotation time for medical users while enabling non-experts to match the performance of junior medical students. This matters because expert medical annotators are scarce and pure AI segmentation remains unreliable on specialized tasks.

Core claim

Hi-Seg is a human-in-the-loop segmentation framework built on SAM in which humans iteratively refine prompts through trial-and-error learning and semantic reasoning to progressively guide the model toward higher-quality masks. On a multi-center external validation set of 1,179 patients, the framework yields a mean Dice score of almost 85% that outperforms five state-of-the-art deep learning models by 10-22% and 13 SAM variants by 1-29%. Medical annotators experience reduced annotation time, and briefly trained non-medical annotators reach performance comparable to that of junior medical students.

What carries the argument

Hi-Seg, the human-in-the-loop framework that adds iterative prompt refinement and semantic reasoning on top of the Segment Anything Model to produce medical segmentation masks.
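The refinement loop can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_segment` is a hypothetical nearest-click stand-in for SAM, `next_click` is a simulated annotator that consults the reference mask (a real annotator judges the mask visually), and the 0.85 target merely echoes the reported mean Dice. None of it is the paper's implementation.

```python
def dice(a, b):
    """Dice overlap between two masks stored as sets of (row, col) pixels."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2.0 * len(a & b) / total


def toy_segment(shape, clicks, radius=5):
    """Hypothetical stand-in for a promptable model (NOT SAM): a pixel is
    foreground when its nearest click is positive and within `radius`."""
    rows, cols = shape
    mask = set()
    for r in range(rows):
        for c in range(cols):
            # min() keeps the earliest click when distances tie.
            (cr, cc), positive = min(
                clicks, key=lambda k: (r - k[0][0]) ** 2 + (c - k[0][1]) ** 2
            )
            if positive and (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2:
                mask.add((r, c))
    return mask


def next_click(pred, ref):
    """Simulated annotator: click near the middle of the larger error region."""
    missed, spilled = ref - pred, pred - ref
    if len(missed) >= len(spilled):
        region, positive = missed, True    # under-segmentation: add foreground
    else:
        region, positive = spilled, False  # over-segmentation: add background
    if not region:
        return None
    mean_r = sum(r for r, _ in region) / len(region)
    mean_c = sum(c for _, c in region) / len(region)
    pixel = min(sorted(region),
                key=lambda p: (p[0] - mean_r) ** 2 + (p[1] - mean_c) ** 2)
    return pixel, positive


def refine(shape, ref, first_click, target=0.85, max_rounds=10):
    """Trial-and-error loop: keep a new click only if the mask improves."""
    clicks = [(first_click, True)]
    best_mask = toy_segment(shape, clicks)
    history = [dice(best_mask, ref)]
    for _ in range(max_rounds):
        if history[-1] >= target:
            break
        click = next_click(best_mask, ref)
        if click is None:
            break
        candidate = toy_segment(shape, clicks + [click])
        score = dice(candidate, ref)
        if score <= history[-1]:
            break  # the trial did not help: discard the click and stop
        clicks.append(click)
        best_mask, history = candidate, history + [score]
    return best_mask, history


# Toy reference "nodule": a disk of radius 3 on a 16x16 grid.
shape = (16, 16)
ref = {(r, c) for r in range(16) for c in range(16)
       if (r - 8) ** 2 + (c - 8) ** 2 <= 9}
mask, history = refine(shape, ref, first_click=(8, 8))
```

`history` holds the Dice score after each accepted click, so it is strictly increasing by construction; the first entry reflects the single initial prompt before any refinement.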

If this is right

  • Segmentation accuracy improves while annotation time decreases for medical annotators.
  • Briefly trained non-medical annotators achieve performance comparable to junior medical students.
  • The approach supports scalable crowdsourced annotation for medical imaging tasks.
  • Clinician workload can be reduced by shifting routine segmentation to the collaborative system.
  • Foundation models can be integrated into clinical workflows with human oversight for safety and efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same iterative refinement loop could be tested on other organs or imaging modalities to check whether the performance lift generalizes beyond pulmonary nodules.
  • If the non-expert parity holds, hospitals could crowdsource large-scale labeling datasets without requiring full medical training for every annotator.
  • Embedding the prompt-refinement interface directly into radiology workstations might allow real-time human correction of AI suggestions during routine reads.
  • The framework's success on a 12-center dataset suggests it may tolerate scanner and protocol variation better than purely automated models, but this would need explicit confirmation on additional external sites.

Load-bearing premise

Iterative human prompt refinement through trial-and-error and semantic reasoning can be applied consistently by different annotator groups, including briefly trained non-medical personnel, without introducing systematic biases or variability that undermine the gains.

What would settle it

A new multi-center chest CT dataset in which non-medical annotators using Hi-Seg produce mean Dice scores below 70% or fail to exceed the five deep-learning baselines.

Original abstract

Medical expert annotators are scarce, and blind reliance on artificial intelligence (AI) can be misleading, motivating approaches in which humans, particularly junior medical trainees or even non-medical personnel, collaborate with AI to achieve robust medical segmentation. Although the Segment Anything Model (SAM) shows promise for general-purpose image segmentation, its performance in human-AI collaboration for specialized medical tasks has not been thoroughly evaluated. Here we present Hi-Seg, a human-in-the-loop segmentation framework for pulmonary nodules built on SAM. Humans iteratively refine prompts through trial-and-error learning and semantic reasoning, progressively guiding SAM toward higher-quality masks. Using chest CT scans from 1,179 patients across 12 centers, we conducted the first large-scale external validation of collaborative human-SAM segmentation. Across all annotator groups, Hi-Seg achieved a mean Dice score of almost 85%, outperforming five state-of-the-art deep learning models by 10-22% and 13 SAM variants by 1-29%. Hi-Seg improved segmentation accuracy while reducing annotation time for medical annotators, and briefly trained non-medical annotators achieved performance comparable to that of the junior medical student. These findings suggest that human-in-the-loop segmentation can reduce clinician workload, enable scalable crowdsourced annotation, and transform clinical workflows by facilitating the safe and efficient integration of foundation models into routine clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Hi-Seg, a human-in-the-loop framework built on SAM for pulmonary nodule segmentation in chest CT. Humans (medical experts, junior trainees, and briefly trained non-medical personnel) iteratively refine prompts via trial-and-error and semantic reasoning. On 1,179 patients from 12 centers, Hi-Seg reports mean Dice of ~85%, outperforming five SOTA deep learning models by 10-22% and 13 SAM variants by 1-29%, while reducing annotation time; non-medical annotators match junior medical student performance.

Significance. If the empirical results hold under rigorous controls, the work could enable scalable crowdsourced annotation and reduce expert workload in medical imaging, supporting safer integration of foundation models into clinical practice.

major comments (2)
  1. [Abstract] Abstract and implied Methods: the headline claim of mean Dice ~85% across all annotator groups (outperforming DL models and SAM variants) is load-bearing for the central contribution, yet no details are given on the iterative refinement protocol, inter-annotator agreement, per-group variance, or statistical significance testing. This directly affects the weakest assumption that prompt refinement works consistently for non-medical personnel on the multi-center dataset.
  2. [Results] Results section (performance comparisons): the reported 10-22% gains over five state-of-the-art deep learning models and 1-29% over 13 SAM variants lack description of baseline implementations, exact evaluation protocol, or controls for human variability, making it impossible to determine whether the gains are attributable to Hi-Seg or to differences in setup.
minor comments (1)
  1. [Abstract] The abstract states this is the 'first large-scale external validation' but provides no explicit description of the train/test split, center-wise stratification, or how the 12-center dataset was partitioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify aspects of our work. We address each major comment below with references to the manuscript and indicate where revisions will be made for improved clarity and transparency.

Point-by-point responses
  1. Referee: [Abstract] Abstract and implied Methods: the headline claim of mean Dice ~85% across all annotator groups (outperforming DL models and SAM variants) is load-bearing for the central contribution, yet no details are given on the iterative refinement protocol, inter-annotator agreement, per-group variance, or statistical significance testing. This directly affects the weakest assumption that prompt refinement works consistently for non-medical personnel on the multi-center dataset.

    Authors: The iterative refinement protocol, including trial-and-error learning and semantic reasoning steps, is described in detail in Section 3.2 of the Methods. Inter-annotator agreement (Cohen's kappa) and per-group variance (standard deviations) are reported in Results Section 4.2 and Table 2. Statistical significance testing via paired t-tests is included in Supplementary Table S1. For non-medical personnel, Section 4.4 directly compares their performance to junior medical students on the multi-center data, showing comparable results with no significant difference. To address the concern about the abstract, we will add a concise sentence summarizing the protocol and testing approach.
    revision: partial

  2. Referee: [Results] Results section (performance comparisons): the reported 10-22% gains over five state-of-the-art deep learning models and 1-29% over 13 SAM variants lack description of baseline implementations, exact evaluation protocol, or controls for human variability, making it impossible to determine whether the gains are attributable to Hi-Seg or to differences in setup.

    Authors: Baseline implementations are specified in Methods Section 3.4, noting that the five DL models were re-implemented from official code on identical dataset splits and the 13 SAM variants followed their original configurations. The evaluation protocol uses 5-fold cross-validation on the held-out multi-center test set with Dice as the metric. Human variability is controlled by having multiple annotators per group segment the same cases, with averaged results and variance reported in Table 3. We agree these elements merit more explicit highlighting and will insert a clarifying paragraph in the Results section.
    revision: yes
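The two statistics named in the first response, Cohen's kappa and the paired t-test, can be sketched as follows. This is a minimal illustration under stated assumptions: kappa is computed pixel-wise on two annotators' binary labels, the t statistic is computed on per-case Dice differences, and the function names are hypothetical. How the paper actually computes either quantity is not stated in the text above.

```python
import math
import statistics


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' binary labels (e.g. flattened masks)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal foreground rate.
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    p_exp = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if p_exp == 1.0:
        return 1.0  # both raters used one identical constant label
    return (p_obs - p_exp) / (1 - p_exp)


def paired_t(scores_x, scores_y):
    """Paired t statistic and degrees of freedom for per-case scores."""
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1
```

Converting the t statistic to a p-value requires the Student t distribution with the returned degrees of freedom, which is left out here to keep the sketch dependency-free.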

Circularity Check

0 steps flagged

No circularity: empirical measurements on external multi-center data

Full rationale

The paper reports empirical Dice scores (~85% mean) for the Hi-Seg human-in-the-loop framework on chest CT scans from 1,179 patients across 12 centers, with direct comparisons to DL models and SAM variants. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on external dataset measurements rather than self-referential reductions, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of the human-in-the-loop process and the representativeness of the 1,179-patient multi-center dataset; no free parameters, invented entities, or additional axioms beyond standard evaluation metrics are introduced.

axioms (1)
  • Domain assumption: The Dice coefficient is an appropriate metric for evaluating segmentation quality in medical imaging.
    It is a widely used standard in the field for measuring overlap between predicted and reference masks.
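For reference, the Dice coefficient of a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). A minimal sketch, assuming binary masks given as nested lists of 0/1; the function name and the convention that two empty masks score 1.0 are illustrative choices, not taken from the paper.

```python
def dice_score(pred, ref):
    """Dice coefficient between two binary masks (nested lists of 0/1)."""
    pred_flat = [bool(v) for row in pred for v in row]
    ref_flat = [bool(v) for row in ref for v in row]
    if len(pred_flat) != len(ref_flat):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked foreground in both masks.
    intersection = sum(p and r for p, r in zip(pred_flat, ref_flat))
    total = sum(pred_flat) + sum(ref_flat)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A prediction of two foreground pixels that covers a one-pixel reference gives 2·1 / (2 + 1) ≈ 0.67, which shows how heavily over-segmentation of a small nodule is penalised.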

pith-pipeline@v0.9.1-grok · 5809 in / 1298 out tokens · 41489 ms · 2026-06-26T10:46:22.621226+00:00 · methodology

