
arxiv: 2606.04365 · v1 · pith:AFW73HKVnew · submitted 2026-06-03 · 💻 cs.CV · cs.AI

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Pith reviewed 2026-06-28 07:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords kidney lesion characterization · per-lesion prediction · CT volumes · DETR architecture · segmentation mask input · abdominal pretraining · set prediction · multi-granularity labels

The pith

LesionDETR reformulates kidney CT analysis as per-lesion set prediction, outputting variable numbers of lesions each with clinical attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move beyond patient- or organ-level predictions in 3D kidney CT to a per-lesion task where one model emits a variable number of lesions per kidney, each labeled with type, size, enhancement, and attenuation. It curates multi-granularity labels from 2,619 volumes at one center and tests zero-shot on KiTS23. Two design choices dominate performance: feeding a segmentation mask as an input channel and using same-domain abdominal pretraining rather than generic pretraining. This setup reaches bilateral side-level abnormality AUC of 0.799 on internal data and 0.817 on external data. The resulting per-lesion outputs align directly with how radiologists describe findings, opening a path to automated structured reports.

Core claim

LesionDETR is a DETR-style architecture that applies size-distance Hungarian matching and a hierarchical loss aggregating per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, segmentation mask input and same-domain abdominal pretraining (SuPreM) dominate, while generic large-corpus pretraining performs no better than random initialization. The model attains bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23; a count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions, though rare solid-lesion AP remains at the noise floor. The framework produces verified per-lesion predictions for downstream structured report generation.

What carries the argument

LesionDETR, a DETR-style set-prediction model using size-distance Hungarian matching for variable lesion counts and a hierarchical loss that aggregates per-slot outputs to side-level objectives.
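The matching step can be illustrated with a small sketch. It assumes that "size-distance" means the absolute gap between predicted and reported lesion size, combined with a DETR-style type term; the paper's exact cost is not given on this page, so `pair_cost`, its weights, and the dictionary fields are hypothetical. For readability it searches assignments exhaustively instead of running the Hungarian algorithm; both return the same minimum-cost matching on small inputs.

```python
from itertools import permutations

def pair_cost(pred, target, w_size=1.0, w_type=1.0):
    """Assumed cost: size gap in cm plus (1 - predicted probability of the true type)."""
    size_gap = abs(pred["size_cm"] - target["size_cm"])
    type_miss = 1.0 - pred["type_probs"][target["type"]]
    return w_size * size_gap + w_type * type_miss

def min_cost_matching(preds, targets):
    """Exhaustively search one-to-one assignments of targets to prediction slots.

    Returns (total_cost, [(slot_index, target_index), ...]).
    Assumes len(preds) >= len(targets); slots left unmatched stand for "no lesion".
    """
    best_cost, best_pairs = float("inf"), []
    for slots in permutations(range(len(preds)), len(targets)):
        cost = sum(pair_cost(preds[s], targets[t]) for t, s in enumerate(slots))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(s, t) for t, s in enumerate(slots)]
    return best_cost, best_pairs

# Toy values for one kidney: three prediction slots, two reported lesions.
preds = [
    {"size_cm": 3.1, "type_probs": {"cyst": 0.9, "solid": 0.1}},
    {"size_cm": 1.0, "type_probs": {"cyst": 0.2, "solid": 0.8}},
    {"size_cm": 0.2, "type_probs": {"cyst": 0.5, "solid": 0.5}},
]
targets = [
    {"size_cm": 1.2, "type": "solid"},
    {"size_cm": 3.0, "type": "cyst"},
]
total, pairs = min_cost_matching(preds, targets)
print(pairs, round(total, 3))
```

With more than a handful of slots, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the exhaustive search.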

If this is right

  • Per-lesion outputs directly support generation of structured radiology reports from CT volumes.
  • Segmentation mask as an input channel measurably improves side-level abnormality detection.
  • Same-domain abdominal pretraining outperforms both random initialization and generic large-corpus pretraining.
  • Data volume and diversity, rather than architecture changes, are the next bottleneck for rare solid lesions.
  • The model generalizes to an external dataset in zero-shot side-level abnormality detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-lesion set-prediction approach could transfer to lesions in other organs once comparable multi-granularity labels become available.
  • Emphasis on domain-specific pretraining implies that medical imaging benefits more from targeted pretraining corpora than from further scaling of general image models.
  • Low performance on solid lesions points to targeted collection of additional examples for those classes as the highest-impact follow-up.
  • The hierarchical loss that bridges instance and aggregate objectives may apply to other medical set-prediction tasks requiring both levels of granularity.
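One way to picture a loss that bridges instance and aggregate objectives is sketched here. The aggregation rule is an assumption: per-slot lesion probabilities are combined with a noisy-OR to give a side-level abnormality probability, and the two cross-entropy terms are summed. The names `side_probability`, `hierarchical_loss`, and `side_weight` are hypothetical, and the paper may aggregate differently.

```python
import math

def side_probability(slot_probs):
    """Noisy-OR: the side is abnormal if at least one slot holds a lesion."""
    p_none = 1.0
    for p in slot_probs:
        p_none *= 1.0 - p
    return 1.0 - p_none

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def hierarchical_loss(slot_probs, slot_targets, side_label, side_weight=1.0):
    """Mean per-slot loss plus a weighted side-level loss on the aggregated output.

    slot_targets are assumed to be already aligned to slots by the matching step
    (1 if the slot is matched to a reported lesion, 0 otherwise).
    """
    slot_term = sum(bce(p, y) for p, y in zip(slot_probs, slot_targets)) / len(slot_probs)
    side_term = bce(side_probability(slot_probs), side_label)
    return slot_term + side_weight * side_term

# Toy example: same side-level probability, different slot-level correctness.
good = hierarchical_loss([0.9, 0.1], [1, 0], side_label=1)
bad = hierarchical_loss([0.1, 0.9], [1, 0], side_label=1)
print(round(good, 3), round(bad, 3))
```

In the toy example both calls have the same side-level term, so only the per-slot term separates them, which shows why the two levels of supervision carry different information.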

Load-bearing premise

The multi-granularity side- and per-lesion labels curated from a single academic medical center are sufficiently accurate and consistent to support both internal evaluation and zero-shot external validation.

What would settle it

If removing the segmentation mask input channel or replacing same-domain pretraining with random initialization drops side-level AUC below 0.75 on the UF-Health dataset, the claimed dominance of those two design choices would not hold.
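The check described above reduces to computing a side-level AUC for the ablated model and comparing it with 0.75. A minimal sketch follows, assuming each kidney side contributes one score and one binary abnormality label; the scores and labels are toy values, not results from the paper.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy side-level scores from a hypothetical ablated model.
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 1, 0, 0, 1]
auc = roc_auc(scores, labels)
claim_fails = auc < 0.75  # threshold taken from the criterion stated above
print(round(auc, 3), claim_fails)
```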

Figures

Figures reproduced from arXiv: 2606.04365 by Chenkun Sun, Jiang Bian, Jie Xu, Jinqian Pan, Renjie Liang, Russell Terry, Zhengkang Fan.

Figure 1: Overview of the lesion-centric characterization framework. A 3D kidney CT …
Figure 2: Data processing workflow. (A) Cohort identification. Renal CT studies selected from the UF Health IDR by CPT code. (B) CT preprocessing. Segmentation, resampling, cropping, and two manual review stages. (C) Lesion-centric label extraction. An LLM extracted per-lesion attributes from radiology reports, followed by two-annotator verification. The two pipelines converge into a dataset of 788 patients with mu…
Figure 3: Per-lesion detection head and hierarchical supervision.
Figure 4: Data efficiency curves. (a) Side-level bilateral AUC for abnormality, cyst, and solid. (b) Per-lesion average precision (APcyst, APsolid, mAP) at size tolerance τ = 1 cm. Lines and shaded bands show mean and standard deviation across three seeds. Per-lesion detection (Panel b) shows the same qualitative scaling pattern at lower absolute levels. APcyst rises from near zero at 5% to approximately 0.15 at 100…
Original abstract

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reformulates kidney lesion characterization from CT as a per-lesion set-prediction task using a DETR-style architecture (LesionDETR) with size-distance Hungarian matching and a hierarchical loss. It curates 2,619 volumes from 788 patients at one center (UF-Health) with multi-granularity side- and per-lesion labels for four clinical attributes, employs KiTS23 for zero-shot external validation, and reports that a segmentation mask input channel plus same-domain abdominal pretraining (SuPreM) dominate across four input representations and six encoder initializations. Side-level abnormality AUC reaches 0.799 ± 0.009 (UF-Health) and 0.817 ± 0.072 (KiTS23); a count-conditioned variant yields per-lesion mAP 0.190 ± 0.083 on cystic lesions, while solid-lesion AP remains near the noise floor.

Significance. If the label quality supports the comparisons, the work usefully isolates two practical design choices (mask channel, domain pretraining) over generic pretraining and correctly flags data collection rather than architecture as the next bottleneck for rare solid lesions. The held-out internal/external splits and reporting of standard deviations are strengths; the framework's potential for structured reporting is a clear downstream motivation.

major comments (2)
  1. [Dataset curation paragraph] The multi-granularity side- and per-lesion labels (type, size, enhancement, attenuation) are the foundation for all ablation results and the KiTS23 zero-shot transfer, yet no inter-rater statistics, pathology correlation, explicit extraction protocol from reports, or verification procedure are described. Without these, observed gaps between mask+SuPreM and other configurations cannot be attributed to the model choices rather than label noise or center-specific annotation patterns.
  2. [Abstract and results sections] The reported AUCs and mAP values include standard deviations but provide no information on the number of runs, data-split stratification, or statistical testing for the claimed dominance of the two design choices; this weakens the strength of the cross-representation and cross-initialization conclusions.
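The reporting asked for in the second comment can be sketched as follows: per-configuration mean and standard deviation over seeds, plus a paired t statistic between two configurations. The AUC values are toy numbers, the run count of three follows the Figure 4 caption, and pairing assumes both configurations were trained with the same seeds. Whether the paper reports the sample or the population standard deviation is not stated; the sample version is used here.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation across runs."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] (same seed in both configurations)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Toy side-level AUCs for three seeds (hypothetical, not taken from the paper).
mask_suprem = [0.79, 0.80, 0.81]
baseline = [0.75, 0.77, 0.76]

m, s = mean_std(mask_suprem)
t = paired_t_statistic(mask_suprem, baseline)

# Two-sided 5% critical value of Student's t with 2 degrees of freedom.
T_CRIT_DF2 = 4.303
significant = abs(t) > T_CRIT_DF2
print(f"{m:.3f} ± {s:.3f}, t = {t:.2f}, significant = {significant}")
```

With only three runs the test has two degrees of freedom, so the critical value is large and modest gaps between configurations will not reach significance.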
minor comments (2)
  1. [Abstract] The phrase 'verified per-lesion predictions' in the abstract is not supported by any explicit verification step described in the text.
  2. [Method] Notation for the hierarchical loss and size-distance matching cost could be clarified with an explicit equation reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in dataset curation and experimental reporting. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Dataset curation paragraph] The multi-granularity side- and per-lesion labels (type, size, enhancement, attenuation) are the foundation for all ablation results and the KiTS23 zero-shot transfer, yet no inter-rater statistics, pathology correlation, explicit extraction protocol from reports, or verification procedure are described. Without these, observed gaps between mask+SuPreM and other configurations cannot be attributed to the model choices rather than label noise or center-specific annotation patterns.

    Authors: We agree that additional details on the label curation process are required to support attribution of performance differences to model choices. In the revised manuscript, we will expand the dataset section with an explicit description of the report extraction protocol (structured parsing of clinical radiology reports by trained annotators), the verification procedure (review by board-certified radiologists), and any inter-rater agreement statistics that were recorded during curation. Pathology correlation is not available for the majority of cases, as biopsy confirmation was not performed for benign or low-risk lesions; we will state this limitation explicitly. These additions will clarify the label generation process and strengthen the interpretation of the ablation results. revision: yes

  2. Referee: [Abstract and results sections] The reported AUCs and mAP values include standard deviations but provide no information on the number of runs, data-split stratification, or statistical testing for the claimed dominance of the two design choices; this weakens the strength of the cross-representation and cross-initialization conclusions.

    Authors: We acknowledge that the current description of the experimental protocol is insufficient. The reported standard deviations reflect variability across multiple independent runs. In the revision, we will explicitly state the number of runs, detail the patient-level stratification used for train/validation/test splits (to avoid data leakage), and add statistical comparisons (paired t-tests or equivalent) between the dominant configurations (mask channel and SuPreM pretraining) and the alternatives. These clarifications will be included in the results section and, space permitting, referenced in the abstract. revision: yes
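The leakage concern behind the patient-level split can be made concrete with a short sketch: volumes are grouped by patient before splitting, so no patient contributes volumes to more than one partition. The function name, split fractions, and identifiers are hypothetical, and the sketch groups by patient only; it does not stratify by label, which the rebuttal also mentions.

```python
import random
from collections import defaultdict

def patient_level_split(volume_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split volumes into train/val/test so that each patient lands in exactly one split."""
    by_patient = defaultdict(list)
    for volume, patient in volume_to_patient.items():
        by_patient[patient].append(volume)

    patients = sorted(by_patient)            # fixed order before the seeded shuffle
    random.Random(seed).shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # remainder goes to test
    }
    return {name: sorted(v for p in pats for v in by_patient[p])
            for name, pats in groups.items()}

# Toy cohort: 10 patients with 3 volumes each (hypothetical identifiers).
volume_to_patient = {f"vol{i:02d}": f"pat{i // 3}" for i in range(30)}
splits = patient_level_split(volume_to_patient)
print({name: len(vols) for name, vols in splits.items()})
```

Splitting at the volume level instead would let repeat scans of one patient appear in both training and test data, which tends to inflate test metrics.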

Circularity Check

0 steps flagged

No circularity: empirical results on held-out data with independent architecture

full rationale

The paper reports AUC and mAP metrics computed on held-out UF-Health test volumes and zero-shot KiTS23 cases, with no equations or claims that reduce these quantities to fitted parameters or self-citations by construction. LesionDETR is presented as a DETR variant with size-distance matching and hierarchical loss; these are standard architectural choices whose performance is measured externally rather than defined into the evaluation. Ablations across input representations and initializations provide independent content, and the central claims rest on data splits and external validation rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard supervised deep-learning assumptions; the model name LesionDETR and the SuPreM pretraining are engineering choices rather than new postulated entities.


