Understanding Model Behavior in Monocular Polyp Sizing
Pith reviewed 2026-05-21 06:38 UTC · model grok-4.3
The pith
Monocular polyp sizing models perform with moderate consistency across input types, which suggests they rely on examination-behavior cues rather than true metric scale; segmentation errors under distribution shift block most of the gains that better scale data could offer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across architectures and input modalities, model performance on polyp size classification remains moderately consistent, suggesting dependence on cues correlated with examination behavior rather than on metric scale. Supplying ground-truth scale at varying granularities quantifies the potential improvement from perfect scale information and shows that current depth estimation and global calibration offer only limited gains. Oracle scale paired with predicted masks under distribution shift recovers only baseline performance, which points to metric scale and mask robustness as independent bottlenecks.
What carries the argument
Oracle scale ladders combined with mask substitution isolate the separate effects of perfect scale information and segmentation accuracy by swapping in ground-truth masks or scale values at test time.
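A minimal sketch of how an oracle scale ladder and mask substitution could be crossed in one audit grid. It assumes a purely geometric sizing rule (mask diameter in pixels times a mm-per-pixel scale) rather than the learned classifiers the paper evaluates. The Sample fields, the three rung names, and the audit_grid helper are hypothetical illustrations and are not taken from the paper's code.

```python
from dataclasses import dataclass
from itertools import product

THRESHOLD_MM = 5.0  # binary split from the abstract: <=5 mm vs. >5 mm


@dataclass
class Sample:
    size_mm: float          # reference polyp size used to derive the label
    gt_diam_px: float       # diameter measured on the ground-truth mask
    pred_diam_px: float     # diameter measured on the predicted mask
    frame_mm_per_px: float  # oracle per-frame scale (assumed to be available)
    video_id: str


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def scale_ladder(samples):
    """Return {rung: [mm_per_px per sample]}, ordered from coarse to fine."""
    global_scale = _mean(s.frame_mm_per_px for s in samples)
    by_video = {}
    for s in samples:
        by_video.setdefault(s.video_id, []).append(s.frame_mm_per_px)
    video_scale = {v: _mean(xs) for v, xs in by_video.items()}
    return {
        "global": [global_scale] * len(samples),
        "per_video": [video_scale[s.video_id] for s in samples],
        "per_frame": [s.frame_mm_per_px for s in samples],
    }


def audit_grid(samples):
    """Accuracy for every (scale rung, mask source) combination."""
    ladder = scale_ladder(samples)
    masks = {
        "gt_mask": [s.gt_diam_px for s in samples],
        "pred_mask": [s.pred_diam_px for s in samples],
    }
    labels = [s.size_mm > THRESHOLD_MM for s in samples]
    grid = {}
    for (rung, scales), (mask, diams) in product(ladder.items(), masks.items()):
        # Geometric sizing rule (an assumption): size = diameter_px * mm_per_px.
        preds = [d * k > THRESHOLD_MM for d, k in zip(diams, scales)]
        grid[(rung, mask)] = _mean(p == y for p, y in zip(preds, labels))
    return grid
```

Comparing the ("per_frame", "gt_mask") and ("per_frame", "pred_mask") cells holds the oracle scale fixed and changes only the mask source, which is the contrast the argument leans on.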
If this is right
- Perfect scale information improves classification only when polyp masks are accurate.
- Segmentation errors under distribution shift remove most potential benefit from better scale data.
- Current depth estimation and global calibration deliver limited gains for polyp sizing.
- Tools such as oracle scale ladders, shortcut partitions, and mask substitution enable targeted auditing of future sizing pipelines.
Where Pith is reading between the lines
- The auditing approach could be applied to other monocular medical measurement tasks where both scale and object boundaries matter.
- Patient-stratified analysis might reveal whether the learned cues are tied to specific endoscopists or equipment.
- Combining more robust segmentation with scale-aware training could be tested as a way to address both bottlenecks together.
Load-bearing premise
Moderately consistent performance across RGB appearance, relative depth, and photometry inputs means the models are using examination behavior cues instead of true metric scales or other shared dataset properties.
What would settle it
A large accuracy increase on held-out centers when models receive both ground-truth masks and oracle scale, compared with the same scale but predicted masks, would confirm that mask robustness is an independent limit.
Original abstract
Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.
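The abstract does not spell out the patient-stratified cross-validation protocol here. One plausible reading, sketched below under that assumption, is that every frame from a given patient is confined to a single fold so that no patient appears in both training and test data. The patient_grouped_folds helper and its greedy balancing rule are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict


def patient_grouped_folds(patient_ids, n_folds=5):
    """Yield (train_idx, test_idx) with each patient confined to one fold."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Greedy balancing: place the largest patients first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_folds)]
    for pid in sorted(by_patient, key=lambda p: (-len(by_patient[p]), p)):
        min(folds, key=len).extend(by_patient[pid])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test
```

A short usage example: list(patient_grouped_folds(["p1", "p1", "p2"], n_folds=2)) returns two splits in which both "p1" frames always land on the same side.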
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical diagnostic audit of binary polyp size classification (≤5 mm vs. >5 mm) for monocular colonoscopy. Using multiple public multi-center datasets, various model architectures, and patient-stratified cross-validation, it evaluates performance across input modalities (RGB appearance, relative depth, photometry). The authors report moderately consistent results suggesting reliance on examination-behavior cues rather than true metric scales. Oracle scale ladders quantify potential gains from perfect scale information, while mask-substitution experiments under distribution shift show that segmentation errors largely eliminate those gains. The work identifies metric scale and mask robustness as independent bottlenecks and releases reusable evaluation tools (oracle ladders, shortcut partitions, mask substitution) along with public code.
Significance. If the central empirical findings hold, the paper supplies practical auditing tools for future monocular polyp-sizing pipelines in medical computer vision. Multi-center data, patient-stratified splits, and systematic oracle ablations provide a reproducible framework for isolating scale versus mask effects. The public code and concrete evaluation protocols (scale ladders, mask substitution) constitute reusable contributions that can be applied beyond the current datasets.
major comments (2)
- [Modality-consistency results (abstract and §4)] The claim that moderately consistent performance across RGB, relative depth, and photometry indicates reliance on examination-behavior cues rather than metric scales rests on the assumption that these derived modalities do not retain correlated low-level size-predictive signals. Because relative depth and photometry are generated from the same source RGB frames, shared texture gradients, lighting patterns, or center-specific artifacts could still correlate with size labels; the current experiments do not isolate or rule out these alternatives.
- [Oracle-scale and mask-substitution ablations (§5)] While the experiments demonstrate that perfect scale information yields gains and that predicted masks under shift largely negate them, the manuscript should report exact performance deltas, confidence intervals, and statistical significance for the key comparisons (baseline vs. oracle-scale, oracle-scale vs. oracle-scale+predicted-mask) to make the “independent bottlenecks” conclusion quantitatively precise.
minor comments (2)
- [Abstract] Quantitative performance numbers, error bars, and exact effect sizes for the main claims are missing; adding them would allow readers to assess the magnitude of the reported consistency and oracle gains directly.
- [Methods and figures] Notation and figure clarity: ensure that all modality-conversion steps (how relative depth and photometry are obtained from RGB) are described with sufficient detail for exact reproduction, and label the oracle-ladder plots with the precise granularity levels used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. The comments help clarify the interpretation of our modality-consistency experiments and strengthen the quantitative support for our conclusions on independent bottlenecks. We address each major comment below.
- Referee: [Modality-consistency results (abstract and §4)] The claim that moderately consistent performance across RGB, relative depth, and photometry indicates reliance on examination-behavior cues rather than metric scales rests on the assumption that these derived modalities do not retain correlated low-level size-predictive signals. Because relative depth and photometry are generated from the same source RGB frames, shared texture gradients, lighting patterns, or center-specific artifacts could still correlate with size labels; the current experiments do not isolate or rule out these alternatives.
  Authors: We appreciate the referee's point on potential residual correlations. Relative depth maps are produced by monocular estimators that are scale-ambiguous by design and typically normalized to a relative range, while photometry extracts shading and illumination cues after explicit normalization steps that remove absolute intensity scales. The moderate consistency across these inputs therefore indicates that predictive signals are largely shared and tied to examination behavior (e.g., centering, distance proxies via appearance patterns) rather than modality-specific metric information. We acknowledge that low-level shared artifacts cannot be entirely excluded without further controlled synthetic experiments. In revision we will add a clarifying paragraph in §4 that explicitly states the normalization properties of each derived modality and notes this as a limitation of the current design.
  Revision: partial
- Referee: [Oracle-scale and mask-substitution ablations (§5)] While the experiments demonstrate that perfect scale information yields gains and that predicted masks under shift largely negate them, the manuscript should report exact performance deltas, confidence intervals, and statistical significance for the key comparisons (baseline vs. oracle-scale, oracle-scale vs. oracle-scale+predicted-mask) to make the “independent bottlenecks” conclusion quantitatively precise.
  Authors: We agree that reporting exact deltas, confidence intervals, and statistical tests will make the independent-bottlenecks claim more rigorous. In the revised manuscript we will augment §5 and the associated tables with (i) mean accuracy/F1 deltas with 95% bootstrap confidence intervals and (ii) paired statistical significance tests (McNemar or Wilcoxon signed-rank, as appropriate) for the three key comparisons: baseline vs. oracle-scale, oracle-scale vs. oracle-scale under predicted masks, and baseline vs. oracle-scale under predicted masks. These additions will be included in both the main text and supplementary material.
  Revision: yes
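The statistics promised in this response can be sketched in a few lines. The block below is an illustrative assumption rather than the authors' implementation: it takes two per-sample correctness vectors from paired conditions (for example, baseline vs. oracle scale), computes the accuracy delta with a 95% percentile bootstrap interval, and runs a two-sided exact McNemar test on the discordant pairs. The function names are hypothetical, and resampling is done per sample for brevity; with patient-stratified data, resampling whole patients would be the more conservative choice.

```python
import random
from math import comb


def paired_bootstrap_delta(correct_a, correct_b, n_boot=2000, seed=0):
    """Accuracy delta (b minus a) with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-sample paired difference: +1, 0, or -1.
    diffs = [int(b) - int(a) for a, b in zip(correct_a, correct_b)]
    point = sum(diffs) / n
    stats = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = stats[int(0.025 * n_boot)]
    hi = stats[int(0.975 * n_boot) - 1]
    return point, (lo, hi)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Binomial tail under the null that each discordant pair is a fair coin.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Running both functions on each of the three key comparisons would give the deltas, intervals, and p-values the referee asks for.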
Circularity Check
No circularity: empirical audit of model behavior with no derivational reductions
The paper conducts an empirical diagnostic audit of binary polyp size classification across public datasets, model architectures, and input modalities (RGB, relative depth, photometry). It reports observed performance consistencies, quantifies gains from oracle scale ladders and mask substitutions, and identifies bottlenecks through direct experimentation and cross-validation. No mathematical derivations, first-principles predictions, or equations are presented that reduce to fitted inputs or self-citations by construction; all claims rest on external benchmarks and reusable evaluation tools. The analysis is self-contained with no load-bearing self-citation chains or self-definitional steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ground-truth polyp size labels in the public multi-center datasets are accurate and consistent for binary <=5 mm vs. >5 mm classification.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "monocular colonoscopy lacks a reliable metric reference... apparent scale A and procedure context B can correlate with the size label S, providing shortcuts"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "oracle per-frame scale yields a +16.1pp gain... segmentation errors... erase the gains from oracle scale"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.