pith. machine review for the scientific record.

arxiv: 2605.13038 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Recognition: unknown

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

Beilei Cui, Hongliang Ren, Liangjing Shao

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords monocular depth estimation · sim-to-real transfer · colonoscopy · geometric estimation · Retinex theory · wavelet decomposition · 3D scene reconstruction

The pith

A model trained only on simulated colonoscopy images reaches state-of-the-art depth and 3D reconstruction accuracy on real patient data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoGE, a monocular framework that estimates depth and reconstructs colon scenes from single-camera video. Real 3D ground truth is unavailable in the narrow colon, so the method trains exclusively on simulated data and still works on realistic scenes. Two added modules close the visual gap: an illumination-aware supervisor drawn from Retinex theory that normalizes lighting differences, and a structure-aware extractor that uses wavelet decomposition to keep shared colon textures and edges. If these modules succeed, surgeons gain reliable 3D perception and navigation without collecting expensive real-world labels. Experiments show the resulting model matches or exceeds prior methods on both simulated and real test sets.

Core claim

CoGE is a novel framework for online monocular geometric estimation during colonoscopy. It first applies an illumination-aware supervision module based on Retinex theory to handle lighting diversity across scenes, then adds a structure-aware perception module based on wavelet decomposition to extract common structural and local features of the colon. Together, these modules enable a model trained solely on simulated data to achieve state-of-the-art geometric estimation in both simulated and realistic scenes.

What carries the argument

The illumination-aware supervision module based on Retinex theory combined with the structure-aware perception module based on wavelet decomposition, which together close the feature gap between simulated and real colonoscopy images.
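To make the structure-aware side concrete, here is a minimal sketch of one standard way to split a colon frame into a low-frequency shape band and high-frequency edge bands with a single-level 2D wavelet transform. It uses PyWavelets and a Haar wavelet purely as an illustration; the paper does not specify this library, this wavelet family, or how the sub-bands feed its perception module.

```python
# Illustrative wavelet-based structure extraction (not the paper's implementation):
# a single-level 2D discrete wavelet transform splits a frame into a low-frequency
# approximation and three high-frequency detail bands.
import numpy as np
import pywt


def wavelet_bands(gray_frame: np.ndarray, wavelet: str = "haar"):
    """Return (approximation, details) for one grayscale frame in [0, 1]."""
    cA, (cH, cV, cD) = pywt.dwt2(gray_frame, wavelet)
    # cA carries coarse colon shape; cH/cV/cD carry edges and fold texture,
    # the kinds of structural cues shared by simulated and real frames.
    details = np.stack([cH, cV, cD], axis=0)
    return cA, details


if __name__ == "__main__":
    frame = np.random.rand(256, 256)        # stand-in for a colonoscopy frame
    approx, details = wavelet_bands(frame)
    print(approx.shape, details.shape)      # (128, 128) (3, 128, 128)
```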

If this is right

  • Surgeons receive accurate 3D spatial perception and navigation cues from ordinary monocular colonoscopy video.
  • No real-world geometric ground truth is required for training or deployment.
  • The same model works at top accuracy on both fully simulated test scenes and real patient footage.
  • Estimation runs online, supporting live use during procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to other endoscopic domains that share similar narrow-space lighting and texture challenges.
  • Training costs drop sharply once simulation pipelines are built, since real labeled data collection is avoided.
  • Real-time surgical systems might incorporate the depth output directly for tool guidance or surface measurement.

Load-bearing premise

The two new modules based on Retinex lighting theory and wavelet structure analysis are sufficient by themselves to overcome the large differences in appearance caused by artifacts and illumination between simulated and real data.
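As a point of reference for the Retinex half of that premise, the classical single-scale Retinex decomposition separates an image into illumination and reflectance in the log domain. The sketch below is a textbook version of that idea, assuming a wide Gaussian blur as the illumination estimate; it is not the paper's illumination-aware supervision module, only the assumption it builds on.

```python
# Textbook single-scale Retinex decomposition (illustrative only): an image is
# modelled as I = R * L, so log I = log R + log L. Estimating L with a wide
# Gaussian blur leaves log R as the component whose gradients are largely
# invariant to lighting changes across scenes.
import numpy as np
from scipy.ndimage import gaussian_filter


def single_scale_retinex(image: np.ndarray, sigma: float = 30.0):
    """Split a grayscale image in (0, 1] into (log-reflectance, illumination)."""
    eps = 1e-6
    log_image = np.log(image + eps)
    illumination = gaussian_filter(image, sigma=sigma) + eps
    reflectance_log = log_image - np.log(illumination)
    return reflectance_log, illumination


if __name__ == "__main__":
    img = np.clip(np.random.rand(256, 256), 1e-3, 1.0)
    refl_log, illum = single_scale_retinex(img)
    # Gradients of refl_log approximate the illumination-invariant structure
    # that a supervision signal could be weighted toward.
    gy, gx = np.gradient(refl_log)
    print(refl_log.shape, illum.shape, float(np.abs(gx).mean()))
```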

What would settle it

A large-scale test on real colonoscopy videos in which the model's depth maps or reconstructed surfaces show larger errors than those of the current best methods trained with real geometric labels.
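For concreteness, the sketch below shows how such a head-to-head test is usually scored, assuming the standard monocular depth metrics (AbsRel, RMSE) plus the point-cloud metrics named in the paper's figures (mean Euclidean distance and the ED < N mm inlier ratio). The exact protocol, scale alignment, and point matching would be defined by the paper or benchmark, not by this sketch.

```python
# Illustrative definitions of common evaluation metrics for such a comparison
# (not the paper's exact evaluation protocol).
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Absolute relative error and RMSE over valid ground-truth pixels."""
    mask = gt > eps
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    return abs_rel, rmse


def pointcloud_metrics(pred_pts: np.ndarray, gt_pts: np.ndarray, n_mm: float = 3.0):
    """mED and the delta < N mm ratio for point clouds already put into correspondence.

    pred_pts, gt_pts: (N, 3) arrays of matched points in millimetres.
    """
    dists = np.linalg.norm(pred_pts - gt_pts, axis=1)
    med = float(dists.mean())              # mean Euclidean distance (mED)
    delta = float((dists < n_mm).mean())   # ratio of points with ED < N mm
    return med, delta
```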

Figures

Figures reproduced from arXiv: 2605.13038 by Beilei Cui, Hongliang Ren, Liangjing Shao.

Figure 1. Given observations from realistic and unseen scenes, the model trained only on simulated data outputs depth maps and 3D point clouds. view at source ↗
Figure 2. The pipeline of the proposed framework: given the cached previous observation I_{t−1} and the current observation I_t, the model outputs the corresponding point map X_t, camera pose P_t, illumination map L_t, and confidence map C_t, while two spatial memory caches M_k, M_v (inspired by Spann3R [13]) are updated after each step. view at source ↗
Figure 3. Attention-weight distribution between feature tokens and memory caches. Memory forget: for the attention map W_mem between features f_{t−1} and the memory caches M_k, M_v, a memory token is retained as relevant only if at least 5% of feature tokens attend to it with a weight larger than 5e-4 (see the sketch after this figure list). view at source ↗
Figure 4. Input images and corresponding light-influence maps. A self-supervised illumination-aware (IAS) module adjusts the confidence map used for supervision; based on Retinex theory [7], the invariance of the intrinsic component A_t of image I_t to illumination in the gradient domain (∇) generates the illumination-influence distribution. view at source ↗
Figure 5. Qualitative results of monocular depth estimation. For 3D reconstruction from monocular videos, the method is compared with several 3D foundation models using the mean Euclidean distance (mED) to the ground-truth point cloud and the ratio of points with ED < N mm (δ < N). view at source ↗
Figure 6. Qualitative comparison of 3D reconstruction. view at source ↗
Figure 7. Qualitative display of our 3D reconstruction. view at source ↗
Figure 8. Qualitative results of ablation studies. view at source ↗
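The memory-forget rule quoted under Figure 3 reduces to a thresholded count over an attention map. A minimal sketch follows, assuming W_mem holds attention weights between d feature tokens and s memory tokens, with the 5% and 5e-4 thresholds taken from the figure text; it is not the paper's code.

```python
# Minimal sketch of the memory-forget rule described under Figure 3: a memory
# token is kept only if at least 5% of the feature tokens attend to it with a
# weight above 5e-4. W_mem is assumed to be a (d, s) attention map between
# d feature tokens and s memory tokens; both thresholds come from the figure text.
import numpy as np


def retained_memory_mask(w_mem: np.ndarray,
                         weight_thresh: float = 5e-4,
                         token_frac: float = 0.05) -> np.ndarray:
    """Return a boolean mask of length s marking memory tokens to retain."""
    d = w_mem.shape[0]
    strong_votes = (w_mem > weight_thresh).sum(axis=0)  # per memory token
    return strong_votes >= token_frac * d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.random((1024, 256)) * 1e-3                   # toy attention map
    keep = retained_memory_mask(w)
    print(int(keep.sum()), "of", keep.size, "memory tokens retained")
```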
read the original abstract

Geometric estimation including depth estimation and scene reconstruction is a crucial technique for colonoscopy which can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain due to narrow and enclosed space of the colon, while there is a large feature gap between simulated data and realistic data caused by artifacts and illumination. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. Firstly, we propose an illumination-aware supervision module based on the Retinex theory to address illumination diversity in different colonoscopy scenes. Moreover, a structure-aware perception module is proposed based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model solely trained on simulated data achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CoGE, a framework for online monocular geometric estimation (depth estimation and scene reconstruction) in colonoscopy. It introduces an illumination-aware supervision module based on Retinex theory to handle illumination diversity and a structure-aware perception module based on wavelet decomposition to extract common structural features. The model is trained exclusively on simulated data and claims state-of-the-art performance on both simulated and realistic scenes, supported by quantitative metrics on simulation and qualitative results on real data.

Significance. If the sim-to-real transfer holds under rigorous evaluation, the work could advance practical 3D perception in colonoscopy without requiring real-world geometric ground truth, which is difficult to obtain in the narrow enclosed anatomy. The Retinex-based illumination handling and wavelet-based structure extraction represent targeted adaptations for endoscopy domain gaps, potentially enabling better surgical navigation if the performance generalizes beyond visual inspection.

major comments (2)
  1. [Abstract] Abstract: The central claim that the model achieves SOTA geometric estimation on realistic scenes is supported only by qualitative results. The abstract explicitly notes that geometric ground truth is difficult to obtain in real colonoscopies, so quantitative metrics (RMSE, AbsRel, etc.) are presumably computed only on simulated data. This renders the sim-to-real SOTA assertion dependent on subjective visual comparison rather than direct numerical benchmarking against baselines on real data.
  2. [Experiments] Experiments section (assumed §4 or §5): If real-scene evaluation relies solely on qualitative inspection or indirect proxies without falsifiable quantitative comparison to prior methods, the headline performance claim on realistic scenes cannot be verified objectively. A load-bearing fix would require either proxy metrics on real data or explicit acknowledgment that real-scene superiority is qualitative only.
minor comments (1)
  1. [Method] Ensure consistent notation for the two proposed modules (illumination-aware supervision and structure-aware perception) across the method and results sections to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to clarify the distinction between quantitative and qualitative evaluations. We agree that the manuscript should more explicitly acknowledge the limitations of real-scene assessment and will revise the abstract and experiments section to ensure claims are precisely supported by the available evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the model achieves SOTA geometric estimation on realistic scenes is supported only by qualitative results. The abstract explicitly notes that geometric ground truth is difficult to obtain in real colonoscopies, so quantitative metrics (RMSE, AbsRel, etc.) are presumably computed only on simulated data. This renders the sim-to-real SOTA assertion dependent on subjective visual comparison rather than direct numerical benchmarking against baselines on real data.

    Authors: We agree that quantitative metrics cannot be computed on real data due to the unavailability of geometric ground truth, as already stated in the manuscript. In the endoscopy literature, qualitative visual comparisons are the established standard for evaluating sim-to-real transfer when ground truth is unobtainable. Our results show clear improvements in illumination handling and structural fidelity over baselines in real colonoscopy footage. We will revise the abstract to explicitly qualify that state-of-the-art performance on realistic scenes is demonstrated qualitatively. revision: yes

  2. Referee: [Experiments] Experiments section (assumed §4 or §5): If real-scene evaluation relies solely on qualitative inspection or indirect proxies without falsifiable quantitative comparison to prior methods, the headline performance claim on realistic scenes cannot be verified objectively. A load-bearing fix would require either proxy metrics on real data or explicit acknowledgment that real-scene superiority is qualitative only.

    Authors: We accept that real-scene evaluation is qualitative only. We will add an explicit statement in the Experiments section clarifying that quantitative results are restricted to simulated data while real-scene superiority is shown through visual comparisons. This revision will prevent any overstatement and align the claims with the evidence presented. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces an illumination-aware supervision module based on Retinex theory and a structure-aware perception module based on wavelet decomposition as additive components to handle sim-to-real domain gaps. These are presented as independent architectural choices rather than redefinitions of the target depth or reconstruction metrics. No equations in the provided abstract or description reduce the claimed performance metrics (e.g., RMSE on geometric estimation) to quantities fitted directly on the evaluation data by construction. Training occurs exclusively on simulated data with known ground truth, while real-scene results are described qualitatively; this separation avoids any fitted-input-called-prediction pattern. Self-citations, if present in the full text, are not load-bearing for the core claim. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on two domain assumptions drawn from classical image processing rather than new postulates; no free parameters or invented entities are introduced beyond standard neural-network training.

axioms (2)
  • domain assumption Retinex theory can be used to separate illumination from reflectance in colonoscopy images
    Invoked to build the illumination-aware supervision module that addresses lighting diversity
  • domain assumption Wavelet decomposition extracts structural features common to both simulated and real colon images
    Basis for the structure-aware perception module

pith-pipeline@v0.9.0 · 5443 in / 1334 out tokens · 72836 ms · 2026-05-14T20:40:24.052299+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages

  1. Bobrow, T.L., Golhar, M., Vijayan, R., Akshintala, V.S., Garcia, J.R., Durr, N.J.: Colonoscopy 3d video dataset with paired depth from 2d-3d registration. Medical Image Analysis 90, 102956 (2023)
  2. Cui, B., Islam, M., Bai, L., Wang, A., Ren, H.: Endodac: Efficient adapting foundation model for self-supervised depth estimation from any endoscopic camera. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 208–218. Springer (2024)
  3. Golhar, M.V., Fretes, L.S.G., Ayers, L., Akshintala, V.S., Bobrow, T.L., Durr, N.J.: C3vdv2 – colonoscopy 3d video dataset with enhanced realism. arXiv preprint arXiv:2506.24074 (2025)
  4. Guo, J., Dong, W., Huang, T., Ding, H., Wang, Z., Kuang, H., Dou, Q., Liu, Y.H.: Endo3r: Unified online reconstruction from dynamic monocular endoscopic video. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 170–180. Springer (2025)
  5. İncetan, K., Celik, I.O., Obeid, A., Gokceler, G.I., Ozyoruk, K.B., Almalioglu, Y., Chen, R.J., Mahmood, F., Gilbert, H., Durr, N.J., et al.: Vr-caps: A virtual environment for capsule endoscopy. Medical Image Analysis 70, 101990 (2021)
  6. Jiang, Z., Wang, Y., Cheng, X., Ge, Z.: Colonadapter: Geometry estimation through foundation model adaptation for colonoscopy. IEEE Robotics and Automation Letters (2025)
  7. Kim, S., Park, K., Sohn, K., Lin, S.: Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In: European Conference on Computer Vision, pp. 143–159. Springer (2016)
  8. Lan, Y., Luo, Y., Hong, F., Zhou, S., Chen, H., Lyu, Z., Yang, S., Dai, B., Loy, C.C., Pan, X.: Stream3r: Scalable sequential 3d reconstruction with causal transformer. arXiv preprint arXiv:2508.10893 (2025)
  9. Mahmood, F., Chen, R., Durr, N.J.: Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging 37(12), 2572–2581 (2018)
  10. Mahmood, F., Durr, N.J.: Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy. Medical Image Analysis 48, 230–243 (2018)
  11. Rau, A., Bano, S., Jin, Y., Azagra, P., Morlana, J., Kader, R., Sanderson, E., Matuszewski, B.J., Lee, J.Y., Lee, D.J., et al.: Simcol3d – 3d reconstruction during colonoscopy challenge. Medical Image Analysis 96, 103195 (2024)
  12. Shao, S., Pei, Z., Chen, W., Zhu, W., Wu, X., Sun, D., Zhang, B.: Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue. Medical Image Analysis 77, 102338 (2022)
  13. Wang, H., Agapito, L.: 3d reconstruction with spatial memory. In: 2025 International Conference on 3D Vision (3DV), pp. 78–89. IEEE (2025)
  14. Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3d perception model with persistent state. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 10510–10522 (2025)
  15. Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20697–20709 (2024)
  16. Wang, Z., Zhou, Y., He, S., Li, T., Huang, F., Ding, Q., Feng, X., Liu, M., Li, Q.: Monopcc: Photometric-invariant cycle constraint for monocular depth estimation of endoscopic images. Medical Image Analysis 102, 103534 (2025)
  17. Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10371–10381 (2024)
  18. Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)