GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth
Pith reviewed 2026-05-10 15:43 UTC · model grok-4.3
The pith
GeomPrompt creates a geometric prompt from RGB alone to serve as the depth channel for frozen RGB-D semantic segmentation models when depth is missing or degraded.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeomPrompt synthesizes a task-driven geometric prompt from RGB alone to fill the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. On SUN RGB-D it improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, stays competitive with monocular depth estimators, and runs at 7.8 ms latency versus their 38.3 ms and 71.9 ms. GeomPrompt-Recovery extends the same idea by predicting a fourth-channel correction that improves robustness under depth corruption, with gains of up to +3.6 mIoU.
What carries the argument
GeomPrompt, a lightweight cross-modal adaptation module that generates a task-driven geometric prompt from RGB to populate the depth channel of a frozen segmenter.
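The data flow can be sketched in a few lines. This is a hypothetical toy, not the paper's code: `geom_prompt` stands in for the learned lightweight module (here reduced to a per-pixel linear map), and the only part taken from the source is the shape of the pipeline: RGB in, one synthesized channel out, concatenated as the fourth channel of a standard 4-channel input to the frozen segmenter.

```python
import numpy as np

# Toy stand-in for the learned prompt module (names are ours, not the paper's).
def geom_prompt(rgb, w, b):
    """Per-pixel linear map from RGB to a single prompt channel."""
    return rgb @ w + b  # (H, W, 3) @ (3, 1) -> (H, W, 1)

def fourth_channel_input(rgb, prompt):
    """Use the synthesized prompt in place of the missing depth channel."""
    return np.concatenate([rgb, prompt], axis=-1)  # (H, W, 4)

rgb = np.random.rand(8, 8, 3).astype(np.float32)
w = np.random.randn(3, 1).astype(np.float32) * 0.1  # trainable in the real module
b = np.zeros(1, dtype=np.float32)

x = fourth_channel_input(rgb, geom_prompt(rgb, w, b))
print(x.shape)  # (8, 8, 4): the frozen RGB-D model sees a standard 4-channel input
```

In the paper the prompt module is trained through the frozen segmenter with the segmentation loss only; the toy above shows only the inference-time wiring.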
If this is right
- Segmentation quality improves over RGB-only baselines under missing depth without needing depth ground truth.
- A parallel recovery module restores performance when depth is present but degraded.
- The approach runs substantially faster than pipelines that first compute monocular depth.
- Both modules are trained end-to-end with segmentation supervision alone.
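The recovery variant admits the same kind of sketch. Again a toy under our own naming (`recovery_correction` and `corrected_input` are assumptions, not the paper's API): the correction head sees both RGB and the degraded depth and predicts an additive fourth-channel correction, so the frozen segmenter receives a repaired depth channel rather than the raw corrupted one.

```python
import numpy as np

def recovery_correction(rgb, depth, w_rgb, w_d, b):
    """Toy correction head: per-pixel linear map from RGB and degraded depth."""
    return rgb @ w_rgb + depth * w_d + b  # (H, W, 1)

def corrected_input(rgb, depth, correction):
    """Fourth channel is the corrected depth, not the raw degraded one."""
    return np.concatenate([rgb, depth + correction], axis=-1)  # (H, W, 4)

rgb = np.random.rand(8, 8, 3).astype(np.float32)
depth = np.random.rand(8, 8, 1).astype(np.float32)
depth[2:4, 2:4] = 0.0  # simulate a hole, a common depth corruption

w_rgb = np.random.randn(3, 1).astype(np.float32) * 0.1
w_d, b = 0.1, 0.0  # scalars in this toy; learned parameters in practice

x = corrected_input(rgb, depth, recovery_correction(rgb, depth, w_rgb, w_d, b))
print(x.shape)  # (8, 8, 4)
```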
Where Pith is reading between the lines
- Task-guided geometric features extracted from RGB may substitute for explicit depth maps in other RGB-D downstream tasks.
- The prompting technique could be applied to additional sensor-failure scenarios such as camera occlusion or lighting changes.
- Further tests on real-time robotic platforms would show whether the latency savings translate to practical reliability gains.
Load-bearing premise
A prompt optimized solely for segmentation accuracy on RGB images can recover geometric information that remains useful to the frozen model when depth is absent or corrupted.
What would settle it
Running the trained GeomPrompt on a new dataset containing scene types or depth failure patterns absent from SUN RGB-D training, then measuring whether the mIoU gains persist, shrink, or reverse.
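Such a test reduces to comparing mIoU with and without the prompt on held-out data. For concreteness, a minimal mIoU routine from a confusion matrix (the textbook definition, not code from the paper):

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean intersection-over-union over classes with a non-empty union."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # rows: ground truth, cols: prediction
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm)
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())

gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
# class 0: 1/2, class 1: 2/3, class 2: 2/2 -> mean ~0.722
print(round(miou(pred, gt, 3), 3))
```

The paper's +6.1 and +3.0 mIoU deltas are differences of exactly this statistic between the prompted and RGB-only inputs.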
read the original abstract
Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone to serve as the fourth (depth) channel input to a frozen RGB-D semantic segmentation model, without any depth supervision or explicit depth estimation. It also presents GeomPrompt-Recovery for compensating degraded depth inputs. Both modules are trained solely via the downstream segmentation loss. On SUN RGB-D, GeomPrompt yields +6.1 mIoU over RGB-only inference for DFormer and +3.0 mIoU for GeminiFusion, remains competitive with monocular depth estimators, and is substantially faster (7.8 ms latency); GeomPrompt-Recovery provides up to +3.6 mIoU robustness gains under depth corruptions.
Significance. If the central claim holds, the work offers an efficient mechanism for cross-modal compensation in RGB-D perception under unreliable depth, which is valuable for robotics and embodied AI. The reported latency advantage and lack of requirement for depth labels or separate depth estimators are clear practical strengths. Significance is reduced, however, by the absence of evidence that the learned prompt encodes a transferable geometric prior rather than dataset- or model-specific compensations.
major comments (3)
- [Abstract and §4 (Experiments)] All quantitative gains (+6.1 / +3.0 mIoU) are measured on the SUN RGB-D test split after training the prompt on the same SUN RGB-D training distribution; no cross-dataset transfer results (e.g., training on SUN RGB-D and evaluating on NYUv2 or ScanNet) are reported, leaving open whether the prompt recovers a generalizable geometric signal or merely fits dataset-specific RGB–label correlations.
- [§3 (Method)] The geometric prompt is defined and optimized exclusively against the segmentation objective on the target data with no auxiliary geometric loss, depth supervision, or independent geometric benchmark; this makes the interpretation that the module “recovers the geometric prior useful for segmentation, rather than estimating depth signals” circular by construction and requires additional disambiguation.
- [§4 (Results)] The abstract and results lack details on the training procedure, number of runs, statistical significance of the reported mIoU deltas, and ablations on the prompt formulation or injection mechanism; these omissions directly limit verification of the central claim that the synthesized prompt is geometrically meaningful to the frozen segmenter.
minor comments (2)
- [Figure 1 and §3.1] The diagram and text describing how the synthesized prompt is concatenated or injected as the fourth channel should explicitly state tensor shapes and whether any normalization or scaling is applied before feeding the frozen model.
- [Related Work] Consider citing recent prompt-tuning and adapter methods for multimodal segmentation to better situate the novelty of the task-driven geometric formulation.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment below, providing clarifications and indicating where revisions will be made to improve the manuscript.
read point-by-point responses
- Referee: [Abstract and §4 (Experiments)] All quantitative gains (+6.1 / +3.0 mIoU) are measured on the SUN RGB-D test split after training the prompt on the same SUN RGB-D training distribution; no cross-dataset transfer results (e.g., training on SUN RGB-D and evaluating on NYUv2 or ScanNet) are reported, leaving open whether the prompt recovers a generalizable geometric signal or merely fits dataset-specific RGB–label correlations.
Authors: We acknowledge that cross-dataset transfer experiments would provide stronger support for the generalizability of the learned prompt beyond dataset-specific correlations. Our primary focus was on SUN RGB-D as a standard, challenging benchmark for RGB-D segmentation. In the revised manuscript, we will add results training on SUN RGB-D and evaluating on NYUv2 to directly address this point and better substantiate the claim of a transferable geometric prior. revision: yes
- Referee: [§3 (Method)] The geometric prompt is defined and optimized exclusively against the segmentation objective on the target data with no auxiliary geometric loss, depth supervision, or independent geometric benchmark; this makes the interpretation that the module “recovers the geometric prior useful for segmentation, rather than estimating depth signals” circular by construction and requires additional disambiguation.
Authors: The absence of auxiliary geometric losses or depth supervision is a deliberate design choice to learn a task-driven prompt that leverages the geometric capabilities already encoded in the frozen RGB-D model, rather than performing explicit depth estimation. The observed gains over RGB-only inference provide empirical support for the utility of the synthesized prompt. To further disambiguate and strengthen the geometric interpretation, we will include additional analysis such as prompt visualizations and comparisons against monocular depth outputs in the revision. revision: partial
- Referee: [§4 (Results)] The abstract and results lack details on the training procedure, number of runs, statistical significance of the reported mIoU deltas, and ablations on the prompt formulation or injection mechanism; these omissions directly limit verification of the central claim that the synthesized prompt is geometrically meaningful to the frozen segmenter.
Authors: We agree that additional experimental details are necessary for rigorous verification. In the revised manuscript, we will expand §4 to include complete training procedure and hyperparameter specifications, results over multiple runs with standard deviations to assess statistical significance of the mIoU improvements, and further ablations on prompt formulation and injection strategies. revision: yes
Circularity Check
No significant circularity: the module is an empirical adapter trained with segmentation loss, explicitly distinguished from depth estimation.
full rationale
The paper describes GeomPrompt as a lightweight module that generates a prompt for the depth channel of a frozen RGB-D segmenter, trained exclusively via downstream segmentation supervision on SUN RGB-D without any depth ground truth or geometric regularizer. The reported gains (+6.1 mIoU, +3.0 mIoU) are direct empirical measurements of the trained module's effect on the same task and dataset; no derivation chain, equation, or uniqueness claim reduces the prompt to a tautological re-expression of the segmentation loss or to a self-citation. The text explicitly contrasts the approach with depth estimation, and no load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided description. The method is therefore a standard supervised adapter whose validity rests on held-out test performance rather than on any definitional or fitted-input loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RGB images contain recoverable geometric cues sufficient for segmentation when depth is absent.
invented entities (1)
- Task-driven geometric prompt (no independent evidence).
Reference graph
Works this paper leans on
- [1] Fahira Afzal Maken, Sundaram Muthu, Chuong Nguyen, Changming Sun, Jinguang Tong, Shan Wang, Russell Tsuchida, David Howard, Simon Dunstall, and Lars Petersson. Improving 3D Reconstruction Through RGB-D Sensor Noise Modeling. Sensors, 25(3):950, 2025.
- [2] Shashank Agnihotri, Julia Grabinski, and Margret Keuper. Improving Feature Stability During Upsampling – Spectral Artifacts and the Importance of Spatial Context. In Computer Vision – ECCV 2024, pages 357–376. Springer Nature Switzerland, Cham, 2025.
- [3] Jonathan T. Barron and Jitendra Malik. Shape, albedo, and illumination from a single image of an unknown object. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 334–341, 2012.
- [4] Minh Bui and Kostas Alexis. Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer. In 2025 IEEE International Conference on Advanced Robotics (ICAR), pages 404–411, San Juan, Argentina, 2025. IEEE.
- [5] John Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6):679–698, 1986.
- [6] Adriano Cardace, Luca De Luigi, Pierluigi Zama Ramirez, Samuele Salti, and Luigi Di Stefano. Plugging Self-Supervised Monocular Depth into Unsupervised Domain Adaptation for Semantic Segmentation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1999–2009, Waikoloa, HI, USA, 2022. IEEE.
- [7] Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, and Guorong Cai. Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7834–7841, Hangzhou, China, 2025. IEEE.
- [8] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning Semantic Segmentation From Synthetic Data: A Geometrically Guided Input-Output Adaptation Approach. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1841–1850, Long Beach, CA, USA, 2019. IEEE.
- [9] Benjamin Choo, Michael Landau, Michael DeVore, and Peter Beling. Statistical Analysis-Based Error Models for the Microsoft Kinect Depth Sensor. Sensors, 14(9):17430–17450, 2014.
- [10] Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Learning Depth Estimation for Transparent and Mirror Surfaces. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9210–9221, Paris, France, 2023. IEEE.
- [11] Jonathan Crespo, Jose Carlos Castillo, Oscar Martinez Mozos, and Ramon Barber. Semantic Information for Robot Navigation: A Survey. Applied Sciences, 10(2):497, 2020.
- [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, Miami, FL, 2009. IEEE.
- [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020.
- [14] Siqi Du, Weixi Wang, Renzhong Guo, Ruisheng Wang, and Shengjun Tang. AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 7608–7615, Seattle, WA, USA, 2024. IEEE.
- [15] Alexandre Duarte, Francisco Fernandes, João M. Pereira, Catarina Moreira, Jacinto C. Nascimento, and Joaquim Jorge. SelfReDepth: Self-supervised real-time depth restoration for consumer-grade sensors. Journal of Real-Time Image Processing, 21(4):124, 2024.
- [16] Tiyu Fang, Zhen Liang, Xiuli Shao, Zihao Dong, and Jinping Li. Depth Removal Distillation for RGB-D Semantic Segmentation. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2405–2409, Singapore, 2022. IEEE.
- [17] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Distance transforms of sampled functions. Theory of Computing, 8(1):415–428, 2012.
- [18] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
- [19] Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel Brostow. Digging Into Self-Supervised Monocular Depth Estimation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3827–3837, Seoul, Korea (South), 2019. IEEE.
- [20] Zhangxuan Gu, Li Niu, Haohua Zhao, and Liqing Zhang. Hard Pixel Mining for Depth Privileged Semantic Segmentation. IEEE Transactions on Multimedia, 23:3738–3751.
- [21] Yanrong Guo and Tao Chen. Semantic segmentation of RGBD images based on deep depth regression. Pattern Recognition Letters, 109:55–64, 2018.
- [22] Azmi Haider and Hagit Hel-Or. What Can We Learn from Depth Camera Sensor Noise? Sensors, 22(14):5448, 2022.
- [23] Ying He, Bin Liang, Yu Zou, Jin He, and Jun Yang. Depth Errors Analysis and Correction for Time-of-Flight (ToF) Cameras. Sensors, 17(1):92, 2017.
- [24] Lukas Hoyer, Dengxin Dai, Qin Wang, Yuhua Chen, and Luc Van Gool. Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation. International Journal of Computer Vision, 131(8):2070–2096, 2023.
- [25] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024.
- [26] Xuanlun Huang, Chenyang Wu, Xiaolan Xu, Baishun Wang, Sui Zhang, Chihchiang Shen, Chiennan Yu, Jiaxing Wang, Nan Chi, Shaohua Yu, and Connie J. Chang-Hasnain. Polarization structured light 3D depth image sensor for scenes with reflective surfaces. Nature Communications, 14(1):6855, 2023.
- [27] Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, and Xinghao Chen. GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer, 2024.
- [28] Jianbo Jiao, Yunchao Wei, Zequn Jie, Honghui Shi, Rynson Lau, and Thomas S. Huang. Geometry-Aware Distillation for Indoor Semantic Segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2864–2873, Long Beach, CA, USA, 2019. IEEE.
- [29] Faisal Khan, Saqib Salahuddin, and Hossein Javidnia. Deep Learning-Based Monocular Depth Estimation Methods—A State-of-the-Art Review. Sensors, 20(8):2272, 2020.
- [30] Juan Lagos and Esa Rahtu. SemSegDepth: A Combined Model for Semantic Segmentation and Depth Completion. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pages 155–165, 2022. SCITEPRESS - Science and Technology Publications.
- [31] Dong Lao, Fengyu Yang, Daniel Wang, Hyoungseob Park, Samuel Lu, Alex Wong, and Stefano Soatto. On the Viability of Monocular Depth Pre-training for Semantic Segmentation. In Computer Vision – ECCV 2024, pages 340–357. Springer Nature Switzerland, Cham, 2025.
- [32] Kuan-Hui Lee, German Ros, Jie Li, and Adrien Gaidon. SPIGAN: Privileged Adversarial Learning from Simulation.
- [33] Chenfei Liao, Kaiyu Lei, Xu Zheng, Junha Moon, Zhixiong Wang, Yixuan Wang, Danda Pani Paudel, Luc Van Gool, and Xuming Hu. Benchmarking Multi-Modal Semantic Segmentation Under Sensor Failures: Missing and Noisy Modality Robustness. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1567–1577, Nashville, TN, USA, 2025. IEEE.
- [34] Boqian Liu, Haojie Li, Zhihui Wang, and Tianfan Xue. Transparent Depth Completion Using Segmentation Features. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(12):1–19, 2024.
- [35] Yiming Lu, Bin Ge, Chenxing Xia, Xu Zhu, Mengge Zhang, Mengya Gao, Ningjie Chen, Jianjun Hu, and Junjie Zhi. FCEGNet: Feature calibration and edge-guided MLP decoder Network for RGB-D semantic segmentation. Computer Vision and Image Understanding, 260:104448, 2025.
- [36] Harsh Maheshwari, Yen-Cheng Liu, and Zsolt Kira. Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1009–1019, Waikoloa, HI, USA, 2024. IEEE.
- [37] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
- [38] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and Checkerboard Artifacts. Distill, 1(10), 2016.
- [39] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler. IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(3):2354–2367, 2026.
- [40] Ishraq Rached, Rafika Hajji, Tania Landes, and Rashid Haffadi. StructScan3D v1: A First RGB-D Dataset for Indoor Building Elements Segmentation and BIM Modeling. Sensors, 25(11):3461, 2025.
- [41] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks.
- [42] Rene Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision Transformers for Dense Prediction. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159–12168, Montreal, QC, Canada, 2021. IEEE.
- [43] Sonia Raychaudhuri and Angel X. Chang. Semantic Mapping in Indoor Embodied AI – A Survey on Advances, Challenges, and Future Directions, 2025.
- [44] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10892–10902, Montreal, QC, Canada, 2021. IEEE.
- [45] Hanno Scharr. Optimal Filters for Extended Optical Flow. In Complex Motion, pages 14–29. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
- [46] Claudio Schiavella, Lorenzo Cirillo, Lorenzo Papa, Paolo Russo, and Irene Amerini. Efficient attention vision transformers for monocular depth estimation on resource-limited hardware. Scientific Reports, 15(1):24001, 2025.
- [47] Daniel Seichter, Söhnke Benedikt Fischedick, Mona Köhler, and Horst-Michael Groß. Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–10, Padua, Italy, 2022. IEEE.
- [48] Leon Sick, Dominik Engel, Pedro Hermosilla, and Timo Ropinski. Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3637–3646, Seattle, WA, USA, 2024. IEEE.
- [49] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576, Boston, MA, USA, 2015. IEEE.
- [50] Martin Stommel, Michael Beetz, and Weiliang Xu. Inpainting of Missing Values in the Kinect Sensor’s Depth Maps Based on Background Estimates. IEEE Sensors Journal, 14(4):1107–1116, 2014.
- [51] Jingwen Sun, Jing Wu, Ze Ji, and Yu-Kun Lai. A Survey of Object Goal Navigation. IEEE Transactions on Automation Science and Engineering, 22:2292–2308, 2025.
- [52] Wenbo Sun, Zhi Gao, Jinqiang Cui, Bharath Ramesh, Bin Zhang, and Ziyao Li. Semantic Segmentation Leveraging Simultaneous Depth Estimation. Sensors, 21(3):690, 2021.
- [53] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Perez. DADA: Depth-Aware Domain Adaptation in Semantic Segmentation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7363–7372, Seoul, Korea (South), 2019. IEEE.
- [54] Zifu Wan, Pingping Zhang, Yuhao Wang, Silong Yong, Simon Stepputtis, Katia Sycara, and Yaqi Xie. Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1734–1744, Tucson, AZ, USA, 2025. IEEE.
- [55] Changshuo Wang, Chen Wang, Weijun Li, and Haining Wang. A brief survey on RGB-D semantic segmentation using deep learning. Displays, 70:102080, 2021.
- [56] Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. Deep Multimodal Fusion by Channel Exchanging, 2020.
- [57] Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, and Yunhe Wang. Multimodal Token Fusion for Vision Transformers, 2022.
- [58] Shicai Wei, Yang Luo, and Chunbo Luo. One-stage Modality Distillation for Incomplete Multimodal Learning, 2023.
- [59] Suhan Woo, Junhyuk Hyun, Suhyeon Lee, and Euntai Kim. Real-time RGB-D Semantic Segmentation With Scale-invariant Depth Encoding and Noise-robust Fusion. International Journal of Control, Automation and Systems, 23(12):3649–3661, 2025.
- [60] Zongwei Wu, Zhuyun Zhou, Guillaume Allibert, Christophe Stolz, Cédric Demonceaux, and Chao Ma. Transformer fusion for indoor RGB-D semantic segmentation. Computer Vision and Image Understanding, 249:104174, 2024.
- [61] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.
- [62] Xinhua Xu, Jinfu Liu, and Hong Liu. Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation. Electronics, 12(18):3943, 2023.
- [63] Xinhua Xu, Hong Liu, Jianbing Wu, and Jinfu Liu. PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9):8969–8977, 2025.
- [64] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2, 2024.
- [65] Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, and Qibin Hou. DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation, 2023.
- [66] Bo-Wen Yin, Jiao-Long Cao, Ming-Ming Cheng, and Qibin Hou. DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19345–19355, Nashville, TN, USA, 2025. IEEE.
- [67] Jiaming Zhang, Huayao Liu, Kailun Yang, Xinxin Hu, Ruiping Liu, and Rainer Stiefelhagen. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Transactions on Intelligent Transportation Systems, 24(12):14679–14694, 2023.
- [68] Jiaming Zhang, Ruiping Liu, Hao Shi, Kailun Yang, Simon Reiß, Kunyu Peng, Haodong Fu, Kaiwei Wang, and Rainer Stiefelhagen. Delivering Arbitrary-Modal Semantic Segmentation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1136–1147, Vancouver, BC, Canada, 2023. IEEE.
- [69] Jiuling Zhang, Yurong Wu, and Huilong Jiang. Survey on monocular metric depth estimation. Computers, 14(11).
- [70] Richard Zhang. Making convolutional networks shift-invariant again. CoRR, abs/1904.11486, 2019.
- [71] Xu Zheng, Yuanhuiyi Lyu, Jiazhou Zhou, and Lin Wang. Centering the Value of Every Modality: Towards Efficient and Resilient Modality-Agnostic Semantic Segmentation. In Computer Vision – ECCV 2024, pages 192–212. Springer Nature Switzerland, Cham, 2025.
- [72] Li Zhong, Chi Guo, Jiao Zhan, and JingYi Deng. Attention-based fusion network for RGB-D semantic segmentation. Neurocomputing, 608:128371, 2024.
- [73] Luyang Zhu, Arsalan Mousavian, Yu Xiang, Hammad Mazhar, Jozef van Eenbergen, Shoubhik Debnath, and Dieter Fox. RGB-D Local Implicit Function for Depth Completion of Transparent Objects, 2021. arXiv:2104.00622.