AISPO: Enhancing Depth Reliability for Robotic Manipulation of Non-Lambertian Objects via Affine-Invariant Shape Prior

Hongyu Yu; Hua Chen; Hyung Jin Chang; Kun Zhang; Linfang Zheng; Wei Zhang; Zhiming Chen

arxiv: 2606.25503 · v1 · pith:PDSEQ6OBnew · submitted 2026-06-24 · 💻 cs.RO · cs.CV

Zhiming Chen, Linfang Zheng, Kun Zhang, Hyung Jin Chang, Wei Zhang, Hongyu Yu, Hua Chen

Pith reviewed 2026-06-25 21:23 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords depth completion · robotic manipulation · non-Lambertian objects · affine-invariant shape prior · transparent objects · RGB-D feature fusion · geometric consistency

The pith

AISPO combines multi-scale RGB-D feature fusion with an affine-invariant shape prior to produce reliable depth maps for robotic manipulation of non-Lambertian objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a depth completion framework called AISPO to address failures in RGB-D sensors when viewing transparent or highly specular surfaces. These failures produce corrupted or missing depth values that lead to invalid grasp poses and execution errors in robotic manipulation. AISPO fuses RGB and depth features at multiple scales while applying an affine-invariant shape prior to enforce geometric consistency in the output depth maps. The authors argue that this yields depth estimates with greater physical plausibility and structural integrity than methods optimized only for average accuracy metrics. Benchmark tests and real grasping trials show the resulting depth supports higher success rates, especially on transparent objects where prior methods often fail to generate usable maps.

Core claim

AISPO is a depth completion framework that improves depth reliability for manipulation in challenging sensing conditions. AISPO combines multi-scale RGB-D feature fusion with an affine-invariant shape prior to enforce geometric consistency and mitigate catastrophic depth failures. Unlike methods that focus primarily on average depth accuracy, our approach emphasizes physical plausibility and structural integrity of the predicted depth maps. Extensive benchmark evaluations demonstrate competitive performance and strong generalization to unseen objects and novel scenes. Real-world grasping experiments further show that enhanced depth reliability significantly improves manipulation success rates.

What carries the argument

The affine-invariant shape prior combined with multi-scale RGB-D feature fusion: a geometric constraint on depth structure that remains consistent under affine transformations, used to enforce physical plausibility in completed depth maps.
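In the monocular depth literature, an affine-invariant depth map is usually understood as one defined only up to an unknown scale and shift, so it carries shape but not metric distance. The sketch below illustrates that idea only; it is not the paper's method, which fuses the prior through learned features. It recovers a scale `s` and shift `t` by least squares from pixels where the sensor still returns depth, then uses them to fill a hole. The function name `align_scale_shift` and all toy arrays are assumptions made for this example.

```python
import numpy as np

def align_scale_shift(pred, target, valid):
    """Least-squares s, t such that s * pred + t ~= target on valid pixels."""
    x = pred[valid].astype(np.float64)
    y = target[valid].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)  # columns: prediction, constant
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

# Toy data (hypothetical): a metric depth map and a shape-only version of it.
rng = np.random.default_rng(0)
metric = rng.uniform(0.4, 1.2, size=(48, 64))   # "true" depth in metres
shape_prior = (metric - 0.3) / 2.5              # same shape, unknown scale/shift
valid = np.ones_like(metric, dtype=bool)
valid[16:32, 20:44] = False                     # simulated sensor dropout

# Fit on pixels the sensor measured, then evaluate inside the hole.
s, t = align_scale_shift(shape_prior, metric, valid)
completed = s * shape_prior + t
max_err_in_hole = np.abs(completed - metric)[~valid].max()
```

Because the toy prior is an exact affine function of the metric depth, the fit recovers the hidden scale and shift and the hole is filled almost exactly. Real priors are noisy, which is presumably why a learned fusion is preferred over a closed-form alignment.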

If this is right

Depth maps gain structural integrity that reduces invalid grasp poses during motion planning.
Manipulation success rates rise on non-Lambertian objects, particularly transparent surfaces.
The framework generalizes to unseen objects and novel scenes without additional tuning.
Catastrophic depth failures are reduced through enforced geometric consistency across scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior could be tested on depth data from other sensor modalities such as structured light to check transferability.
If the geometric consistency holds, the method might lower the hardware cost of reliable robotic vision systems by relying on standard RGB-D cameras.
Applications beyond grasping, such as obstacle avoidance in cluttered scenes, could benefit if the depth maps remain stable under viewpoint changes.

Load-bearing premise

The affine-invariant shape prior will enforce geometric consistency and produce physically usable depth estimates that generalize to real robotic manipulation without introducing artifacts or requiring scene-specific tuning.

What would settle it

Real-world grasping trials on transparent objects in which AISPO depth maps produce the same or lower success rates than standard depth completion baselines would falsify the claim of improved reliability for manipulation.
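A minimal sketch of how such a comparison could be scored, assuming each method is run for a fixed number of grasp attempts and only success counts are recorded. The counts below are placeholders invented for illustration; they are not results from the paper. The two-proportion z-test is one conventional choice, and with few trials an exact test would be more appropriate.

```python
import math

def two_proportion_z(s_a, n_a, s_b, n_b):
    """z statistic for H0: methods A and B share one success probability."""
    p_a, p_b = s_a / n_a, s_b / n_b
    pooled = (s_a + s_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def one_sided_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Placeholder counts (hypothetical, NOT from the paper):
# method A succeeds 42 times in 50 trials, baseline B 30 times in 50.
z = two_proportion_z(42, 50, 30, 50)
p = one_sided_p(z)

# The claim would be in trouble if z were near zero or negative,
# i.e. if A did no better than the baseline.
claim_supported = z > 0 and p < 0.05
```

Identical counts for both methods give `z = 0` and a one-sided p-value of 0.5, which is the "same success rate" outcome that would count against the claim.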

Figures

Figures reproduced from arXiv: 2606.25503 by Hongyu Yu, Hua Chen, Hyung Jin Chang, Kun Zhang, Linfang Zheng, Wei Zhang, Zhiming Chen.

**Figure 1.** Unlocking reliable depth sensing for robotic manipulation in the real world. AISPO completes missing depth for challenging non-Lambertian objects, such as transparent and reflective items, by combining learned shape priors and multi-modal visual cues. Shown: (a) real-world robotic scene, (b) input RGB, (c) detected non-Lambertian object mask, (d) raw depth with large missing regions, (e) our completed depth…

**Figure 2.** Overview of the shape prior auto-encoder architecture. Encoder εθ processes the RGB image and the non-Lambertian object mask to extract the shape prior latent code z. Subsequently, decoder Dθ decodes z into an affine-invariant shape representation of the non-Lambertian object. Notably, the output contains only the object-level shape prior, with background information fully excluded. As illustrated in…

**Figure 3.** Framework overview. Our framework extracts complementary cues via three parallel encoders. The RGB encoder ϕθ processes image I for color and texture. Concurrently, a pretrained shape-prior encoder εθ takes I and a non-Lambertian mask (from Grounded-SAM2 [36]) to capture object-shape features. A Swin-Transformer-based depth encoder ψθ handles the raw depth input. These streams yield multi-scale intermediat…

**Figure 4.** Real-world qualitative results on non-Lambertian objects in a robotic manipulation environment. Using only synthetic data (DREDS-CatKnown [11]) for training the framework, our model generalizes to real household objects with severe depth corruption (e.g., reflective cans, water-filled bottles, transparent cups). It recovers dense depth and complete 3D geometry, accurately reconstructing fine structures like…

**Figure 5.** Qualitative comparison on the DREDS-CatKnown dataset [11]. Each column (from left to right) shows the RGB image, the predictions from the baselines, and our method. Columns: RGB, PromptDA, DA2, DFNet, SwinDRNet, DA3*, Depth Pro*, Pi3, Ours.

**Figure 6.** Zero-shot qualitative comparison on the ClearGrasp dataset [6]. Models trained on the DREDS-CatKnown dataset [11] are directly evaluated on the ClearGrasp dataset. Columns: RGB, Raw, Ours, GT.

**Figure 7.** Qualitative results of our method on the STDCatNovel dataset [11]. For better visualization, we crop our prediction to align with the ground-truth missing region. Columns: RGB, Raw, Ours, GT.

**Figure 8.** Qualitative results on ClearPose [45]. Columns show RGB input, raw depth, our prediction, and ground truth.

… for DepthPro. As reported in Table V, our method achieves 28.79 ms per frame, supporting real-time operation in robotic grasping scenarios and providing a favorable trade-off between inference speed and depth estimation performance.

Table V. Method: SwinDR, DFNet, DA2(vitl), DA3(large), Pi3, Depth-Pro, Ours. Infer Speed…
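The 28.79 ms per-frame figure quoted above can be turned into a frame rate and compared against a camera's frame budget. The short calculation below does this; the 30 FPS camera rate is an assumption chosen for illustration and is not stated in the source.

```python
latency_ms = 28.79                 # per-frame inference time quoted from Table V
fps = 1000.0 / latency_ms          # frames the model could process per second

camera_fps = 30.0                  # ASSUMPTION: a typical RGB-D stream rate
frame_budget_ms = 1000.0 / camera_fps
headroom_ms = frame_budget_ms - latency_ms   # time left for grasp planning etc.

keeps_up_with_camera = headroom_ms > 0
```

Under that assumption the model runs at roughly 34.7 frames per second and leaves a few milliseconds per frame for the rest of the pipeline, which is consistent with the real-time claim but leaves little slack.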

Original abstract

Reliable depth perception is critical for robotic manipulation, especially for non-Lambertian objects such as transparent or highly specular surfaces, where raw depth measurements are often corrupted or missing. These failures frequently propagate to motion planning, resulting in invalid grasp poses and execution errors. We propose AISPO, a depth completion framework that improves depth reliability for manipulation in challenging sensing conditions. AISPO combines multi-scale RGB-D feature fusion with an affine-invariant shape prior to enforce geometric consistency and mitigate catastrophic depth failures. Unlike methods that focus primarily on average depth accuracy, our approach emphasizes physical plausibility and structural integrity of the predicted depth maps. Extensive benchmark evaluations demonstrate competitive performance and strong generalization to unseen objects and novel scenes. Real-world grasping experiments further show that enhanced depth reliability significantly improves manipulation success rates, particularly for transparent objects where many existing methods fail to produce physically usable depth estimates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AISPO adds an affine-invariant shape prior to multi-scale RGB-D fusion for depth on shiny objects, but the abstract supplies no numbers or comparisons so the actual gain is impossible to judge.

The main takeaway is that this work targets depth failures on transparent and specular surfaces by layering an affine-invariant shape prior onto multi-scale feature fusion, then shows the resulting depth helps real grasping. That focus on physical usability over mean error is a reasonable priority for manipulation tasks.

What the paper does well is identify a concrete failure mode in robotic perception and tie the method directly to downstream success rates. Mentioning both benchmark generalization and physical robot trials gives the claim some grounding even if the numbers are missing here.

The soft spots are straightforward. The abstract gives no metrics, no baselines, no ablation on the prior itself, and no implementation details, so there is no way to tell whether the prior is doing real work or whether the gains come from better fusion or different training. The assumption that an affine-invariant prior will produce artifact-free depth across novel scenes without scene-specific tuning is stated but not evidenced. Without the full methods or results sections it is hard to know if the approach is new or just a modest extension of existing completion networks.

This is for roboticists who already work on depth estimation for grasping and want ideas for non-Lambertian cases. A reader in that niche might pick up the emphasis on structural integrity, but most others will wait for the numbers.

I would send it to peer review. The problem is practical, the proposed fix is testable, and the real-robot experiments are the right kind of evidence even if they need more scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper proposes AISPO, a depth completion framework for robotic manipulation of non-Lambertian objects (e.g., transparent or specular surfaces). It combines multi-scale RGB-D feature fusion with an affine-invariant shape prior to enforce geometric consistency, mitigate depth failures, and prioritize physical plausibility over average accuracy. The manuscript claims competitive benchmark performance with strong generalization to unseen objects/scenes, plus improved real-world grasping success rates (especially for transparent objects) via enhanced depth reliability.

Significance. If the empirical claims hold, the work could meaningfully advance reliable depth estimation for robotics in challenging sensing conditions, where standard methods often fail to produce usable outputs. The focus on structural integrity and physical plausibility, along with the affine-invariant prior, represents a targeted design choice that aligns with manipulation needs rather than generic depth metrics.

major comments (2)

[Abstract] The central claims of 'competitive performance', 'strong generalization to unseen objects and novel scenes', and 'significantly improves manipulation success rates' are asserted without any supporting quantitative metrics, baselines, ablation studies, or result tables. This absence is load-bearing for evaluating whether the method delivers on its promises.
The manuscript does not provide evidence addressing the risk that the affine-invariant shape prior may introduce artifacts or require scene-specific tuning, which is central to the weakest assumption underlying the generalization and grasping claims.
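For context on what "supporting quantitative metrics" usually means in depth completion, the sketch below computes the error measures commonly reported in this area: RMSE, MAE, mean relative error, and threshold accuracy δ. Whether the paper reports exactly these, and over which pixels, is not stated in the text available here; the function name `depth_metrics`, the thresholds, and the toy arrays are assumptions.

```python
import numpy as np

def depth_metrics(pred, gt, mask):
    """Standard depth errors over masked pixels with valid (positive) ground truth.

    Predictions are assumed to be strictly positive inside the mask.
    """
    m = mask & (gt > 0)
    p = pred[m].astype(np.float64)
    g = gt[m].astype(np.float64)
    err = p - g
    ratio = np.maximum(p / g, g / p)   # symmetric ratio used by delta accuracy
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "rel": float(np.mean(np.abs(err) / g)),
        "delta_1.05": float(np.mean(ratio < 1.05)),
        "delta_1.25": float(np.mean(ratio < 1.25)),
    }

# Toy example (hypothetical): a flat surface at 1 m with one bad pixel.
gt = np.full((4, 4), 1.0)
pred = gt.copy()
pred[0, 0] = 1.2                      # a single 20 cm error
mask = np.ones_like(gt, dtype=bool)   # e.g. the non-Lambertian object mask
metrics = depth_metrics(pred, gt, mask)
```

Averages like these can look good even when a single object is badly wrong, which is the gap between mean accuracy and the physical usability the paper emphasizes.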

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the abstract and add discussion on the shape prior.

Referee: [Abstract] The central claims of 'competitive performance', 'strong generalization to unseen objects and novel scenes', and 'significantly improves manipulation success rates' are asserted without any supporting quantitative metrics, baselines, ablation studies, or result tables. This absence is load-bearing for evaluating whether the method delivers on its promises.

Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript includes benchmark tables, ablation studies, and grasping success rates in Sections 4 and 5. In revision we will update the abstract to include key metrics (e.g., depth completion errors on transparent objects and grasping success improvements) drawn from those sections. revision: yes
Referee: The manuscript does not provide evidence addressing the risk that the affine-invariant shape prior may introduce artifacts or require scene-specific tuning, which is central to the weakest assumption underlying the generalization and grasping claims.

Authors: We acknowledge the concern. Section 3.2 describes the prior as parameter-free and affine-invariant to promote generalization, and Section 4 reports results on unseen objects and scenes. However, the manuscript does not explicitly analyze potential artifacts or tuning sensitivity. We will add a limitations paragraph and supporting qualitative examples in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and described claims present AISPO as a depth completion framework that fuses multi-scale RGB-D features with an affine-invariant shape prior to enforce geometric consistency. No load-bearing steps reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method is framed as a combination of standard techniques with empirical validation on benchmarks and real-world grasping, without any quoted equations or premises that equate predictions to their own inputs. This is a self-contained empirical proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only view limits the ledger. The central innovation appears to rest on an introduced shape prior whose independent grounding is not shown.

invented entities (1)

Affine-invariant shape prior (no independent evidence)
purpose: Enforce geometric consistency and mitigate depth failures in non-Lambertian scenes
Presented as the key mechanism for physical plausibility; no independent evidence or derivation supplied in abstract.

Reference graph

Works this paper leans on

44 extracted references · 12 canonical work pages · 6 internal anchors

IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED APRIL, 2026

[1] Z. Jiang, Y. Zhu, M. Svetlik, K. Fang, and Y. Zhu, “Synergies between affordance and geometry: 6-dof grasp detection via implicit representations,” ArXiv, vol. abs/2104.01542, 2021.

[2] T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su, “Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,” in NeurIPS Datasets and Benchmarks, 2021.

[3] J. Krantz, S. Banerjee, W. Zhu, J. J. Corso, P. Anderson, S. Lee, and J. Thomason, “Iterative vision-and-language navigation,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14921–14930, 2022.

[4] F. Erich, B. Leme, N. Ando, R. Hanai, and Y. Domae, “Learning depth completion of transparent objects using augmented unpaired data,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023.

[5] A. Ummadisingu, J. Choi, K. Yamane, S. Masuda, N. Fukaya, and K. Takahashi, “Said-nerf: Segmentation-aided nerf for depth completion of transparent objects,” 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7535–7542, 2024.

[6] S. S. Sajjan, M. J. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, and S. Song, “Clear grasp: 3d shape estimation of transparent objects for manipulation,” 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3634–3642, 2019.

[7] L. Zhu, A. Mousavian, Y. Xiang, H. Mazhar, J. van Eenbergen, S. Debnath, and D. Fox, “Rgb-d local implicit function for depth completion of transparent objects,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4647–4656, 2021.

[8] H. Fang, H. Fang, S. Xu, and C. Lu, “Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline,” IEEE Robotics and Automation Letters, vol. PP, pp. 1–8, 2022.

[9] B. P. Duisterhof, Y. Mao, S. H. Teng, and J. Ichnowski, “Residual-nerf: Learning residual nerfs for transparent object manipulation,” 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 13918–13924, 2024.

[10] H. Xu, Y. R. Wang, S. Eppel, A. Aspuru-Guzik, F. Shkurti, and A. Garg, “Seeing glass: Joint point cloud and depth completion for transparent objects,” in Conference on Robot Learning, 2021.

[11] Q. Dai, J. Zhang, Q. Li, T. Wu, H. Dong, Z. Liu, P. Tan, and H. Wang, “Domain randomization-enhanced depth simulation and restoration for perceiving and grasping specular and transparent objects,” in European Conference on Computer Vision (ECCV), 2022.

[12] K. Chen, S. Wang, B. Xia, D. Li, Z. Kan, and B. Li, “Tode-trans: Transparent object depth estimation with transformer,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 4880–4886, 2022.

[13] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10002, 2021.

[15] R. Birkl, D. Wofk, and M. Müller, “Midas v3.1 - a model zoo for robust monocular relative depth estimation,” ArXiv, vol. abs/2307.14460, 2023.

[16] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[17] T.-Y. Cheng, P. Sharma, A. Markham, N. Trigoni, and V. Jampani, “Zest: Zero-shot material transfer from a single image,” in European Conference on Computer Vision, 2024.

[18] J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” ArXiv, vol. abs/2404.07199, 2024.

[19] M. Hu, W. Yin, C. X. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen, “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 10579–10596, 2024.

[20] H. Lin, S. Peng, J. Chen, S. Peng, J. Sun, M. Liu, H. Bao, J. Feng, X. Zhou, and B. Kang, “Prompting depth anything for 4k resolution accurate metric depth estimation,” 2024.

[21] A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, “Depth pro: Sharp monocular metric depth in less than a second,” in International Conference on Learning Representations, 2025.

[22] J. Liu, H. Ma, Y. Guo, Y. Zhao, C. Zhang, W. Sui, and W. Zou, “Monocular depth estimation and segmentation for transparent object with iterative semantic and geometric fusion,” arXiv preprint arXiv:2502.14616, 2025.

[23] Y. Wang, J. Zhou, H. Zhu, W. Chang, Y. Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “$\pi^3$: Permutation-equivariant visual geometry learning,” arXiv preprint arXiv:2507.13347, 2025.

[24] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth anything 3: Recovering the visual space from any views,” arXiv preprint arXiv:2511.10647, 2025.

[25] T. Kollar, M. Laskey, K. Stone, B. Thananjeyan, and M. Tjersland, “Simnet: Enabling robust unknown object manipulation from pure synthetic data via stereo,” ArXiv, vol. abs/2106.16118, 2021.

[26] S. Wei, H. Geng, J. Chen, C. Deng, W. Cui, C. Zhao, X. Fang, L. J. Guibas, and H. Wang, “D3roma: Disparity diffusion-based depth sensing for material-agnostic robotic manipulation,” in Conference on Robot Learning, 2024.

[27] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” ArXiv, vol. abs/2006.11239, 2020.

[28] Y. R. Wang, Y. Zhao, H. Xu, S. Eppel, A. Aspuru-Guzik, F. Shkurti, and A. Garg, “Mvtrans: Multi-view perception of transparent objects,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 3771–3778, 2023.

[29] K. Bai, H. Zeng, L. Zhang, Y. Liu, H. Xu, Z. Chen, J. Zhang, and C. Parameters, “Cleardepth: Efficient stereo perception of transparent objects for robotic manipulation.”

[30] K. Bai, L. Zhang, Y. Liu, Z. Chen, and J. Zhang, “Stereoanything: Advanced zero-shot stereo imaging for robotic grasp detection with transparent objects.” IEEE transactions on cybernetics, vol. PP, 2026.

[31] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf,” Communications of the ACM, vol. 65, pp. 99–106, 2020.

[32] J. Ichnowski, Y. Avigal, J. Kerr, and K. Goldberg, “Dex-nerf: Using a neural radiance field to grasp transparent objects,” ArXiv, vol. abs/2110.14217, 2021.

[33] J. Kerr, L. Fu, H. Huang, Y. Avigal, M. Tancik, J. Ichnowski, A. Kanazawa, and K. Goldberg, “Evo-nerf: Evolving nerf for sequential robot grasping of transparent objects,” in Conference on Robot Learning, 2022.

[34] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022.

[35] Q. Dai, Y. Zhu, Y. Geng, C. Ruan, J. Zhang, and H. Wang, “Graspnerf: Multiview-based 6-dof grasp detection for transparent and specular objects using generalizable nerf,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 1757–1763, 2022.

[36] IDEA-Research, “Grounded-sam-2,” https://github.com/IDEA-Research/Grounded-SAM-2, 2024, accessed: 2024-08-07.

[37] B. Jähne and H. W. Haussecker, “Computer vision and applications: a guide for students and practitioners,” Journal of Electronic Imaging, vol. 11, pp. 115–115, 2000.

[38] M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features...

[39] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv:2406.09414, 2024.

[40] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12159–12168, 2021.

[41] G. Lin, A. Milan, C. Shen, and I. D. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5168–5177, 2016.

[42] K. Xian, C. Shen, Z. Cao, H. Lu, Y. Xiao, R. Li, and Z. Luo, “Monocular relative depth perception with web stereo data supervision,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 311–320, 2018.

[43] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” ArXiv, vol. abs/1512.03012, 2015.

[44] M. Community, “Moveit: Motion planning framework for robotics,” https://github.com/moveit/moveit, 2025, accessed: 2025-04-07.

[45] X. Chen, H. Zhang, Z. Yu, A. Opipari, and O. C. Jenkins, “Clearpose: Large-scale transparent object dataset and benchmark,” in European Conference on Computer Vision, 2022.

[1] [1]

Synergies between affordance and geometry: 6-dof grasp detection via implicit representa- tions,

Z. Jiang, Y . Zhu, M. Svetlik, K. Fang, and Y . Zhu, “Synergies between affordance and geometry: 6-dof grasp detection via implicit representa- tions,”ArXiv, vol. abs/2104.01542, 2021. 8 IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED APRIL, 2026

work page arXiv 2021

[2] [2]

Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,

T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su, “Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,” inNeurIPS Datasets and Benchmarks, 2021

2021

[3] [3]

Iterative vision-and-language navigation,

J. Krantz, S. Banerjee, W. Zhu, J. J. Corso, P. Anderson, S. Lee, and J. Thomason, “Iterative vision-and-language navigation,”2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14 921–14 930, 2022

2023

[4] [4]

Learning depth completion of transparent objects using augmented unpaired data,

F. Erich, B. Leme, N. Ando, R. Hanai, and Y . Domae, “Learning depth completion of transparent objects using augmented unpaired data,” in2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023

2023

[5] [5]

Said-nerf: Segmentation-aided nerf for depth completion of transparent objects,

A. Ummadisingu, J. Choi, K. Yamane, S. Masuda, N. Fukaya, and K. Takahashi, “Said-nerf: Segmentation-aided nerf for depth completion of transparent objects,”2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7535–7542, 2024

2024

[6] [6]

Clear grasp: 3d shape estimation of transparent objects for manipulation,

S. S. Sajjan, M. J. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, and S. Song, “Clear grasp: 3d shape estimation of transparent objects for manipulation,”2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3634–3642, 2019

2020

[7] [7]

Rgb-d local implicit function for depth completion of transparent objects,

L. Zhu, A. Mousavian, Y . Xiang, H. Mazhar, J. van Eenbergen, S. Deb- nath, and D. Fox, “Rgb-d local implicit function for depth completion of transparent objects,”2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4647–4656, 2021

2021

[8] [8]

Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline,

H. Fang, H. Fang, S. Xu, and C. Lu, “Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline,” IEEE Robotics and Automation Letters, vol. PP, pp. 1–8, 2022

2022

[9] [9]

Residual- nerf: Learning residual nerfs for transparent object manipulation,

B. P. Duisterhof, Y . Mao, S. H. Teng, and J. Ichnowski, “Residual- nerf: Learning residual nerfs for transparent object manipulation,”2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 13 918–13 924, 2024

2024

[10] H. Xu, Y. R. Wang, S. Eppel, A. Aspuru-Guzik, F. Shkurti, and A. Garg, “Seeing glass: Joint point cloud and depth completion for transparent objects,” in Conference on Robot Learning, 2021.

[11] Q. Dai, J. Zhang, Q. Li, T. Wu, H. Dong, Z. Liu, P. Tan, and H. Wang, “Domain randomization-enhanced depth simulation and restoration for perceiving and grasping specular and transparent objects,” in European Conference on Computer Vision (ECCV), 2022.

[12] K. Chen, S. Wang, B. Xia, D. Li, Z. Kan, and B. Li, “Tode-trans: Transparent object depth estimation with transformer,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 4880–4886, 2023.

[13] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10 002, 2021.

[15] R. Birkl, D. Wofk, and M. Müller, “Midas v3.1 - a model zoo for robust monocular relative depth estimation,” ArXiv, vol. abs/2307.14460, 2023.

[16] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[17] T.-Y. Cheng, P. Sharma, A. Markham, N. Trigoni, and V. Jampani, “Zest: Zero-shot material transfer from a single image,” in European Conference on Computer Vision, 2024.

[18] J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” ArXiv, vol. abs/2404.07199, 2024.

[19] M. Hu, W. Yin, C. X. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen, “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 10 579–10 596, 2024.

[20] H. Lin, S. Peng, J. Chen, S. Peng, J. Sun, M. Liu, H. Bao, J. Feng, X. Zhou, and B. Kang, “Prompting depth anything for 4k resolution accurate metric depth estimation,” 2024.

[21] A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, “Depth pro: Sharp monocular metric depth in less than a second,” in International Conference on Learning Representations, 2025.

[22] J. Liu, H. Ma, Y. Guo, Y. Zhao, C. Zhang, W. Sui, and W. Zou, “Monocular depth estimation and segmentation for transparent object with iterative semantic and geometric fusion,” arXiv preprint arXiv:2502.14616, 2025.

[23] Y. Wang, J. Zhou, H. Zhu, W. Chang, Y. Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “$\pi^3$: Permutation-equivariant visual geometry learning,” arXiv preprint arXiv:2507.13347, 2025.

[24] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth anything 3: Recovering the visual space from any views,” arXiv preprint arXiv:2511.10647, 2025.

[25] T. Kollar, M. Laskey, K. Stone, B. Thananjeyan, and M. Tjersland, “Simnet: Enabling robust unknown object manipulation from pure synthetic data via stereo,” ArXiv, vol. abs/2106.16118, 2021.

[26] S. Wei, H. Geng, J. Chen, C. Deng, W. Cui, C. Zhao, X. Fang, L. J. Guibas, and H. Wang, “D3roma: Disparity diffusion-based depth sensing for material-agnostic robotic manipulation,” in Conference on Robot Learning, 2024.

[27] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” ArXiv, vol. abs/2006.11239, 2020.

[28] Y. R. Wang, Y. Zhao, H. Xu, S. Eppel, A. Aspuru-Guzik, F. Shkurti, and A. Garg, “Mvtrans: Multi-view perception of transparent objects,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 3771–3778, 2023.

[29] K. Bai, H. Zeng, L. Zhang, Y. Liu, H. Xu, Z. Chen, J. Zhang, and C. Parameters, “Cleardepth: Efficient stereo perception of transparent objects for robotic manipulation.”

[30] K. Bai, L. Zhang, Y. Liu, Z. Chen, and J. Zhang, “Stereoanything: Advanced zero-shot stereo imaging for robotic grasp detection with transparent objects,” IEEE Transactions on Cybernetics, vol. PP, 2026.

[31] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf,” Communications of the ACM, vol. 65, pp. 99–106, 2020.

[32] J. Ichnowski, Y. Avigal, J. Kerr, and K. Goldberg, “Dex-nerf: Using a neural radiance field to grasp transparent objects,” ArXiv, vol. abs/2110.14217, 2021.

[33] J. Kerr, L. Fu, H. Huang, Y. Avigal, M. Tancik, J. Ichnowski, A. Kanazawa, and K. Goldberg, “Evo-nerf: Evolving nerf for sequential robot grasping of transparent objects,” in Conference on Robot Learning, 2022.

[34] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022.

[35] Q. Dai, Y. Zhu, Y. Geng, C. Ruan, J. Zhang, and H. Wang, “Graspnerf: Multiview-based 6-dof grasp detection for transparent and specular objects using generalizable nerf,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 1757–1763, 2023.

[36] IDEA-Research, “Grounded-sam-2,” https://github.com/IDEA-Research/Grounded-SAM-2, 2024, accessed: 2024-08-07.

[37] B. Jähne and H. W. Haussecker, “Computer vision and applications: a guide for students and practitioners,” Journal of Electronic Imaging, vol. 11, pp. 115–115, 2000.

[38] M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervision,” ... 2023

[39] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv:2406.09414, 2024.

[40] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12 159–12 168, 2021.

[41] G. Lin, A. Milan, C. Shen, and I. D. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5168–5177, 2017.

[42] K. Xian, C. Shen, Z. Cao, H. Lu, Y. Xiao, R. Li, and Z. Luo, “Monocular relative depth perception with web stereo data supervision,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 311–320, 2018.

[43] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” ArXiv, vol. abs/1512.03012, 2015.

[44] M. Community, “Moveit: Motion planning framework for robotics,” https://github.com/moveit/moveit, 2025, accessed: 2025-04-07.

[45] X. Chen, H. Zhang, Z. Yu, A. Opipari, and O. C. Jenkins, “Clearpose: Large-scale transparent object dataset and benchmark,” in European Conference on Computer Vision, 2022.