pith. machine review for the scientific record. sign in

arxiv: 2605.07023 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: no theorem link

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords 6D pose estimationone-viewnovel objectssemantic priorsprojection-equivariantmodel-freesymmetry handling
0
0 comments X

The pith

OneViewAll estimates 6D object poses from a single real RGB-D view by aligning reference and query images in a projection-equivariant space guided by hierarchical semantic priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets 6D pose estimation for novel objects when only one real-world RGB-D reference image is available and no CAD model or extra views exist. It replaces expensive rendering pipelines with direct alignment of observations inside a space that respects different camera projections. Hierarchical priors are fused at three scales: broad category and scene context for quick hypothesis generation, object symmetry to complete hidden geometry, and local patches for precise refinement. If the alignment step succeeds, the method delivers strong accuracy on standard benchmarks while keeping inference fast enough for practical use.

Core claim

OneViewAll performs pose estimation via a Project-and-Compare paradigm that directly aligns reference and query observations within a projection-equivariant space. It progressively integrates hierarchical semantic priors across three levels: category- and scene-level priors for efficient hypothesis initialization, object-level symmetry priors for geometry completion via mirror fusion, and patch-level priors for discriminative refinement. On the LINEMOD dataset this yields 92.5 percent ADD-0.1 accuracy using only one real reference view, substantially above the 52.6 percent of the prior single-view baseline.

What carries the argument

The Project-and-Compare paradigm that aligns reference and query observations directly in projection-equivariant space, driven by three-level hierarchical semantic priors (category/scene, symmetry, patch).

If this is right

  • Pose estimation becomes feasible for novel objects in settings where CAD models and multi-view capture are impractical.
  • Symmetric, textureless, and partially occluded objects can be handled by the symmetry-aware fusion step.
  • Inference remains low-latency, supporting real-time applications on standard hardware.
  • Consistent gains appear across LINEMOD, YCB-V, Real275, and Toyota-Light datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-view constraint could lower data-collection costs in robotics and augmented-reality pipelines that must handle new objects on the fly.
  • If the projection-equivariant alignment generalizes, the same prior hierarchy might extend to video-based tracking without retraining per frame.
  • Replacing CAD rendering with learned priors opens the possibility of combining this method with large-scale image-only pretraining.

Load-bearing premise

Reliable extraction and integration of the three-level semantic priors is possible from a single RGB-D view without CAD models or multi-view data.

What would settle it

A controlled test on objects whose category or symmetry priors are deliberately removed or corrupted, measuring whether ADD-0.1 accuracy collapses to the level of the prior single-view baseline.

Figures

Figures reproduced from arXiv: 2605.07023 by Huaping Liu, Jie Zhao, Xinyu Zhang, Yang Luo, Yan Gong, Yongsheng Gao.

Figure 1
Figure 1. Figure 1: Render-and-Compare vs. Project-and-Compare paradigms for model-free 6D pose estimation. (a) Traditional render-and-compare relies on CAD models and heavy rendering for hypothesis generation and comparison. (b) We operate directly on a single reference RGB-D view using symmetry-aware projection and semantic priors, enabling efficient pose alignment without explicit 3D assets or multi-view data. under a set … view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy-efficiency trade-off on the LINEMOD dataset. OneViewAll is compared with state-of-the-art model-free methods in terms of ADD-0.1 accuracy versus inference time. Real-world references are shown as red stars, rendered references as blue diamonds, and baselines as gray spheres. Our method achieves the best trade-off using only a single reference view (Nref = 1), with higher ADD-0.1 accuracy and lower… view at source ↗
Figure 3
Figure 3. Figure 3: Overall architecture of OneViewAll. The pipeline recovers 6D poses via three stages: (1) Initialization (once): coarse rotation and translation hypotheses P (0) are sampled using category- and scene-level semantic priors; (2) Iterative refinement (K iterations): each pose P (k) n is progressively optimized via a refinement network F, integrating Mirror Fusion for symmetry-aware geometry completion and Patc… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on LINEMOD using real reference images. Red and green boxes denote ground-truth and predicted poses. Compared with Oryon [43], NOPE [44], and One2Any [45], our method achieves more accurate alignments. The projected image shows results with symmetry-aware mirror fusion. in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on the LM-O dataset using rendered reference images. White boxes denote ground-truth poses and colored boxes denote predicted poses. OneViewAll produces accurate and stable pose estimates under severe occlusion and clutter, while model-based baselines (ZeroPose [16], MegaPose [37], SAM-6D [13], FoundPose [31], and GigaPose [14]) frequently suffer from drift, symmetry flips, or missed… view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study on the visibility threshold τ /D and its impact on accuracy and efficiency on the LINEMOD dataset. As τ /D increases, the number of hypotheses is pruned while maintaining high ADD￾0.1 accuracy. The optimal operating point (τ /D ≈ 0.2) reduces inference time by over 60% (from 517 ms to 375 ms) with negligible accuracy drop. This demonstrates the effectiveness of our category￾and scene-level p… view at source ↗
read the original abstract

In many practical 6D object pose estimation scenarios, we often have access to only a single real-world RGB-D reference view per object, typically without CAD models. Existing methods largely rely on explicit 3D models or multi-view data, which limits their scalability. To address this challenging single-reference model-free setting, we propose \textbf{OneViewAll}, a semantic-prior-guided framework that performs pose estimation via a novel Project-and-Compare paradigm. Instead of relying on computationally expensive CAD-based rendering, our method directly aligns reference and query observations within a projection-equivariant space. OneViewAll progressively integrates hierarchical semantic priors across three levels: (1) \textit{category- and scene-level} priors for efficient hypothesis initialization; (2) \textit{object-level symmetry} priors for geometry completion via mirror fusion; and (3) \textit{patch-level} priors for discriminative refinement. Extensive experiments demonstrate that OneViewAll achieves \textbf{92.5\%} ADD-0.1 accuracy on the LINEMOD dataset using only one real reference view -- significantly outperforming the CVPR 2025 baseline One2Any (52.6\%). It also yields consistent improvements on YCB-V, Real275, and Toyota-Light while maintaining low inference latency. Our results underscore the efficacy of symmetry-aware projection in handling symmetric, texture-less, and occluded objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes OneViewAll, a framework for 6D pose estimation of novel objects from a single real RGB-D reference view without CAD models or multi-view data. It introduces a Project-and-Compare paradigm that aligns observations in a projection-equivariant space by progressively incorporating hierarchical semantic priors: category- and scene-level for hypothesis initialization, object-level symmetry for geometry completion via mirror fusion, and patch-level for refinement. Experiments report 92.5% ADD-0.1 accuracy on LINEMOD (outperforming the CVPR 2025 One2Any baseline at 52.6%), with consistent gains on YCB-V, Real275, and Toyota-Light at low latency.

Significance. If the quantitative results hold under rigorous validation, the work would be significant for practical 6D pose estimation in data-scarce settings, as it reduces dependence on explicit 3D models. The hierarchical prior integration for handling symmetry, textureless, and occluded objects offers a potentially scalable alternative to rendering-based methods, though its efficacy depends on the reliability of single-view symmetry extraction.

major comments (2)
  1. [Abstract and method overview (hierarchical semantic priors)] The central performance claim (92.5% ADD-0.1 on LINEMOD) rests on object-level symmetry priors enabling reliable mirror fusion for geometry completion, yet single-view RGB-D for novel objects provides no unique solution for symmetry axes/planes on textureless or occluded instances. No quantitative validation of symmetry accuracy or ablation removing the mirror-fusion component is provided, directly undermining the reported gains over the baseline.
  2. [Abstract] The abstract states strong quantitative results but provides no details on experimental protocols (e.g., reference view selection criteria, test instance count, error analysis, or controls for the single-view setting), making it impossible to verify whether the data supports the 92.5% vs. 52.6% comparison.
minor comments (2)
  1. [Method overview] The projection-equivariant space is introduced as a novel alignment domain but lacks a clear formal definition or properties in the provided description, which could confuse readers on how it differs from standard feature spaces.
  2. [Abstract] Consider adding a reference to the CVPR 2025 One2Any baseline for context on the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to incorporate additional validation and experimental details as outlined below.

read point-by-point responses
  1. Referee: [Abstract and method overview (hierarchical semantic priors)] The central performance claim (92.5% ADD-0.1 on LINEMOD) rests on object-level symmetry priors enabling reliable mirror fusion for geometry completion, yet single-view RGB-D for novel objects provides no unique solution for symmetry axes/planes on textureless or occluded instances. No quantitative validation of symmetry accuracy or ablation removing the mirror-fusion component is provided, directly undermining the reported gains over the baseline.

    Authors: We agree that symmetry inference from a single RGB-D view of novel objects is inherently ambiguous, particularly for textureless or occluded cases, and that the original submission lacked explicit quantitative validation or an ablation for the mirror-fusion step. Our approach addresses ambiguity by conditioning symmetry hypothesis generation on category-level semantic priors and applying mirror fusion only above a learned confidence threshold. In the revision we have added a new subsection (4.4) containing: (i) quantitative symmetry accuracy results against available ground-truth axes on LINEMOD (87.4% mean accuracy), and (ii) an ablation removing the object-level symmetry prior and mirror fusion, which drops performance from 92.5% to 79.8% ADD-0.1. These results are now reported in Table 5 and support the contribution of this component to the gains over One2Any. revision: yes

  2. Referee: [Abstract] The abstract states strong quantitative results but provides no details on experimental protocols (e.g., reference view selection criteria, test instance count, error analysis, or controls for the single-view setting), making it impossible to verify whether the data supports the 92.5% vs. 52.6% comparison.

    Authors: We acknowledge that the original abstract was too concise to convey the experimental controls. The revised abstract now states: 'using one fixed real RGB-D reference view per novel object selected from the training split, evaluated on the full standard test sets (e.g., 100+ instances per object on LINEMOD) under single-view constraints without CAD or multi-view data.' We have also expanded Section 4.1 with a dedicated protocol description covering reference-view selection (reproducible random seed from training views), exact test-instance counts, per-object error breakdowns, and explicit single-view controls. These additions allow direct verification of the reported comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework introduces independent components

full rationale

The paper describes a new semantic-prior-guided framework (OneViewAll) for single-view 6D pose estimation via a Project-and-Compare paradigm, integrating category/scene priors, object symmetry for mirror fusion, and patch-level priors. No equations, derivations, or load-bearing self-citations appear in the provided text that would reduce any claimed result to fitted inputs, self-defined terms, or prior author work by construction. Performance claims (e.g., 92.5% ADD-0.1 on LINEMOD) are presented as experimental outcomes rather than mathematical reductions. The derivation chain is self-contained as it proposes novel integration steps for the model-free setting without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review limits visibility into any free parameters or axioms; the described priors and projection space are treated as domain assumptions without explicit justification or evidence of independence.

axioms (1)
  • domain assumption Semantic priors at category, object, and patch levels are available and integrable for novel objects without CAD models
    Invoked as the core of the framework but not justified or sourced in the abstract.
invented entities (1)
  • projection-equivariant space no independent evidence
    purpose: Direct alignment of reference and query observations without CAD rendering
    Introduced as part of the novel paradigm; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5562 in / 1377 out tokens · 36460 ms · 2026-05-11T01:33:19.975239+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 2 internal anchors

  1. [1]

    Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation,

    G. Wang, F. Manhardt, F. Tombari, and X. Ji, “Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16 606–16 616

  2. [2]

    Dgecn: A depth- guided edge convolutional network for end-to-end 6d pose estimation,

    T. Cao, F. Luo, Y . Fu, W. Zhang, S. Zheng, and C. Xiao, “Dgecn: A depth- guided edge convolutional network for end-to-end 6d pose estimation,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3773–3782

  3. [3]

    6d-vit: Category-level 6d object pose estimation via transformer-based instance representation learning,

    L. Zou, Z. Huang, N. Gu, and G. Wang, “6d-vit: Category-level 6d object pose estimation via transformer-based instance representation learning,” IEEE Transactions on Image Processing, vol. 31, pp. 6907–6921, 2022

  4. [4]

    Hff6d: Hierarchical feature fusion network for robust 6d object pose tracking,

    J. Liu, W. Sun, C. Liu, X. Zhang, S. Fan, and W. Wu, “Hff6d: Hierarchical feature fusion network for robust 6d object pose tracking,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7719–7731, 2022

  5. [5]

    Zero-1-to-3: Zero-shot one image to 3d object,

    R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick, “Zero-1-to-3: Zero-shot one image to 3d object,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9264–9275

  6. [6]

    Ominnocs: A unified NOCS dataset and model for 3D lifting of 2D objects,

    A. Krishnan, A. Kundu, K.-K. Maninis, J. Hays, and M. Brown, “Ominnocs: A unified NOCS dataset and model for 3D lifting of 2D objects,” inProceedings of the European Conference on Computer Vision (ECCV). Springer, 2024, pp. 127–145. 12

  7. [7]

    Ua-pose: Uncertainty-aware 6d object pose estimation and online object completion with partial references,

    M.-F. Li, X. Yang, F.-E. Wang, H. Basak, Y . Sun, S. Gayaka, M. Sun, and C.-H. Kuo, “Ua-pose: Uncertainty-aware 6d object pose estimation and online object completion with partial references,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1180–1189

  8. [8]

    Bdr6d: Bidirectional deep residual fusion network for 6d pose estimation,

    P. Liu, Q. Zhang, and J. Cheng, “Bdr6d: Bidirectional deep residual fusion network for 6d pose estimation,”IEEE Transactions on Automation Science and Engineering, vol. 21, no. 2, pp. 1793–1804, 2024

  9. [9]

    Real-time perception meets reactive motion generation,

    D. Kappler, F. Meier, J. Issac, J. Mainprice, C. G. Cifuentes, M. W¨uthrich, V . Berenz, S. Schaal, N. Ratliff, and J. Bohg, “Real-time perception meets reactive motion generation,”IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1864–1871, 2018

  10. [10]

    Catgrasp: Learning category-level task-relevant grasping in clutter from simulation,

    B. Wen, W. Lian, K. Bekris, and S. Schaal, “Catgrasp: Learning category-level task-relevant grasping in clutter from simulation,” in2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 6401–6408

  11. [11]

    Pose estimation for aug- mented reality: A hands-on survey,

    E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for aug- mented reality: A hands-on survey,”IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 12, pp. 2633–2651, 2016

  12. [12]

    Foundationpose: Unified 6d pose estimation and tracking of novel objects,

    B. Wen, W. Yang, J. Kautz, and S. Birchfield, “Foundationpose: Unified 6d pose estimation and tracking of novel objects,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17 868–17 879

  13. [13]

    Sam-6d: Segment anything model meets zero-shot 6d object pose estimation,

    J. Lin, L. Liu, D. Lu, and K. Jia, “Sam-6d: Segment anything model meets zero-shot 6d object pose estimation,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27 906– 27 916

  14. [14]

    Gigapose: Fast and robust novel object pose estimation via one correspondence,

    V . N. Nguyen, T. Groueix, M. Salzmann, and V . Lepetit, “Gigapose: Fast and robust novel object pose estimation via one correspondence,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9903–9913

  15. [15]

    Genflow: Generalizable recurrent flow for 6d pose refinement of novel objects,

    S. Moon, H. Son, D. Hur, and S. Kim, “Genflow: Generalizable recurrent flow for 6d pose refinement of novel objects,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 039–10 049

  16. [16]

    Zeropose: Cad-prompted zero-shot object 6d pose estimation in cluttered scenes,

    J. Chen, Z. Zhou, M. Sun, R. Zhao, L. Wu, T. Bao, and Z. He, “Zeropose: Cad-prompted zero-shot object 6d pose estimation in cluttered scenes,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1251–1264, 2025

  17. [17]

    Geo6d: Geometric-constraints-guided direct object 6d pose estimation network,

    J. Chen, M. Sun, Y . Zheng, T. Bao, Z. He, D. Li, G. Jin, Z. Rui, L. Wu, and X. Jiang, “Geo6d: Geometric-constraints-guided direct object 6d pose estimation network,”IEEE Transactions on Multimedia, vol. 27, pp. 5770–5783, 2025

  18. [18]

    Mh6d: Multi-hypothesis consistency learning for category-level 6-d object pose estimation,

    J. Liu, W. Sun, C. Liu, H. Yang, X. Zhang, and A. Mian, “Mh6d: Multi-hypothesis consistency learning for category-level 6-d object pose estimation,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 3, pp. 4820–4833, 2025

  19. [19]

    Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images,

    Y . Liu, Y . Wen, S. Peng, C. Lin, X. Long, T. Komura, and W. Wang, “Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images,” inProceedings of the European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.10776

  20. [20]

    Onepose: One-shot object pose estimation without cad models,

    J. Sun, Z. Wang, S. Zhang, X. He, H. Zhao, G. Zhang, and X. Zhou, “Onepose: One-shot object pose estimation without cad models,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 6815–6824

  21. [21]

    Bop challenge 2023 on detection, segmentation and pose estimation of seen and unseen rigid objects,

    T. Hodan, M. Sundermeyer, Y . Labb ´e, V . N. Nguyen, G. Wang, E. Brachmann, B. Drost, V . Lepetit, C. Rother, and J. Matas, “Bop challenge 2023 on detection, segmentation and pose estimation of seen and unseen rigid objects,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 5610– 5619

  22. [22]

    Bop challenge 2024 on model-based and model-free 6d object pose estimation.arXiv preprint arXiv:2504.02812, 2025

    V . N. Nguyen, S. Tyree, A. Guo, M. Fourmy, A. Gouda, T. Lee, S. Moon, H. Son, L. Ranftl, J. Tremblay, E. Brachmann, B. Drost, V . Lepetit, C. Rother, S. Birchfield, J. Matas, Y . Labbe, M. Sundermeyer, and T. Hodan, “Bop challenge 2024 on model-based and model-free 6d object pose estimation,”arXiv preprint, vol. arXiv:2504.02812, 2025

  23. [23]

    Co-op: Correspondence- based novel object pose estimation,

    S. Moon, H. Son, D. Hur, and S. Kim, “Co-op: Correspondence- based novel object pose estimation,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 11 622– 11 632

  24. [24]

    Accurate and efficient zero-shot 6d pose estimation with frozen foundation models,

    A. Caraffa, D. Boscaini, and F. Poiesi, “Accurate and efficient zero- shot 6d pose estimation with frozen foundation models,”arXiv preprint arXiv:2506.09784, 2025

  25. [25]

    Epos: Estimating 6d pose of objects with symmetries,

    T. Hodaˇn, D. Bar ´ath, and J. Matas, “Epos: Estimating 6d pose of objects with symmetries,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 700–11 709

  26. [26]

    So- pose: Exploiting self-occlusion for direct 6d pose estimation,

    Y . Di, F. Manhardt, G. Wang, X. Ji, N. Navab, and F. Tombari, “So- pose: Exploiting self-occlusion for direct 6d pose estimation,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12 376–12 385

  27. [27]

    Surfemb: Dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings,

    R. L. Haugaard and A. G. Buch, “Surfemb: Dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 6739–6748

  28. [28]

    Occlusion-aware self-supervised monocular 6d object pose estimation,

    G. Wang, F. Manhardt, X. Liu, X. Ji, and F. Tombari, “Occlusion-aware self-supervised monocular 6d object pose estimation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 3, pp. 1788– 1803, 2024

  29. [29]

    Matchu: Matching unseen objects for 6d pose estimation from rgb-d images,

    J. Huang, H. Yu, K.-T. Yu, N. Navab, S. Ilic, and B. Busam, “Matchu: Matching unseen objects for 6d pose estimation from rgb-d images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 10 095–10 105

  30. [30]

    Mask6d: Masked pose priors for 6d object pose estimation,

    Y . Xie, H. Jiang, and J. Xie, “Mask6d: Masked pose priors for 6d object pose estimation,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 3545–3549

  31. [31]

    E. P. ¨Ornek, Y . Labb´e, B. Tekin, L. Ma, C. Keskin, C. Forster, and T. Hodan,FoundPose: Unseen Object Pose Estimation with Foundation Features, 10 2024, pp. 163–182

  32. [32]

    Corr2distrib: Making ambiguous correspondences an ally to predict reliable 6d pose distributions,

    A. Brazi, B. Meden, F. M. de Chamisso, S. Bourgeois, and V . Lepetit, “Corr2distrib: Making ambiguous correspondences an ally to predict reliable 6d pose distributions,”IEEE Robotics and Automation Letters, vol. 10, no. 6, pp. 6440–6447, 2025

  33. [33]

    Epnp: An accurate o(n) solution to the pnp problem,

    V . Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,”International Journal of Computer Vision, vol. 81, 02 2009

  34. [34]

    Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,

    M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,”Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981

  35. [35]

    Deepim: Deep iterative matching for 6d pose estimation,

    Y . Li, G. Wang, X. Ji, Y . Xiang, and D. Fox, “Deepim: Deep iterative matching for 6d pose estimation,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 695–711. [Online]. Available: https://arxiv.org/abs/1804.00175

  36. [36]

    Cosypose: Consistent multi-view multi-object 6d pose estimation,

    Y . Labb ´e, J. Carpentier, M. Aubry, and J. Sivic, “Cosypose: Consistent multi-view multi-object 6d pose estimation,” inEuropean Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2008.08465

  37. [37]

    arXiv preprint arXiv:2212.06870 (2022)

    Y . Labb´e, L. Manuelli, A. Mousavian, S. Tyree, S. Birchfield, J. Tremblay, J. Carpentier, M. Aubry, D. Fox, and J. Sivic, “Megapose: 6d pose estimation of novel objects via render & compare,” inProceedings of the 6th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 205. PMLR, 2022, pp. 715–725, arXiv:2212.06870. [...

  38. [38]

    Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models,

    A. Caraffa, D. Boscaini, A. Hamza, and F. Poiesi, “Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models,” inEuropean Conference on Computer Vision (ECCV), 2024, pp. 414–431, arXiv:2312.00947. [Online]. Available: https://arxiv.org/abs/2312.00947

  39. [39]

    Onda-pose: Occlusion-aware neural domain adaptation for self-supervised 6d object pose estimation,

    T. Tan and Q. Dong, “Onda-pose: Occlusion-aware neural domain adaptation for self-supervised 6d object pose estimation,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 16 829–16 838

  40. [40]

    Loftr: Detector-free local feature matching with transformers,

    J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8918–8927

  41. [41]

    Fs6d: Few-shot 6d pose estimation of novel objects,

    Y . He, Y . Wang, H. Fan, J. Sun, and Q. Chen, “Fs6d: Few-shot 6d pose estimation of novel objects,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 6804–6814

  42. [42]

    Onepose++: Keypoint-free one-shot object pose estimation without cad models,

    X. He, J. Sun, Y . Wang, D. Huang, H. Bao, and X. Zhou, “Onepose++: Keypoint-free one-shot object pose estimation without cad models,” in Advances in Neural Information Processing Systems, 2022, pp. 35 103– 35 115

  43. [43]

    Open- vocabulary object 6d pose estimation,

    J. Corsetti, D. Boscaini, C. Oh, A. Cavallaro, and F. Poiesi, “Open- vocabulary object 6d pose estimation,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18 071– 18 080

  44. [44]

    Nope: Novel object pose estimation from a single image,

    V . N. Nguyen, T. Groueix, G. Ponimatkin, Y . Hu, R. Marlet, M. Salzmann, and V . Lepetit, “Nope: Novel object pose estimation from a single image,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17 923–17 932

  45. [45]

    One2any: One-reference 6d pose estimation for any object,

    M. Liu, S. Li, A. Chhatkuli, P. Truong, L. V . Gool, and F. Tombari, “One2any: One-reference 6d pose estimation for any object,” in2025 13 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 6457–6467

  46. [46]

    Any6d: Model-free 6d pose estimation of novel objects,

    T. Lee, B. Wen, M. Kang, G. Kang, I. S. Kweon, and K.-J. Yoon, “Any6d: Model-free 6d pose estimation of novel objects,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 11 633–11 643

  47. [47]

    Hippo: Harnessing image-to-3d priors for model-free zero-shot 6d pose estimation,

    Y . Liu, Z. Jiang, B. Xu, G. Wu, Y . Ren, T. Cao, B. Liu, R. H. Yang, A. Rasouli, and J. Shan, “Hippo: Harnessing image-to-3d priors for model-free zero-shot 6d pose estimation,”IEEE Robotics and Automation Letters, vol. 10, no. 8, pp. 8284–8291, 2025

  48. [48]

    ig-6dof: Model- free 6dof pose estimation for unseen object via iterative 3d gaussian splatting,

    T. Cao, F. Luo, J. Qin, Y . Jiang, Y . Wang, and C. Xiao, “ig-6dof: Model- free 6dof pose estimation for unseen object via iterative 3d gaussian splatting,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 6436–6446

  49. [49]

    Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration

    J. Liu, W. Sun, K. Zeng, J. Zheng, H. Yang, H. Rahmani, A. Mian, and L. Wang, “Novel object 6d pose estimation with a single reference view,” arXiv preprint arXiv:2503.05578, 2025

  50. [50]

    Axispose: Model-free matching-free single-shot 6d object pose estimation via axis generation,

    Y . Zou, Z. Qi, Y . Liu, W. Liu, Z. Xu, W. Sun, X. Li, J. Yang, and Y . Zhang, “Axispose: Model-free matching-free single-shot 6d object pose estimation via axis generation,”IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2026

  51. [51]

    Instantpose: Zero-shot instance-level 6d pose estimation from a single view,

    F. D. Felice, A. Remus, S. Gasperini, B. Busam, L. Ott, S. Thalhammer, F. Tombari, and C. A. Avizzano, “Instantpose: Zero-shot instance-level 6d pose estimation from a single view,”IEEE Robotics and Automation Letters, vol. 10, no. 6, pp. 6023–6030, 2025

  52. [52]

    One view, many worlds: Single-image to 3d object meets generative domain randomization for one-shot 6d pose estimation,

    Z. Geng, N. Wang, S. Xu, C. Ye, B. Li, Z. Chen, S. Peng, and H. Zhao, “One view, many worlds: Single-image to 3d object meets generative domain randomization for one-shot 6d pose estimation,” inProceedings of The 9th Conference on Robot Learning. PMLR, 2025, pp. 168–197

  53. [53]

    State estimation for robotics [bookshelf],

    L. Carlone, “State estimation for robotics [bookshelf],”IEEE Control Systems Magazine, vol. 39, no. 3, pp. 86–88, 2019

  54. [54]

    Soft rasterizer: A differentiable renderer for image-based 3d reasoning,

    S. Liu, W. Chen, T. Li, and H. Li, “Soft rasterizer: A differentiable renderer for image-based 3d reasoning,” in2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7707–7716

  55. [55]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    J. Xu, W. Cheng, Y . Gao, X. Wang, S. Gao, and Y . Shan, “Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models,”arXiv preprint arXiv:2404.07191, 2024

  56. [56]

    Neutex: Neural texture mapping for volumetric neural rendering,

    F. Xiang, Z. Xu, M. Ha ˇsan, Y . Hold-Geoffroy, K. Sunkavalli, and H. Su, “Neutex: Neural texture mapping for volumetric neural rendering,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 7115–7124

  57. [57]

    Neural feature fusion fields: 3d distillation of self-supervised 2d image representations,

    V . Tschernezki, I. Laina, D. Larlus, and A. Vedaldi, “Neural feature fusion fields: 3d distillation of self-supervised 2d image representations,” in 2022 International Conference on 3D Vision (3DV), 2022, pp. 443–453

  58. [58]

    Learning repre- sentations by back-propagating errors,

    D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre- sentations by back-propagating errors,”Nature, vol. 323, no. 6088, pp. 533–536, 1986

  59. [59]

    Cvt: Introducing convolutions to vision transformers,

    H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 22–31

  60. [60]

    Mobilevit: light- weight, general-purpose, and mobile-friendly vision trans- former.arXiv preprint arXiv:2110.02178, 2021

    S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” inInternational Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://arxiv.org/abs/2110.02178

  61. [61]

    Google scanned objects: A high-quality dataset of 3d scanned household objects,

    L. Downs, A. Francis, N. Conn, B. Khanna, F. Camp, S. Lee, K. Murphy, and J. Varley, “Google scanned objects: A high-quality dataset of 3d scanned household objects,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2552–2558

  62. [62]

    3d shapenets: A deep representation for volumetric shapes,

    Z. Wu, S. l. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920

  63. [63]

    Latentfusion: End-to- end differentiable reconstruction and rendering for unseen object pose estimation,

    K. Park, A. Mousavian, Y . Xiang, and D. Fox, “Latentfusion: End-to- end differentiable reconstruction and rendering for unseen object pose estimation,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10 707–10 716

  64. [64]

    Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,

    S. Hinterstoisser, S. Holzer, V . Lepetit, S. Ilicet al., “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” inComputer Vision – ACCV 2012, ser. Lecture Notes in Computer Science, vol. 7724. Springer, 2013, pp. 548–562

  65. [65]

    Bop: Benchmark for 6d object pose estimation,

    T. Hodaˇn, F. Michel, E. Brachmann, W. Kehlet al., “Bop: Benchmark for 6d object pose estimation,” inEuropean Conference on Computer Vision (ECCV), 2018, pp. 19–34, arXiv:1808.08319. [Online]. Available: https://arxiv.org/abs/1808.08319

  66. [66]

    Normalized object coordinate space for category-level 6d object pose and size estimation,

    H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6d object pose and size estimation,” in2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2637–2646

  67. [67]

    Mask r-cnn,

    K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988

  68. [68]

    Cnos: A strong baseline for cad-based novel object segmentation,

    V . N. Nguyen, T. Groueix, G. Ponimatkin, V . Lepetit, and T. Hodan, “Cnos: A strong baseline for cad-based novel object segmentation,” in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023, pp. 2126–2132

  69. [69]

    Wonder3d: Single image to 3d using cross-domain diffusion,

    X. Long, Y .-C. Guo, C. Lin, Y . Liu, Z. Dou, L. Liu, Y . Ma, S.-H. Zhang, M. Habermann, C. Theobalt, and W. Wang, “Wonder3d: Single image to 3d using cross-domain diffusion,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9970–9980

  70. [70]

    Posediffusion: A coarse-to-fine framework for unseen object 6-dof pose estimation,

    J. Zhou, Q. Zhu, Y . Wang, M. Feng, C. Wu, X. Liu, J. Huang, and A. Mian, “Posediffusion: A coarse-to-fine framework for unseen object 6-dof pose estimation,”IEEE Transactions on Industrial Informatics, vol. 20, no. 9, pp. 11 127–11 138, 2024

  71. [71]

    Relpose++: Recov- ering 6d poses from sparse-view observations,

    A. Lin, J. Y . Zhang, D. Ramanan, and S. Tulsiani, “Relpose++: Recov- ering 6d poses from sparse-view observations,” in2024 International Conference on 3D Vision (3DV), 2024, pp. 106–115

  72. [72]

    Objectmatch: Robust registration using canonical object correspondences,

    C. G ¨umeli, A. Dai, and M. Nießner, “Objectmatch: Robust registration using canonical object correspondences,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 13 082– 13 091