
arxiv: 2606.22076 · v2 · pith:OXWCWWOQnew · submitted 2026-06-20 · 💻 cs.CV

Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation

Pith reviewed 2026-06-26 12:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords 6D pose estimation · unseen objects · single reference view · vision foundation models · cross-view semantic interaction · correspondence learning · semantic priors · geometric consistency

The pith

Cross-view semantic interaction with two training constraints yields more reliable correspondences for single-reference unseen object 6D pose estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that exchanging semantic context early between query and reference vision foundation model tokens, rather than treating features independently per view, produces point descriptors with stronger joint semantic and geometric discriminability for correspondence-based pose recovery. This matters for single-reference unseen object pose estimation because novel objects arrive with only one reference image, and current pipelines still struggle to match features reliably in difficult view pairs or cluttered scenes. The method introduces cross-view semantic interaction to form a prior, then applies intra-view structure preservation and reference-anchored geometric consistency losses at training time to keep the prior usable for rigid 3D correspondence. Weighted SVD then recovers the final pose from the learned matches. Experiments across six benchmarks under varied view-pair protocols report state-of-the-art accuracy at comparable inference speed.

Core claim

Instead of processing VFM features independently per view, the correspondence pipeline is built around an early cross-view semantic prior formed by dense query-reference token interaction. Direct interaction can disturb original token structure and still requires 3D representation consistency, so the intra-view structure preservation loss keeps intra-view token affinity intact while the reference-anchored geometric consistency loss enforces spatial consistency of decoded point features. The resulting correspondences support weighted SVD pose recovery for arbitrary novel objects from a single reference view.
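To make the early-interaction idea concrete, here is a minimal NumPy sketch of one bidirectional cross-attention step between query and reference tokens. It is an illustrative stand-in, not the paper's CVSI module: the function name `cross_view_interaction`, the absence of learned projections, and the single residual update are all assumptions, since this summary does not specify the layer design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_interaction(xq, xr):
    """One bidirectional cross-attention step (hypothetical stand-in for CVSI).

    xq: (Nq, C) query-view tokens; xr: (Nr, C) reference-view tokens.
    No learned projections are used here; a real layer would add them.
    """
    scale = 1.0 / np.sqrt(xq.shape[1])
    # Each query token attends over all reference tokens, and vice versa.
    attn_qr = softmax(xq @ xr.T * scale, axis=1)  # (Nq, Nr)
    attn_rq = softmax(xr @ xq.T * scale, axis=1)  # (Nr, Nq)
    # Residual update: original token plus context gathered from the other view.
    yq = xq + attn_qr @ xr
    yr = xr + attn_rq @ xq
    return yq, yr, attn_qr
```

The residual form keeps each token's original content in the output, which is also why a structure-preserving constraint is a natural companion: the added cross-view context can still shift how tokens within one view relate to each other.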

What carries the argument

Cross-view semantic interaction (CVSI) that enables dense VFM tokens to exchange semantic context across views, regularized by intra-view structure preservation (IVSP) loss and reference-anchored geometric consistency (RAGC) loss to ensure reliability for rigid 3D correspondence.
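One plausible reading of "preserving intra-view token affinity" is to compare the token-to-token similarity matrix of a view before and after interaction. The sketch below does this with cosine affinities and a mean-squared difference. The formula and the name `ivsp_loss_sketch` are assumptions made for illustration; the exact IVSP loss is not given in this summary.

```python
import numpy as np

def cosine_affinity(x):
    """Pairwise cosine similarity between the rows of x, shape (N, N)."""
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    return x @ x.T

def ivsp_loss_sketch(x_orig, x_inter):
    """Hypothetical affinity-preservation penalty for one view.

    x_orig:  (N, C) tokens from the frozen VFM, before interaction.
    x_inter: (N, C) tokens of the same view after cross-view interaction.
    Returns the mean squared difference between the two affinity matrices.
    """
    a_orig = cosine_affinity(x_orig)
    a_inter = cosine_affinity(x_inter)
    return float(np.mean((a_orig - a_inter) ** 2))
```

Because the penalty acts on relative similarities rather than on the tokens themselves, the interacted tokens remain free to absorb cross-view context as long as the within-view similarity pattern stays close to the original.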

If this is right

  • Learned point features gain joint semantic and geometric discriminability that helps correspondence in challenging matching scenarios.
  • The approach reaches state-of-the-art results on six benchmarks under multiple view-pair settings.
  • Inference speed stays comparable to prior VFM-based correspondence pipelines.
  • A new view-pair evaluation protocol derived from BOP YCB-V and TUD-L datasets exposes robustness under difficult reference-query conditions.
  • Pose recovery proceeds directly from the improved correspondences via weighted SVD.
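The weighted SVD step in the last item is the standard closed-form solution for a rigid transform from weighted point correspondences (the weighted Kabsch procedure). A minimal NumPy sketch follows. The function name `weighted_svd_pose` is hypothetical, and how the paper derives its per-correspondence weights is not stated in this summary, so `w` is simply assumed to hold non-negative confidences.

```python
import numpy as np

def weighted_svd_pose(src, dst, w):
    """Find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) corresponding 3D points; w: (N,) non-negative weights.
    """
    w = w / w.sum()
    # Weighted centroids of both point sets.
    src_c = (w[:, None] * src).sum(axis=0)
    dst_c = (w[:, None] * dst).sum(axis=0)
    # Weighted 3x3 cross-covariance of the centered points.
    H = (src - src_c).T @ (w[:, None] * (dst - dst_c))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

The determinant check matters in practice: without it, noisy or near-planar correspondences can yield a reflection instead of a rotation.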

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-interaction pattern could be tested on multi-reference or video sequences to see whether the prior scales without additional losses.
  • If the constraints generalize, similar token-exchange mechanisms might improve other 3D tasks that currently use frozen VFM features independently per frame.
  • The method's reliance on a single reference view suggests it could lower the data-collection cost for deploying pose estimators on new objects in robotics settings.
  • A direct test on real-time streaming camera input with varying lighting would reveal whether the learned consistency holds beyond static benchmark pairs.

Load-bearing premise

The two training-time constraints suffice to keep the cross-view semantic prior reliable for rigid 3D correspondence without introducing new mismatches or harming original VFM token discriminability.
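This summary does not give the formula for the RAGC loss, so the following is purely an illustration of what a reference-anchored consistency term could look like. Query points are moved into the reference frame with the ground-truth pose, and decoded features of spatially coincident points are pulled together. The function name, the nearest-neighbour pairing, the `radius` threshold, and the pose convention (`R_gt`, `t_gt` map query-frame points to the reference frame) are all assumptions.

```python
import numpy as np

def _normalize(f):
    """L2-normalize each feature row."""
    return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)

def ragc_loss_sketch(pq, fq, pr, fr, R_gt, t_gt, radius=0.01):
    """Hypothetical geometric-consistency penalty anchored in the reference frame.

    pq: (Nq, 3) query points, fq: (Nq, C) their decoded features.
    pr: (Nr, 3) reference points, fr: (Nr, C) their decoded features.
    """
    # Express query points in the reference frame using the ground-truth pose.
    pq_in_ref = pq @ R_gt.T + t_gt
    # Squared distances between every query point and every reference point.
    d2 = ((pq_in_ref[:, None, :] - pr[None, :, :]) ** 2).sum(axis=-1)
    nn = d2.argmin(axis=1)
    valid = d2[np.arange(len(pq)), nn] <= radius ** 2
    if not valid.any():
        return 0.0
    # Penalize low cosine similarity between features of coincident points.
    cos = (_normalize(fq[valid]) * _normalize(fr[nn[valid]])).sum(axis=1)
    return float(np.mean(1.0 - cos))
```

A term of this shape only constrains pairs that truly overlap in 3D, which is one way such a loss could avoid forcing agreement between regions visible in only one view.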

What would settle it

An ablation experiment in which removing either the IVSP or RAGC loss causes accuracy to fall to or below the independent-feature baseline on the same view-pair protocol.

Figures

Figures reproduced from arXiv: 2606.22076 by Banglei Guan, Jiahong Chen, Jinghao Wang, Qifeng Yu, Zi Wang, Ziwen Wang.

Figure 1: Comparison with existing correspondence-based methods.
Figure 2: A sample with large viewpoint changes. Although the overlapping regions are very small, our method still achieves accurate pose estimation. Blue and green contours denote GT and estimated poses, respectively.
Figure 3: Overview of the proposed pipeline. Given query and reference RGB-D observations after mask filtering, denoted by [I q | D q] and [I r | D r], we first extract image tokens X q and X r with a VFM [39, 40]. The sampled point clouds are then transformed into a global reference frame (GRF) [28] and processed by a geometric encoder [41] to produce geometric features G q and G r. Meanwhile, the image tokens e…
Figure 4: Illustration of cross-view semantic interaction.
Figure 5: Qualitative comparison on six datasets. We visualize the pose estimation results of UNOPose [28], SinRef-6D [30], and our method. Blue and green contours denote GT and estimated poses, respectively. For clearer visualization, we also show the depth error heatmap of each detected object with respect to the ground-truth pose, namely the distance between each 3D point in the ground-truth depth map and its tra…
Figure 6: Qualitative comparison under the challenging view pair…
Figure 7: Visualization of the attention maps in cross-view interac…
Figure 8: Visualization of correspondence estimation.
Figure 9: Effectiveness of IVSP loss. Without IVSP, cross-view interaction over-smooths the intra-view similarity structure, weakening local part and boundary contrast. With IVSP, the interacted features preserve sharper relative similarity patterns inherited from DINO features. Best viewed when zoomed in.
Figure 10: Effect of the reference viewpoint gap on YCB-V [… 
Figure 11: Failure cases under challenging scenes across different…
Original abstract

Single-reference unseen object 6D pose estimation reduces object onboarding by estimating poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines have achieved robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, leaving dense visual-semantic cues, including appearance, structure, and context, insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, making correspondence estimation still difficult in challenging cases. Instead of processing features independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) enables dense query and reference VFM tokens to exchange semantic context and form a cross-view prior. Nevertheless, direct CVSI may disturb the VFM token structure, while the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make this CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of decoded point features. The final pose is recovered from learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while maintaining comparable inference speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that single-reference unseen object 6D pose estimation improves when the correspondence pipeline is built around an early cross-view semantic interaction (CVSI) module that lets dense VFM tokens from the query and reference views exchange semantic context. Two training-time constraints keep the resulting prior reliable for rigid 3D correspondence: the IVSP loss preserves intra-view token affinity, and the RAGC loss enforces reference-anchored spatial consistency of decoded point features. Final poses are recovered by weighted SVD. The authors also construct a challenging view-pair protocol from YCB-V and TUD-L, and report state-of-the-art performance at comparable inference speed on six benchmarks under varied view-pair settings.

Significance. If the reported results hold, the work would demonstrate a practical way to inject cross-view semantic context into VFM-based correspondence pipelines while mitigating the two risks (disturbance of token structure and lack of 3D consistency) explicitly flagged in the abstract; the construction of a new view-pair protocol from BOP datasets is a concrete contribution that could aid future robustness evaluations.

major comments (2)
  1. [Method description (paragraph after CVSI definition)] The paragraph beginning 'To make this CVSI prior reliable...' asserts that IVSP and RAGC together suffice to prevent new mismatches and preserve VFM token discriminability, yet the supplied text contains no quantitative ablation isolating their individual effects or failure cases where the constraints are insufficient; because the SOTA claim rests directly on this sufficiency, the absence of such evidence is load-bearing for the central empirical argument.
  2. [Abstract and Experiments section] The abstract states that 'extensive experiments on six benchmarks ... show that our method achieves state-of-the-art performance' but supplies neither the numerical margins, per-benchmark tables, nor error-analysis breakdowns; without these data the central claim cannot be verified and the soundness assessment remains provisional.
minor comments (2)
  1. [Abstract] The abstract is unusually long; condensing the motivation and results paragraphs would improve readability while retaining all technical claims.
  2. [Pose recovery paragraph] Notation for the weighted SVD step is introduced without an explicit equation reference; adding an equation label would clarify how the learned correspondences are converted to pose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our contributions.

Point-by-point responses
  1. Referee: [Method description (paragraph after CVSI definition)] The paragraph beginning 'To make this CVSI prior reliable...' asserts that IVSP and RAGC together suffice to prevent new mismatches and preserve VFM token discriminability, yet the supplied text contains no quantitative ablation isolating their individual effects or failure cases where the constraints are insufficient; because the SOTA claim rests directly on this sufficiency, the absence of such evidence is load-bearing for the central empirical argument.

    Authors: We agree with the referee that the method description would be improved by including quantitative evidence for the sufficiency of IVSP and RAGC. We will add ablation studies isolating their individual effects and discuss failure cases in the revised manuscript. This will support the central empirical argument more robustly. revision: yes

  2. Referee: [Abstract and Experiments section] The abstract states that 'extensive experiments on six benchmarks ... show that our method achieves state-of-the-art performance' but supplies neither the numerical margins, per-benchmark tables, nor error-analysis breakdowns; without these data the central claim cannot be verified and the soundness assessment remains provisional.

    Authors: We agree that providing numerical margins and breakdowns would strengthen the abstract and experiments section. We will revise the abstract to include specific performance numbers and margins, and expand the experiments section with per-benchmark tables and error-analysis breakdowns to allow verification of the SOTA claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines CVSI interaction plus two explicit training constraints (IVSP loss preserving intra-view affinity; RAGC loss enforcing reference-anchored geometric consistency) as new modules whose effectiveness is measured on external BOP benchmarks. No equations or steps reduce by construction to fitted inputs, self-citations, or renamed prior results. The central claim rests on the stated constraints being effective, which is an empirical question evaluated outside the derivation itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 3 invented entities

Ledger constructed from components explicitly named in the abstract only; no numerical hyperparameters or external benchmarks are provided.

free parameters (1)
  • weights of IVSP and RAGC losses
    Standard in loss-based training; values are not stated in the abstract but must be chosen to balance the constraints.
axioms (2)
  • domain assumption: VFM tokens contain dense visual-semantic cues (appearance, structure, context) that benefit from cross-view exchange before geometric decoding
    Invoked when describing why intra-view processing is insufficient.
  • domain assumption: Rigid correspondence requires 3D representation consistency of decoded point features
    Stated as the reason the semantic prior still needs additional constraints.
invented entities (3)
  • Cross-View Semantic Interaction (CVSI) · no independent evidence
    purpose: Enables dense query and reference VFM tokens to exchange semantic context and form a cross-view prior
    New interaction module introduced to address the stated limitation of prior pipelines.
  • Intra-View Structure Preservation (IVSP) loss · no independent evidence
    purpose: Preserves original intra-view token affinity structure during interaction
    New training constraint introduced to counteract disturbance from CVSI.
  • Reference-Anchored Geometric Consistency (RAGC) loss · no independent evidence
    purpose: Enforces spatial representation consistency of decoded point features
    New training constraint introduced to ensure reliability for rigid correspondence.
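
The single free-parameter entry in the ledger can be written out as a weighted training objective. This is only a sketch: the symbols \mathcal{L}_{\text{corr}} (a generic correspondence loss), \lambda_{\text{IVSP}}, and \lambda_{\text{RAGC}} are hypothetical names, and the abstract gives neither the exact form of the objective nor the weight values.

```latex
\[
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{corr}}
  + \lambda_{\text{IVSP}}\,\mathcal{L}_{\text{IVSP}}
  + \lambda_{\text{RAGC}}\,\mathcal{L}_{\text{RAGC}}
\]
```

Under this reading, the ablation proposed earlier corresponds to setting one of the two weights to zero while leaving the rest of the objective unchanged.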



Reference graph

Works this paper leans on

77 extracted references · 4 linked inside Pith

  1. [1]

    A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms

    Oliver Kroemer, Scott Niekum, and George Konidaris. “A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms”. In: JMLR 22.30 (2021), pp. 1–82

  2. [2]

    Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

    Guoguang Du et al. “Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review”. In: Artificial Intelligence Review 54.3 (2021), pp. 1677–1734

  3. [3]

    Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud

    Jianwei Guo et al. “Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud”. In: IEEE Transactions on Image Processing 30 (2021), pp. 5072–5084

  4. [4]

    Domain-Translated 3D Object Pose Estimation

    Christos Papaioannidis, Vasileios Mygdalis, and Ioannis Pitas. “Domain-Translated 3D Object Pose Estimation”. In: IEEE Transactions on Image Processing 29 (2020), pp. 9279–9291

  5. [5]

    Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image

    Yinyu Nie et al. “Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image”. In: CVPR. 2020, pp. 55–64

  6. [6]

    Cooperative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation

    Siyuan Huang et al. “Cooperative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation”. In: NeurIPS. 2018

  7. [7]

    Pose estimation for augmented reality: a hands-on survey

    Eric Marchand, Hideaki Uchiyama, and Fabien Spindler. “Pose estimation for augmented reality: a hands-on survey”. In: IEEE Transactions on Visualization and Computer Graphics 22.12 (2015), pp. 2633–2651

  8. [8]

    Deep multi-state object pose estimation for augmented reality assembly

    Yongzhi Su et al. “Deep multi-state object pose estimation for augmented reality assembly”. In: IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). 2019, pp. 222–227

  9. [9]

    PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

    Yu Xiang et al. “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes”. In: RSS. 2018

  10. [10]

    Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation

    Yongzhi Su et al. “Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation”. In: CVPR. 2022, pp. 6738–6748

  11. [11]

    PVNet: Pixel-Wise Voting Network for 6DoF Object Pose Estimation

    Sida Peng et al. “PVNet: Pixel-Wise Voting Network for 6DoF Object Pose Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 44.6 (2022), pp. 3212–3223

  12. [12]

    Resolving Symmetry Ambiguity in Correspondence-Based Methods for Instance-Level Object Pose Estimation

    Yongliang Lin et al. “Resolving Symmetry Ambiguity in Correspondence-Based Methods for Instance-Level Object Pose Estimation”. In: IEEE Transactions on Image Processing 34 (2025), pp. 1700–1711

  13. [13]

    Gdrnpp: A geometry-guided and fully learning-based object pose estimator

    Xingyu Liu et al. “Gdrnpp: A geometry-guided and fully learning-based object pose estimator”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  14. [14]

    Line-Based 6-DoF Object Pose Estimation and Tracking With an Event Camera

    Zibin Liu et al. “Line-Based 6-DoF Object Pose Estimation and Tracking With an Event Camera”. In: IEEE Transactions on Image Processing 33 (2024), pp. 4765–4780

  15. [15]

    Normalized object coordinate space for category-level 6D object pose and size estimation

    He Wang et al. “Normalized object coordinate space for category-level 6D object pose and size estimation”. In: CVPR. 2019, pp. 2642–2651. SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 13

  16. [16]

    6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning

    Lu Zou et al. “6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning”. In: IEEE Transactions on Image Processing 31 (2022), pp. 6907–6921

  17. [17]

    Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation from Point Clouds

    Xiaolong Li et al. “Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation from Point Clouds”. In: NeurIPS. 2021

  18. [18]

    Secondpose: Se(3)-consistent dual-stream feature fusion for category-level pose estimation

    Yamei Chen et al. “Secondpose: Se(3)-consistent dual-stream feature fusion for category-level pose estimation”. In: CVPR. 2024, pp. 9959–9969

  19. [19]

    ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

    Huan Ren et al. “ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation”. In: CVPR. 2026, pp. 14315–14324

  20. [20]

    Deep learning-based object pose estimation: A comprehensive survey

    Jian Liu et al. “Deep learning-based object pose estimation: A comprehensive survey”. In: arXiv preprint arXiv:2405.07801 (2024)

  21. [21]

    Challenges for monocular 6-d object pose estimation in robotics

    Stefan Thalhammer et al. “Challenges for monocular 6-d object pose estimation in robotics”. In: IEEE Transactions on Robotics 40 (2024), pp. 4065–4084

  22. [22]

    Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation

    Keunhong Park et al. “Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation”. In: CVPR. 2020, pp. 10710–10719

  23. [23]

    MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

    Yann Labbé et al. “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare”. In: CoRL. PMLR. 2023, pp. 715–725

  24. [24]

    FoundPose: Unseen Object Pose Estimation with Foundation Features

    Evin Pınar Örnek et al. “FoundPose: Unseen Object Pose Estimation with Foundation Features”. In: CVPR. 2024

  25. [25]

    FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

    Bowen Wen et al. “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects”. In: CVPR. 2024

  26. [26]

    Sam-6d: Segment anything model meets zero-shot 6d object pose estimation

    Jiehong Lin et al. “Sam-6d: Segment anything model meets zero-shot 6d object pose estimation”. In: CVPR. 2024

  27. [27]

    PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting

    Wubin Shi, Shaoyan Gai, and Feipeng Da. “PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting”. In: CVPR. 2026, pp. 4698–4707

  28. [28]

    UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image

    Xingyu Liu et al. “UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image”. In: CVPR. June 2025, pp. 22023–22034

  29. [29]

    COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

    Yuchen Che et al. “COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation”. In: CVPR. 2026, pp. 11567–11578

  30. [30]

    Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration

    Jian Liu et al. “Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration”. In: IEEE Transactions on Robotics 42 (2026), pp. 1884–1901

  31. [31]

    Nope: Novel object pose estimation from a single image

    Van Nguyen Nguyen et al. “Nope: Novel object pose estimation from a single image”. In: CVPR. 2024, pp. 17923–17932

  32. [32]

    Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference

    Zhiwen Fan et al. “Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference”. In: CVPR Workshops. 2024

  33. [33]

    Open-vocabulary object 6D pose estimation

    Jaime Corsetti et al. “Open-vocabulary object 6D pose estimation”. In: CVPR. 2024

  34. [34]

    High-Resolution Open-Vocabulary Object 6D Pose Estimation

    Jaime Corsetti et al. “High-Resolution Open-Vocabulary Object 6D Pose Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 48.2 (2026), pp. 2066–2077

  35. [35]

    One2Any: One-Reference 6D Pose Estimation for Any Object

    Mengya Liu et al. “One2Any: One-Reference 6D Pose Estimation for Any Object”. In: CVPR. 2025, pp. 6457–6467

  36. [36]

    Any6D: Model-free 6D pose estimation of novel objects

    Taeyeop Lee et al. “Any6D: Model-free 6D pose estimation of novel objects”. In: CVPR. 2025, pp. 11633–11643

  37. [37]

    CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

    Dexin Zuo et al. “CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation”. In: AAAI. Vol. 40. 16. 2026, pp. 14122–14130

  38. [38]

    ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

    Liming Kuang et al. “ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors”. In: CVPR. 2026, pp. 26582–26592

  39. [39]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab et al. “Dinov2: Learning robust visual features without supervision”. In: arXiv preprint arXiv:2304.07193 (2023)

  40. [40]

    Dinov3

    Oriane Siméoni et al. “Dinov3”. In: arXiv preprint arXiv:2508.10104 (2025)

  41. [41]

    Geotransformer: Fast and robust point cloud registration with geometric transformer

    Zheng Qin et al. “Geotransformer: Fast and robust point cloud registration with geometric transformer”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)

  42. [42]

    BOP: Benchmark for 6D Object Pose Estimation

    Tomas Hodan et al. “BOP: Benchmark for 6D Object Pose Estimation”. In: ECCV. 2018, pp. 19–34

  43. [43]

    Learning 6D object pose estimation using 3D object coordinates

    Eric Brachmann et al. “Learning 6D object pose estimation using 3D object coordinates”. In: ECCV. 2014

  44. [44]

    Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes

    Stefan Hinterstoisser et al. “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes”. In: ACCV. Springer. 2012, pp. 548–562

  45. [45]

    GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence

    Van Nguyen Nguyen et al. “GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence”. In: CVPR. 2024

  46. [46]

    Segment anything

    Alexander Kirillov et al. “Segment anything”. In: ICCV. 2023, pp. 4015–4026

  47. [47]

    Onepose: One-shot object pose estimation without cad models

    Jiaming Sun et al. “Onepose: One-shot object pose estimation without cad models”. In: CVPR. 2022, pp. 6825–6834

  48. [48]

    Onepose++: Keypoint-free one-shot object pose estimation without CAD models

    Xingyi He et al. “Onepose++: Keypoint-free one-shot object pose estimation without CAD models”. In: NeurIPS. 2022, pp. 35103–35115

  49. [49]

    Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images

    Yuan Liu et al. “Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images”. In: ECCV. 2022

  50. [50]

    Fs6d: Few-shot 6d pose estimation of novel objects

    Yisheng He et al. “Fs6d: Few-shot 6d pose estimation of novel objects”. In: CVPR. 2022, pp. 6814–6824

  51. [51]

    A Method for Registration of 3-D Shapes

    Paul J. Besl and Neil D. McKay. “A Method for Registration of 3-D Shapes”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 14.2 (1992), pp. 239–256

  52. [52]

    Efficient variants of the ICP algorithm

    Szymon Rusinkiewicz and Marc Levoy. “Efficient variants of the ICP algorithm”. In: Proceedings third international conference on 3-D digital imaging and modeling. IEEE. 2001, pp. 145–152

  53. [53]

    Fast point feature histograms (FPFH) for 3D registration

    Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. “Fast point feature histograms (FPFH) for 3D registration”. In: ICRA. IEEE. 2009, pp. 3212–3217

  54. [54]

    Model globally, match locally: Efficient and robust 3D object recognition

    Bertram Drost et al. “Model globally, match locally: Efficient and robust 3D object recognition”. In: CVPR. IEEE. 2010, pp. 998–1005

  55. [55]

    Predator: Registration of 3d point clouds with low overlap

    Shengyu Huang et al. “Predator: Registration of 3d point clouds with low overlap”. In: CVPR. 2021, pp. 4267–4276

  56. [56]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun et al. “LoFTR: Detector-free local feature matching with transformers”. In: CVPR. 2021, pp. 8922–8931

  57. [57]

    O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

    Lorenzo Mur-Labadia et al. “O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views”. In: ICCV. 2025

  58. [58]

    V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

    Jiancheng Pan et al. “V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence”. In: arXiv preprint arXiv:2511.20886 (2025)

  59. [59]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang et al. “Vggt: Visual geometry grounded transformer”. In: CVPR. 2025, pp. 5294–5306

  60. [60]

    PointNet: Deep learning on point sets for 3D classification and segmentation

    Charles R Qi et al. “PointNet: Deep learning on point sets for 3D classification and segmentation”. In: CVPR. 2017

  61. [61]

    Unsupervised semantic segmentation by distilling feature correspondences

    Mark Hamilton et al. “Unsupervised semantic segmentation by distilling feature correspondences”. In: arXiv preprint arXiv:2203.08414 (2022)

  62. [62]

    Representation learning with contrastive predictive coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation learning with contrastive predictive coding”. In: arXiv preprint arXiv:1807.03748 (2018)

  63. [63]

    PyTorch: An Imperative Style, High-performance Deep Learning Library

    Adam Paszke et al. “PyTorch: An Imperative Style, High-performance Deep Learning Library”. In: NeurIPS. 2019, pp. 8026–8037

  64. [64]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: ICLR. 2021

  65. [65]

    BOP Challenge 2023 on Detection, Segmentation and Pose Estimation of Seen and Unseen Rigid Objects

    Tomas Hodan et al. “BOP Challenge 2023 on Detection, Segmentation and Pose Estimation of Seen and Unseen Rigid Objects”. In: CVPR Workshops. 2024

  66. [66]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: ICLR. Ed. by Yoshua Bengio and Yann LeCun. 2015

  67. [67]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In: ICLR. 2017

  68. [68]

    On evaluation of 6D object pose estimation

    Tomáš Hodaň, Jiří Matas, and Štěpán Obdržálek. “On evaluation of 6D object pose estimation”. In: ECCV. Springer. 2016, pp. 606–619

  69. [69]

    3D Registration with Maximal Cliques

    Xiyu Zhang et al. “3D Registration with Maximal Cliques”. In: CVPR. 2023, pp. 17745–17754

  70. [70]

    Fully convolutional geometric features

    Christopher Choy, Jaesik Park, and Vladlen Koltun. “Fully convolutional geometric features”. In: ICCV. 2019, pp. 8958–8966

  71. [71]

    UTOPIC: Uncertainty-aware Overlap Prediction Network for Partial Point Cloud Registration

    Zhilei Chen et al. “UTOPIC: Uncertainty-aware Overlap Prediction Network for Partial Point Cloud Registration”. In: Computer Graphics Forum 41 (2022), pp. 87–98

  72. [72]

    Learning general and distinctive 3D local deep descriptors for point cloud registration

    Fabio Poiesi and Davide Boscaini. “Learning general and distinctive 3D local deep descriptors for point cloud registration”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 45.3 (2022), pp. 3979–3985

  73. [73]

    FreeZe: Training-free Zero-shot 6D Pose Estimation with Geometric and Vision Foundation Models

    Andrea Caraffa et al. “FreeZe: Training-free Zero-shot 6D Pose Estimation with Geometric and Vision Foundation Models”. In: ECCV. 2024

  74. [74]

    Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment

    Jianyuan Wang, Christian Rupprecht, and David Novotny. “Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment”. In: ICCV. 2023, pp. 9773–9783

  75. [75]

    Relpose++: Recovering 6d poses from sparse-view observations

    Amy Lin et al. “Relpose++: Recovering 6d poses from sparse-view observations”. In: 3DV. 2024

  76. [76]

    Object recognition from local scale-invariant features

    David G Lowe. “Object recognition from local scale-invariant features”. In: ICCV. Vol. 2. IEEE. 1999, pp. 1150–1157

  77. [77]

    Objectmatch: Robust registration using canonical object correspondences

    Can Gümeli, Angela Dai, and Matthias Nießner. “Objectmatch: Robust registration using canonical object correspondences”. In: CVPR. 2023, pp. 13082–13091