Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation

Banglei Guan; Jiahong Chen; Jinghao Wang; Qifeng Yu; Zi Wang; Ziwen Wang

arxiv: 2606.22076 · v2 · pith:OXWCWWOQ · submitted 2026-06-20 · 💻 cs.CV

Jiahong Chen, Jinghao Wang, Ziwen Wang, Zi Wang, Banglei Guan, Qifeng Yu

Pith reviewed 2026-06-26 12:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords 6D pose estimation · unseen objects · single reference view · vision foundation models · cross-view semantic interaction · correspondence learning · semantic priors · geometric consistency

The pith

Cross-view semantic interaction with two training constraints yields more reliable correspondences for single-reference unseen object 6D pose estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that exchanging semantic context early between query and reference vision foundation model tokens, rather than treating features independently per view, produces point descriptors with stronger joint semantic and geometric discriminability for correspondence-based pose recovery. This matters for single-reference unseen object pose estimation because novel objects arrive with only one reference image, and current pipelines still struggle to match features reliably in difficult view pairs or cluttered scenes. The method introduces cross-view semantic interaction to form a prior, then applies intra-view structure preservation and reference-anchored geometric consistency losses at training time to keep the prior usable for rigid 3D correspondence. Weighted SVD then recovers the final pose from the learned matches. Experiments across six benchmarks under varied view-pair protocols report state-of-the-art accuracy at comparable inference speed.

Core claim

Instead of processing VFM features independently per view, the correspondence pipeline is built around an early cross-view semantic prior formed by dense query-reference token interaction. Direct interaction can disturb original token structure and still requires 3D representation consistency, so the intra-view structure preservation loss keeps intra-view token affinity intact while the reference-anchored geometric consistency loss enforces spatial consistency of decoded point features. The resulting correspondences support weighted SVD pose recovery for arbitrary novel objects from a single reference view.
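To make the idea of preserving intra-view token affinity concrete, here is a minimal NumPy sketch. The exact form of the IVSP loss is not given in this summary, so the cosine-affinity matrix, the mean-squared penalty, and the names `cosine_affinity` and `affinity_preservation_loss` are all assumptions made for illustration.

```python
import numpy as np

def cosine_affinity(tokens):
    # Pairwise cosine similarity between tokens of one view: (N, C) -> (N, N).
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T

def affinity_preservation_loss(tokens_before, tokens_after):
    """Hypothetical IVSP-style penalty (assumed form, not the paper's):
    mean squared difference between the intra-view affinity matrices
    computed before and after cross-view interaction."""
    diff = cosine_affinity(tokens_after) - cosine_affinity(tokens_before)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))                      # stand-in for VFM tokens

loss_same = affinity_preservation_loss(x, x)          # structure untouched
loss_scaled = affinity_preservation_loss(x, 3.0 * x)  # cosine ignores scale
# Collapse every token to the view mean to mimic over-smoothing.
smoothed = np.tile(x.mean(axis=0), (6, 1))
loss_smoothed = affinity_preservation_loss(x, smoothed)
```

Under this assumed form, the penalty stays near zero when relative similarities are unchanged and grows when interaction flattens the tokens toward each other, which is the over-smoothing behavior the Figure 9 caption describes.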

What carries the argument

Cross-view semantic interaction (CVSI) that enables dense VFM tokens to exchange semantic context across views, regularized by intra-view structure preservation (IVSP) loss and reference-anchored geometric consistency (RAGC) loss to ensure reliability for rigid 3D correspondence.
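As a rough illustration of what dense query-reference token exchange could look like, the sketch below runs one round of bidirectional cross-attention between two token sets in NumPy. The actual CVSI architecture is not specified in this summary; the function name `cross_view_interaction`, the single-head residual design, and the absence of learned projections are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_interaction(x_q, x_r):
    """One hypothetical round of bidirectional cross-attention.

    x_q: (Nq, C) query-view tokens; x_r: (Nr, C) reference-view tokens.
    Returns updated tokens with the same shapes. No learned weights are
    used here; a real module would add projections, heads, and norms.
    """
    scale = 1.0 / np.sqrt(x_q.shape[1])
    attn_q2r = softmax(x_q @ x_r.T * scale, axis=-1)  # (Nq, Nr)
    attn_r2q = softmax(x_r @ x_q.T * scale, axis=-1)  # (Nr, Nq)
    y_q = x_q + attn_q2r @ x_r  # query tokens gather reference context
    y_r = x_r + attn_r2q @ x_q  # reference tokens gather query context
    return y_q, y_r

rng = np.random.default_rng(0)
x_q = rng.normal(size=(5, 8))   # stand-in for query VFM tokens
x_r = rng.normal(size=(7, 8))   # stand-in for reference VFM tokens
y_q, y_r = cross_view_interaction(x_q, x_r)
```

The residual form keeps each view's original tokens in the output, so any drift introduced by the exchange comes only from the attention term.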

If this is right

Learned point features gain joint semantic and geometric discriminability that helps correspondence in challenging matching scenarios.
The approach reaches state-of-the-art results on six benchmarks under multiple view-pair settings.
Inference speed stays comparable to prior VFM-based correspondence pipelines.
A new view-pair evaluation protocol derived from BOP YCB-V and TUD-L datasets exposes robustness under difficult reference-query conditions.
Pose recovery proceeds directly from the improved correspondences via weighted SVD.
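The weighted SVD step named above is the standard weighted Kabsch (Procrustes) solution, sketched here in NumPy. How the paper produces the correspondence weights and which view serves as the source are not stated in this summary, so the inputs `src`, `dst`, and `w` and the function name `weighted_svd_pose` are illustrative assumptions.

```python
import numpy as np

def weighted_svd_pose(src, dst, w):
    """Weighted Kabsch: find R, t minimizing
    sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) matched 3D points; w: (N,) non-negative weights.
    """
    w = w / w.sum()
    mu_src = w @ src                      # weighted centroids
    mu_dst = w @ dst
    src_c = src - mu_src
    dst_c = dst - mu_dst
    H = (src_c * w[:, None]).T @ dst_c    # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t

# Noise-free synthetic check with a known rotation about the z-axis.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
dst = src @ R_true.T + t_true
w = rng.uniform(0.1, 1.0, size=50)        # stand-in confidence weights
R_est, t_est = weighted_svd_pose(src, dst, w)
```

Because the solution is closed-form, pose quality depends entirely on the matches and weights fed into it, which is why the summary places the burden on the learned correspondences.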

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same early-interaction pattern could be tested on multi-reference setups or video sequences to see whether the prior scales without additional losses.
If the constraints generalize, similar token-exchange mechanisms might improve other 3D tasks that currently use frozen VFM features independently per frame.
The method's reliance on a single reference view suggests it could lower the data-collection cost for deploying pose estimators on new objects in robotics settings.
A direct test on real-time streaming camera input with varying lighting would reveal whether the learned consistency holds beyond static benchmark pairs.

Load-bearing premise

The two training-time constraints suffice to keep the cross-view semantic prior reliable for rigid 3D correspondence without introducing new mismatches or harming original VFM token discriminability.

What would settle it

An ablation experiment in which removing either the IVSP or RAGC loss causes accuracy to fall to or below the independent-feature baseline on the same view-pair protocol.

Figures

Figures reproduced from arXiv: 2606.22076 by Banglei Guan, Jiahong Chen, Jinghao Wang, Qifeng Yu, Zi Wang, Ziwen Wang.

**Figure 2.** A sample with large viewpoint changes. Although the overlapping regions are very small, our method still achieves accurate pose estimation. Blue and green contours denote GT and estimated poses, respectively. Appearance, part structure, and contextual relations encoded by VFM tokens are therefore weakened when they are used only as independent view-wise descriptors. As a result, the decoded point features …

**Figure 3.** Overview of the proposed pipeline. Given query and reference RGB-D observations after mask filtering, denoted by [I_q | D_q] and [I_r | D_r], we first extract image tokens X_q and X_r with a VFM [39, 40]. The sampled point clouds are then transformed into a global reference frame (GRF) [28] and processed by a geometric encoder [41] to produce geometric features G_q and G_r. Meanwhile, the image tokens e…

**Figure 4.** Illustration of cross-view semantic interaction.

**Figure 5.** Qualitative comparison on six datasets. We visualize the pose estimation results of UNOPose [28], SinRef-6D [30], and our method. Blue and green contours denote GT and estimated poses, respectively. For clearer visualization, we also show the depth error heatmap of each detected object with respect to the ground-truth pose, namely the distance between each 3D point in the ground-truth depth map and its tra…

**Figure 6.** Qualitative comparison under the challenging view pair…

**Figure 7.** Visualization of the attention maps in cross-view interac…

**Figure 8.** Visualization of correspondence estimation.

**Figure 9.** Effectiveness of IVSP loss. Without IVSP, cross-view interaction over-smooths the intra-view similarity structure, weakening local part and boundary contrast. With IVSP, the interacted features preserve sharper relative similarity patterns inherited from DINO features. Best viewed when zoomed in. Number of Cross-View Semantic Interaction Layers. Table VII studies the effect of the number of CVSI layers. Re…

**Figure 10.** Effect of the reference viewpoint gap on YCB-V […

**Figure 11.** Failure cases under challenging scenes across different…

Original abstract

Single-reference unseen object 6D pose estimation reduces object onboarding by estimating poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines have achieved robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, leaving dense visual-semantic cues, including appearance, structure, and context, insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, making correspondence estimation still difficult in challenging cases. Instead of processing features independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) enables dense query and reference VFM tokens to exchange semantic context and form a cross-view prior. Nevertheless, direct CVSI may disturb the VFM token structure, while the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make this CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of decoded point features. The final pose is recovered from learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while maintaining comparable inference speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds an early cross-view semantic interaction module plus two stabilizing losses to VFM-based single-reference pose estimation and reports SOTA numbers on six benchmarks under a new view-pair protocol.

The main point is that they move the cross-view exchange earlier in the pipeline for single-reference unseen 6D pose, using CVSI on VFM tokens and then IVSP to hold onto original intra-view affinities plus RAGC to keep decoded points geometrically consistent with the reference. The pose still comes out of weighted SVD. They also release a harder view-pair split from YCB-V and TUD-L.

This is a straightforward engineering step that directly tackles the two problems they name: disturbed token structure and missing 3D consistency. The losses are simple and targeted, and the view-pair protocol is a useful addition for testing real matching difficulty. If the numbers hold up in the full experiments, the method gives a measurable lift while keeping inference speed comparable.

The soft spot is the usual one for this style of paper: the gains rest on the two losses actually delivering reliable priors without side effects, and the abstract gives no ablations or per-scenario breakdowns. The full paper presumably supplies those, but the claim is only as strong as the tables.

This is for groups already working on correspondence-based pose with foundation models who need to drop the reference count to one. A reader who follows BOP-style benchmarks will find the protocol and the pipeline details worth looking at. It is solid enough to go to a serious referee; the approach is incremental but the evaluation setup is honest and the fixes are explicit.

Referee Report

2 major / 2 minor

Summary. The paper claims that single-reference unseen object 6D pose estimation can be improved by building the correspondence pipeline around an early cross-view semantic interaction (CVSI) module that lets dense VFM tokens from query and reference views exchange semantic context; two training-time constraints (IVSP loss preserving intra-view token affinity and RAGC loss enforcing reference-anchored spatial consistency of decoded points) are introduced to keep the resulting prior reliable for rigid 3D correspondence; final poses are recovered by weighted SVD; a new challenging view-pair protocol is constructed from YCB-V and TUD-L; and extensive experiments on six benchmarks under varied view-pair settings report state-of-the-art performance at comparable inference speed.

Significance. If the reported results hold, the work would demonstrate a practical way to inject cross-view semantic context into VFM-based correspondence pipelines while mitigating the two risks (disturbance of token structure and lack of 3D consistency) explicitly flagged in the abstract; the construction of a new view-pair protocol from BOP datasets is a concrete contribution that could aid future robustness evaluations.

major comments (2)

[Method description (paragraph after CVSI definition)] The paragraph beginning 'To make this CVSI prior reliable...' asserts that IVSP and RAGC together suffice to prevent new mismatches and preserve VFM token discriminability, yet the supplied text contains no quantitative ablation isolating their individual effects or failure cases where the constraints are insufficient; because the SOTA claim rests directly on this sufficiency, the absence of such evidence is load-bearing for the central empirical argument.
[Abstract and Experiments section] The abstract states that 'extensive experiments on six benchmarks ... show that our method achieves state-of-the-art performance' but supplies neither the numerical margins, per-benchmark tables, nor error-analysis breakdowns; without these data the central claim cannot be verified and the soundness assessment remains provisional.

minor comments (2)

[Abstract] The abstract is unusually long; condensing the motivation and results paragraphs would improve readability while retaining all technical claims.
[Pose recovery paragraph] Notation for the weighted SVD step is introduced without an explicit equation reference; adding an equation label would clarify how the learned correspondences are converted to pose.
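For orientation, the standard weighted Procrustes formulation that such an equation would presumably follow is written out below. The symbols (matched points p_i and q_i, weights w_i) are illustrative and not taken from the paper.

```latex
\begin{aligned}
(R^\ast, t^\ast) &= \arg\min_{R \in SO(3),\; t \in \mathbb{R}^3}
  \sum_{i=1}^{N} w_i \,\lVert R\,p_i + t - q_i \rVert_2^2, \\
\bar{p} &= \frac{\sum_i w_i\, p_i}{\sum_i w_i}, \qquad
\bar{q} = \frac{\sum_i w_i\, q_i}{\sum_i w_i}, \\
H &= \sum_{i=1}^{N} w_i\,(p_i - \bar{p})(q_i - \bar{q})^\top = U \Sigma V^\top, \\
R^\ast &= V \,\mathrm{diag}\bigl(1,\, 1,\, \det(V U^\top)\bigr)\, U^\top, \qquad
t^\ast = \bar{q} - R^\ast \bar{p}.
\end{aligned}
```

The determinant term guards against a reflection when the correspondences are noisy or nearly planar.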

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our contributions.

Referee: [Method description (paragraph after CVSI definition)] The paragraph beginning 'To make this CVSI prior reliable...' asserts that IVSP and RAGC together suffice to prevent new mismatches and preserve VFM token discriminability, yet the supplied text contains no quantitative ablation isolating their individual effects or failure cases where the constraints are insufficient; because the SOTA claim rests directly on this sufficiency, the absence of such evidence is load-bearing for the central empirical argument.

Authors: We agree with the referee that the method description would be improved by including quantitative evidence for the sufficiency of IVSP and RAGC. We will add ablation studies isolating their individual effects and discuss failure cases in the revised manuscript. This will support the central empirical argument more robustly. revision: yes
Referee: [Abstract and Experiments section] The abstract states that 'extensive experiments on six benchmarks ... show that our method achieves state-of-the-art performance' but supplies neither the numerical margins, per-benchmark tables, nor error-analysis breakdowns; without these data the central claim cannot be verified and the soundness assessment remains provisional.

Authors: We agree that providing numerical margins and breakdowns would strengthen the abstract and experiments section. We will revise the abstract to include specific performance numbers and margins, and expand the experiments section with per-benchmark tables and error-analysis breakdowns to allow verification of the SOTA claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines CVSI interaction plus two explicit training constraints (IVSP loss preserving intra-view affinity; RAGC loss enforcing reference-anchored geometric consistency) as new modules whose effectiveness is measured on external BOP benchmarks. No equations or steps reduce by construction to fitted inputs, self-citations, or renamed prior results. The central claim rests on the stated constraints being effective, which is an empirical question evaluated outside the derivation itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 3 invented entities

Ledger constructed from components explicitly named in the abstract only; no numerical hyperparameters or external benchmarks are provided.

free parameters (1)

weights of IVSP and RAGC losses
Standard in loss-based training; values are not stated in the abstract but must be chosen to balance the constraints.
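One common way such weights enter training is a weighted sum of loss terms, written below as an assumed form. The base correspondence term and the symbols λ_IVSP and λ_RAGC are placeholders; the summary does not state the paper's actual objective or weight values.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{corr}}
  + \lambda_{\text{IVSP}}\,\mathcal{L}_{\text{IVSP}}
  + \lambda_{\text{RAGC}}\,\mathcal{L}_{\text{RAGC}}
```

Under this form, setting either weight to zero corresponds to the single-loss ablations discussed under "What would settle it".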

axioms (2)

domain assumption: VFM tokens contain dense visual-semantic cues (appearance, structure, context) that benefit from cross-view exchange before geometric decoding
Invoked when describing why intra-view processing is insufficient.
domain assumption: Rigid correspondence requires 3D representation consistency of decoded point features
Stated as the reason the semantic prior still needs additional constraints.

invented entities (3)

Cross-View Semantic Interaction (CVSI) · no independent evidence
purpose: Enables dense query and reference VFM tokens to exchange semantic context and form a cross-view prior
New interaction module introduced to address the stated limitation of prior pipelines.
Intra-View Structure Preservation (IVSP) loss · no independent evidence
purpose: Preserves original intra-view token affinity structure during interaction
New training constraint introduced to counteract disturbance from CVSI.
Reference-Anchored Geometric Consistency (RAGC) loss · no independent evidence
purpose: Enforces spatial representation consistency of decoded point features
New training constraint introduced to ensure reliability for rigid correspondence.

Reference graph

Works this paper leans on

77 extracted references · 4 linked inside Pith

[1]

A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms

Oliver Kroemer, Scott Niekum, and George Konidaris. “A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms”. In: JMLR 22.30 (2021), pp. 1–82

2021
[2]

Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

Guoguang Du et al. “Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review”. In: Artificial Intelligence Review 54.3 (2021), pp. 1677–1734

2021
[3]

Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud

Jianwei Guo et al. “Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud”. In: IEEE Transactions on Image Processing 30 (2021), pp. 5072–5084

2021
[4]

Domain-Translated 3D Object Pose Estimation

Christos Papaioannidis, Vasileios Mygdalis, and Ioannis Pitas. “Domain-Translated 3D Object Pose Estimation”. In: IEEE Transactions on Image Processing 29 (2020), pp. 9279–9291

2020
[5]

Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image

Yinyu Nie et al. “Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image”. In: CVPR. 2020, pp. 55–64

2020
[6]

Cooperative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation

Siyuan Huang et al. “Cooperative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation”. In: NeurIPS. 2018

2018
[7]

Pose estimation for augmented reality: a hands-on survey

Eric Marchand, Hideaki Uchiyama, and Fabien Spindler. “Pose estimation for augmented reality: a hands-on survey”. In: IEEE Transactions on Visualization and Computer Graphics 22.12 (2015), pp. 2633–2651

2015
[8]

Deep multi-state object pose estimation for augmented reality assembly

Yongzhi Su et al. “Deep multi-state object pose estimation for augmented reality assembly”. In: IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). 2019, pp. 222–227

2019
[9]

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

Yu Xiang et al. “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes”. In: RSS. 2018

2018
[10]

Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation

Yongzhi Su et al. “Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation”. In: CVPR. 2022, pp. 6738–6748

2022
[11]

PVNet: Pixel-Wise Voting Network for 6DoF Object Pose Estimation

Sida Peng et al. “PVNet: Pixel-Wise Voting Network for 6DoF Object Pose Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 44.6 (2022), pp. 3212–3223

2022
[12]

Resolving Symmetry Ambiguity in Correspondence-Based Methods for Instance-Level Object Pose Estimation

Yongliang Lin et al. “Resolving Symmetry Ambiguity in Correspondence-Based Methods for Instance-Level Object Pose Estimation”. In: IEEE Transactions on Image Processing 34 (2025), pp. 1700–1711

2025
[13]

Gdrnpp: A geometry-guided and fully learning-based object pose estimator

Xingyu Liu et al. “Gdrnpp: A geometry-guided and fully learning-based object pose estimator”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025
[14]

Line-Based 6-DoF Object Pose Estimation and Tracking With an Event Camera

Zibin Liu et al. “Line-Based 6-DoF Object Pose Estimation and Tracking With an Event Camera”. In: IEEE Transactions on Image Processing 33 (2024), pp. 4765–4780

2024
[15]

Normalized object coordinate space for category-level 6D object pose and size estimation

He Wang et al. “Normalized object coordinate space for category-level 6D object pose and size estimation”. In: CVPR. 2019, pp. 2642–2651

SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING

2019
[16]

6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning

Lu Zou et al. “6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning”. In: IEEE Transactions on Image Processing 31 (2022), pp. 6907–6921

2022
[17]

Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation from Point Clouds

Xiaolong Li et al. “Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation from Point Clouds”. In: NeurIPS. 2021

2021
[18]

Secondpose: Se(3)-consistent dual-stream feature fusion for category-level pose estimation

Yamei Chen et al. “Secondpose: Se(3)-consistent dual-stream feature fusion for category-level pose estimation”. In: CVPR. 2024, pp. 9959–9969

2024
[19]

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

Huan Ren et al. “ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation”. In: CVPR. 2026, pp. 14315–14324

2026
[20]

Deep learning-based object pose estimation: A comprehensive survey

Jian Liu et al. “Deep learning-based object pose estimation: A comprehensive survey”. In: arXiv preprint arXiv:2405.07801 (2024)

arXiv 2024
[21]

Challenges for monocular 6-d object pose estimation in robotics

Stefan Thalhammer et al. “Challenges for monocular 6-d object pose estimation in robotics”. In: IEEE Transactions on Robotics 40 (2024), pp. 4065–4084

2024
[22]

Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation

Keunhong Park et al. “Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation”. In: CVPR. 2020, pp. 10710–10719

2020
[23]

MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

Yann Labbé et al. “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare”. In: CoRL. PMLR. 2023, pp. 715–725

2023
[24]

FoundPose: Unseen Object Pose Estimation with Foundation Features

Evin Pınar Örnek et al. “FoundPose: Unseen Object Pose Estimation with Foundation Features”. In: CVPR. 2024

2024
[25]

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen et al. “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects”. In: CVPR. 2024

2024
[26]

Sam-6d: Segment anything model meets zero-shot 6d object pose estimation

Jiehong Lin et al. “Sam-6d: Segment anything model meets zero-shot 6d object pose estimation”. In: CVPR. 2024

2024
[27]

PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting

Wubin Shi, Shaoyan Gai, and Feipeng Da. “PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting”. In: CVPR. 2026, pp. 4698–4707

2026
[28]

UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image

Xingyu Liu et al. “UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image”. In: CVPR. June 2025, pp. 22023–22034

2025
[29]

COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

Yuchen Che et al. “COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation”. In: CVPR. 2026, pp. 11567–11578

2026
[30]

Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration

Jian Liu et al. “Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration”. In: IEEE Transactions on Robotics 42 (2026), pp. 1884–1901

2026
[31]

Nope: Novel object pose estimation from a single image

Van Nguyen Nguyen et al. “Nope: Novel object pose estimation from a single image”. In: CVPR. 2024, pp. 17923–17932

2024
[32]

Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference

Zhiwen Fan et al. “Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference”. In: CVPR Workshops. 2024

2024
[33]

Open-vocabulary object 6D pose estimation

Jaime Corsetti et al. “Open-vocabulary object 6D pose estimation”. In: CVPR. 2024

2024
[34]

High-Resolution Open-Vocabulary Object 6D Pose Estimation

Jaime Corsetti et al. “High-Resolution Open-Vocabulary Object 6D Pose Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 48.2 (2026), pp. 2066–2077

2026
[35]

One2Any: One-Reference 6D Pose Estimation for Any Object

Mengya Liu et al. “One2Any: One-Reference 6D Pose Estimation for Any Object”. In: CVPR. 2025, pp. 6457–6467

2025
[36]

Any6D: Model-free 6D pose estimation of novel objects

Taeyeop Lee et al. “Any6D: Model-free 6D pose estimation of novel objects”. In: CVPR. 2025, pp. 11633–11643

2025
[37]

CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

Dexin Zuo et al. “CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation”. In: AAAI. Vol. 40. 16. 2026, pp. 14122–14130

2026
[38]

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang et al. “ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors”. In: CVPR. 2026, pp. 26582–26592

2026
[39]

Dinov2: Learning robust visual features without supervision

Maxime Oquab et al. “Dinov2: Learning robust visual features without supervision”. In: arXiv preprint arXiv:2304.07193 (2023)

Pith/arXiv arXiv 2023
[40]

Dinov3

Oriane Siméoni et al. “Dinov3”. In: arXiv preprint arXiv:2508.10104 (2025)

Pith/arXiv arXiv 2025
[41]

Geotransformer: Fast and robust point cloud registration with geometric transformer

Zheng Qin et al. “Geotransformer: Fast and robust point cloud registration with geometric transformer”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)

2023
[42]

BOP: Benchmark for 6D Object Pose Estimation

Tomas Hodan et al. “BOP: Benchmark for 6D Object Pose Estimation”. In: ECCV. 2018, pp. 19–34

2018
[43]

Learning 6D object pose estimation using 3D object coordinates

Eric Brachmann et al. “Learning 6D object pose estimation using 3D object coordinates”. In: ECCV. 2014

2014
[44]

Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes

Stefan Hinterstoisser et al. “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes”. In: ACCV. Springer. 2012, pp. 548–562

2012
[45]

GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence

Van Nguyen Nguyen et al. “GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence”. In: CVPR. 2024

2024
[46]

Segment anything

Alexander Kirillov et al. “Segment anything”. In: ICCV. 2023, pp. 4015–4026

2023
[47]

Onepose: One-shot object pose estimation without cad models

Jiaming Sun et al. “Onepose: One-shot object pose estimation without cad models”. In: CVPR. 2022, pp. 6825–6834

2022
[48]

Onepose++: Keypoint-free one-shot object pose estimation without CAD models

Xingyi He et al. “Onepose++: Keypoint-free one-shot object pose estimation without CAD models”. In: NeurIPS. 2022, pp. 35103–35115

2022
[49]

Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images

Yuan Liu et al. “Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images”. In: ECCV. 2022

2022
[50]

Fs6d: Few-shot 6d pose estimation of novel objects

Yisheng He et al. “Fs6d: Few-shot 6d pose estimation of novel objects”. In: CVPR. 2022, pp. 6814–6824

2022
[51]

A Method for Registration of 3-D Shapes

Paul J. Besl and Neil D. McKay. “A Method for Registration of 3-D Shapes”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 14.2 (1992), pp. 239–256

1992
[52]

Efficient variants of the ICP algorithm

Szymon Rusinkiewicz and Marc Levoy. “Efficient variants of the ICP algorithm”. In: Proceedings third international conference on 3-D digital imaging and modeling. IEEE. 2001, pp. 145–152

2001
[53]

Fast point feature histograms (FPFH) for 3D registration

Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. “Fast point feature histograms (FPFH) for 3D registration”. In: ICRA. IEEE. 2009, pp. 3212–3217

2009
[54]

Model globally, match locally: Efficient and robust 3D object recognition

Bertram Drost et al. “Model globally, match locally: Efficient and robust 3D object recognition”. In: CVPR. IEEE. 2010, pp. 998–1005

2010
[55]

Predator: Registration of 3d point clouds with low overlap

Shengyu Huang et al. “Predator: Registration of 3d point clouds with low overlap”. In: CVPR. 2021, pp. 4267–4276

2021
[56]

LoFTR: Detector-free local feature matching with transformers

Jiaming Sun et al. “LoFTR: Detector-free local feature matching with transformers”. In: CVPR. 2021, pp. 8922–8931

2021
[57]

O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

Lorenzo Mur-Labadia et al. “O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views”. In: ICCV. 2025

2025
[58]

V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan et al. “V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence”. In: arXiv preprint arXiv:2511.20886 (2025)

Pith/arXiv arXiv 2025
[59]

Vggt: Visual geometry grounded transformer

Jianyuan Wang et al. “Vggt: Visual geometry grounded transformer”. In: CVPR. 2025, pp. 5294–5306

2025
[60]

PointNet: Deep learning on point sets for 3D classification and segmentation

Charles R Qi et al. “PointNet: Deep learning on point sets for 3D classification and segmentation”. In: CVPR. 2017

2017
[61]

Unsupervised semantic segmentation by distilling feature correspondences

Mark Hamilton et al. “Unsupervised semantic segmentation by distilling feature correspondences”. In: arXiv preprint arXiv:2203.08414 (2022)

arXiv 2022
[62]

Representa- tion learning with contrastive predictive coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representa- tion learning with contrastive predictive coding”. In:arXiv preprint arXiv:1807.03748(2018)

Pith/arXiv arXiv 2018
[63]

PyTorch: An Imperative Style, High-performance Deep Learning Library

Adam Paszke et al. “PyTorch: An Imperative Style, High-performance Deep Learning Library”. In:NeurIPS. 2019, pp. 8026–8037

2019
[64]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy et al. “An image is worth 16x16 words: Trans- formers for image recognition at scale”. In:ICLR. 2021

2021
[65]

BOP Challenge 2023 on Detection, Segmentation and Pose Estimation of Seen and Unseen Rigid Objects

Tomas Hodan et al. “BOP Challenge 2023 on Detection, Segmentation and Pose Estimation of Seen and Unseen Rigid Objects”. In:CVPR Workshops. 2024

2023
[66]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:ICLR. Ed. by Yoshua Bengio and Yann LeCun. 2015

2015
[67]

SGDR: Stochastic Gradient Descent with Warm Restarts

Frank Hutter Ilya Loshchilov. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In:ICLR. 2017

2017
[68]

On evaluation of 6D object pose estimation

Tom ´aˇs Hoda ˇn, Ji ˇr´ı Matas, and ˇStˇep´an Obdr ˇz´alek. “On evaluation of 6D object pose estimation”. In:ECCV. Springer. 2016, pp. 606–619

2016
[69]

3D Registration with Maximal Cliques

Xiyu Zhang et al. “3D Registration with Maximal Cliques”. In:CVPR. 2023, pp. 17745–17754

2023
[70]

Fully convolu- tional geometric features

Christopher Choy, Jaesik Park, and Vladlen Koltun. “Fully convolu- tional geometric features”. In:ICCV. 2019, pp. 8958–8966

2019
[71]

UTOPIC: Uncertainty-aware Overlap Prediction Network for Partial Point Cloud Registration

Zhilei Chen et al. “UTOPIC: Uncertainty-aware Overlap Prediction Network for Partial Point Cloud Registration”. In:Computer Graphics Forum41 (2022), pp. 87–98

2022
[72]

Learning general and distinctive 3D local deep descriptors for point cloud registration

Fabio Poiesi and Davide Boscaini. “Learning general and distinctive 3D local deep descriptors for point cloud registration”. In:IEEE Transactions on Pattern Analysis and Machine Intelligence45.3 (2022), pp. 3979–3985

2022
[73]

FreeZe: Training-free Zero-shot 6D Pose Es- timation with Geometric and Vision Foundation Models

Andrea Caraffa et al. “FreeZe: Training-free Zero-shot 6D Pose Es- timation with Geometric and Vision Foundation Models”. In:ECCV. 2024

2024
[74]

Posediffu- sion: Solving pose estimation via diffusion-aided bundle adjustment

Jianyuan Wang, Christian Rupprecht, and David Novotny. “Posediffu- sion: Solving pose estimation via diffusion-aided bundle adjustment”. In:ICCV. 2023, pp. 9773–9783

2023
[75]

Relpose++: Recovering 6d poses from sparse-view observations

Amy Lin et al. “Relpose++: Recovering 6d poses from sparse-view observations”. In:3DV. 2024

2024
[76]

Object recognition from local scale-invariant fea- tures

David G Lowe. “Object recognition from local scale-invariant fea- tures”. In:ICCV. V ol. 2. IEEE. 1999, pp. 1150–1157

1999
[77]

Objectmatch: Ro- bust registration using canonical object correspondences

Can G ¨umeli, Angela Dai, and Matthias Nießner. “Objectmatch: Ro- bust registration using canonical object correspondences”. In:CVPR. 2023, pp. 13082–13091

2023

[1] Oliver Kroemer, Scott Niekum, and George Konidaris. “A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms”. In: JMLR 22.30 (2021), pp. 1–82

[2] Guoguang Du et al. “Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review”. In: Artificial Intelligence Review 54.3 (2021), pp. 1677–1734

[3] Jianwei Guo et al. “Efficient Center Voting for Object Detection and 6D Pose Estimation in 3D Point Cloud”. In: IEEE Transactions on Image Processing 30 (2021), pp. 5072–5084

[4] Christos Papaioannidis, Vasileios Mygdalis, and Ioannis Pitas. “Domain-Translated 3D Object Pose Estimation”. In: IEEE Transactions on Image Processing 29 (2020), pp. 9279–9291

[5] Yinyu Nie et al. “Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image”. In: CVPR. 2020, pp. 55–64

[6] Siyuan Huang et al. “Cooperative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation”. In: NeurIPS. 2018

[7] Eric Marchand, Hideaki Uchiyama, and Fabien Spindler. “Pose estimation for augmented reality: a hands-on survey”. In: IEEE Transactions on Visualization and Computer Graphics 22.12 (2015), pp. 2633–2651

[8] Yongzhi Su et al. “Deep multi-state object pose estimation for augmented reality assembly”. In: IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). 2019, pp. 222–227

[9] Yu Xiang et al. “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes”. In: RSS. 2018

[10] Yongzhi Su et al. “Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation”. In: CVPR. 2022, pp. 6738–6748

[11] Sida Peng et al. “PVNet: Pixel-Wise Voting Network for 6DoF Object Pose Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 44.6 (2022), pp. 3212–3223

[12] Yongliang Lin et al. “Resolving Symmetry Ambiguity in Correspondence-Based Methods for Instance-Level Object Pose Estimation”. In: IEEE Transactions on Image Processing 34 (2025), pp. 1700–1711

[13] Xingyu Liu et al. “Gdrnpp: A geometry-guided and fully learning-based object pose estimator”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

[14] Zibin Liu et al. “Line-Based 6-DoF Object Pose Estimation and Tracking With an Event Camera”. In: IEEE Transactions on Image Processing 33 (2024), pp. 4765–4780

[15] He Wang et al. “Normalized object coordinate space for category-level 6D object pose and size estimation”. In: CVPR. 2019, pp. 2642–2651

[16] Lu Zou et al. “6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning”. In: IEEE Transactions on Image Processing 31 (2022), pp. 6907–6921

[17] Xiaolong Li et al. “Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation from Point Clouds”. In: NeurIPS. 2021

[18] Yamei Chen et al. “Secondpose: SE(3)-consistent dual-stream feature fusion for category-level pose estimation”. In: CVPR. 2024, pp. 9959–9969

[19] Huan Ren et al. “ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation”. In: CVPR. 2026, pp. 14315–14324

[20] Jian Liu et al. “Deep learning-based object pose estimation: A comprehensive survey”. In: arXiv preprint arXiv:2405.07801 (2024)

[21] Stefan Thalhammer et al. “Challenges for monocular 6-d object pose estimation in robotics”. In: IEEE Transactions on Robotics 40 (2024), pp. 4065–4084

[22] Keunhong Park et al. “Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation”. In: CVPR. 2020, pp. 10710–10719

[23] Yann Labbé et al. “MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare”. In: CoRL. PMLR. 2023, pp. 715–725

[24] Evin Pınar Örnek et al. “FoundPose: Unseen Object Pose Estimation with Foundation Features”. In: CVPR. 2024

[25] Bowen Wen et al. “FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects”. In: CVPR. 2024

[26] Jiehong Lin et al. “Sam-6d: Segment anything model meets zero-shot 6d object pose estimation”. In: CVPR. 2024

[27] Wubin Shi, Shaoyan Gai, and Feipeng Da. “PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting”. In: CVPR. 2026, pp. 4698–4707

[28] Xingyu Liu et al. “UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image”. In: CVPR. June 2025, pp. 22023–22034

[29] Yuchen Che et al. “COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation”. In: CVPR. 2026, pp. 11567–11578

[30] Jian Liu et al. “Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration”. In: IEEE Transactions on Robotics 42 (2026), pp. 1884–1901

[31] Van Nguyen Nguyen et al. “Nope: Novel object pose estimation from a single image”. In: CVPR. 2024, pp. 17923–17932

[32] Zhiwen Fan et al. “Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference”. In: CVPR Workshops. 2024

[33] Jaime Corsetti et al. “Open-vocabulary object 6D pose estimation”. In: CVPR. 2024

[34] Jaime Corsetti et al. “High-Resolution Open-Vocabulary Object 6D Pose Estimation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 48.2 (2026), pp. 2066–2077

[35] Mengya Liu et al. “One2Any: One-Reference 6D Pose Estimation for Any Object”. In: CVPR. 2025, pp. 6457–6467

[36] Taeyeop Lee et al. “Any6D: Model-free 6D pose estimation of novel objects”. In: CVPR. 2025, pp. 11633–11643

[37] Dexin Zuo et al. “CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation”. In: AAAI. Vol. 40. 16. 2026, pp. 14122–14130

[38] Liming Kuang et al. “ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors”. In: CVPR. 2026, pp. 26582–26592

[39] Maxime Oquab et al. “Dinov2: Learning robust visual features without supervision”. In: arXiv preprint arXiv:2304.07193 (2023)

[40] Oriane Siméoni et al. “Dinov3”. In: arXiv preprint arXiv:2508.10104 (2025)

[41] Zheng Qin et al. “Geotransformer: Fast and robust point cloud registration with geometric transformer”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)

[42] Tomas Hodan et al. “BOP: Benchmark for 6D Object Pose Estimation”. In: ECCV. 2018, pp. 19–34

[43] Eric Brachmann et al. “Learning 6D object pose estimation using 3D object coordinates”. In: ECCV. 2014

[44] Stefan Hinterstoisser et al. “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes”. In: ACCV. Springer. 2012, pp. 548–562

[45] Van Nguyen Nguyen et al. “GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence”. In: CVPR. 2024

[46] Alexander Kirillov et al. “Segment anything”. In: ICCV. 2023, pp. 4015–4026

[47] Jiaming Sun et al. “Onepose: One-shot object pose estimation without cad models”. In: CVPR. 2022, pp. 6825–6834

[48] Xingyi He et al. “Onepose++: Keypoint-free one-shot object pose estimation without CAD models”. In: NeurIPS. 2022, pp. 35103–35115

[49] Yuan Liu et al. “Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images”. In: ECCV. 2022

[50] Yisheng He et al. “Fs6d: Few-shot 6d pose estimation of novel objects”. In: CVPR. 2022, pp. 6814–6824

[51] Paul J. Besl and Neil D. McKay. “A Method for Registration of 3-D Shapes”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 14.2 (1992), pp. 239–256

[52] Szymon Rusinkiewicz and Marc Levoy. “Efficient variants of the ICP algorithm”. In: Proceedings third international conference on 3-D digital imaging and modeling. IEEE. 2001, pp. 145–152

[53] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. “Fast point feature histograms (FPFH) for 3D registration”. In: ICRA. IEEE. 2009, pp. 3212–3217

[54] Bertram Drost et al. “Model globally, match locally: Efficient and robust 3D object recognition”. In: CVPR. IEEE. 2010, pp. 998–1005

[55] Shengyu Huang et al. “Predator: Registration of 3d point clouds with low overlap”. In: CVPR. 2021, pp. 4267–4276

[56] Jiaming Sun et al. “LoFTR: Detector-free local feature matching with transformers”. In: CVPR. 2021, pp. 8922–8931

[57] Lorenzo Mur-Labadia et al. “O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views”. In: ICCV. 2025

[58] Jiancheng Pan et al. “V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence”. In: arXiv preprint arXiv:2511.20886 (2025)

[59] Jianyuan Wang et al. “Vggt: Visual geometry grounded transformer”. In: CVPR. 2025, pp. 5294–5306

[60] Charles R Qi et al. “PointNet: Deep learning on point sets for 3D classification and segmentation”. In: CVPR. 2017

[61] Mark Hamilton et al. “Unsupervised semantic segmentation by distilling feature correspondences”. In: arXiv preprint arXiv:2203.08414 (2022)

[62] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation learning with contrastive predictive coding”. In: arXiv preprint arXiv:1807.03748 (2018)

[63] Adam Paszke et al. “PyTorch: An Imperative Style, High-performance Deep Learning Library”. In: NeurIPS. 2019, pp. 8026–8037

[64] Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: ICLR. 2021

[65] Tomas Hodan et al. “BOP Challenge 2023 on Detection, Segmentation and Pose Estimation of Seen and Unseen Rigid Objects”. In: CVPR Workshops. 2024

[66] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: ICLR. Ed. by Yoshua Bengio and Yann LeCun. 2015

[67] Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In: ICLR. 2017

[68] Tomáš Hodaň, Jiří Matas, and Štěpán Obdržálek. “On evaluation of 6D object pose estimation”. In: ECCV. Springer. 2016, pp. 606–619

[69] Xiyu Zhang et al. “3D Registration with Maximal Cliques”. In: CVPR. 2023, pp. 17745–17754

[70] Christopher Choy, Jaesik Park, and Vladlen Koltun. “Fully convolutional geometric features”. In: ICCV. 2019, pp. 8958–8966

[71] Zhilei Chen et al. “UTOPIC: Uncertainty-aware Overlap Prediction Network for Partial Point Cloud Registration”. In: Computer Graphics Forum 41 (2022), pp. 87–98

[72] Fabio Poiesi and Davide Boscaini. “Learning general and distinctive 3D local deep descriptors for point cloud registration”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 45.3 (2022), pp. 3979–3985

[73] Andrea Caraffa et al. “FreeZe: Training-free Zero-shot 6D Pose Estimation with Geometric and Vision Foundation Models”. In: ECCV. 2024

[74] Jianyuan Wang, Christian Rupprecht, and David Novotny. “Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment”. In: ICCV. 2023, pp. 9773–9783

[75] Amy Lin et al. “Relpose++: Recovering 6d poses from sparse-view observations”. In: 3DV. 2024

[76] David G Lowe. “Object recognition from local scale-invariant features”. In: ICCV. Vol. 2. IEEE. 1999, pp. 1150–1157

[77] Can Gümeli, Angela Dai, and Matthias Nießner. “Objectmatch: Robust registration using canonical object correspondences”. In: CVPR. 2023, pp. 13082–13091