arxiv: 2512.09056 · v3 · submitted 2025-12-09 · 💻 cs.CV

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang , Yordanka Velikova , Mahdi Saleh , Jan-Nico Zaech , Danda Pani Paudel , Benjamin Busam This is my paper

Pith reviewed 2026-05-16 23:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords object pose estimationzero-shot learningvision-language models6DoF posetraining-freeconcept vectorsrelative posesaliency maps

0 comments

The pith

Concept vectors from vision-language model saliency maps enable training-free 6DoF relative pose estimation by matching 3D-3D correspondences across views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ConceptPose as a way to recover the 6DoF pose between two views of an object without any training on that object or on a dataset of similar objects. It extracts concept vectors at each 3D point by querying a vision-language model with saliency maps, then uses those vectors to identify matching points between the two views and solve for the rigid transformation. This produces accurate pose estimates on standard zero-shot benchmarks and outperforms the best competing approach, even when the competitor uses extensive dataset-specific training. The result matters because it shows that general-purpose language-vision models already contain enough geometric structure to support precise 3D alignment without per-task fine-tuning or 3D model supervision.

Core claim

ConceptPose builds open-vocabulary 3D concept maps in which every surface point receives a concept vector derived from VLM saliency responses; robust 3D-3D correspondences are then found by matching these vectors across two views, allowing direct recovery of the relative 6DoF pose. The method requires no object-specific training, no 3D models, and no dataset collection, yet reports a 62% relative gain in average ADD(-S) score over the strongest prior baseline on common zero-shot pose benchmarks.

What carries the argument

Concept vectors extracted from VLM saliency maps, which tag 3D points with semantic descriptors used to establish repeatable 3D-3D correspondences for pose solving.

If this is right

Any object whose parts can be described in natural language becomes a candidate for zero-shot pose recovery.
Pose estimation pipelines no longer need separate training stages or 3D mesh acquisition for each new object class.
Relative pose can be computed directly from two RGB images using only a frozen VLM and standard correspondence solvers.
Performance scales with improvements in the underlying vision-language model rather than with additional pose-specific data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same concept-vector representation may support downstream tasks such as part-based manipulation or object-centric SLAM without retraining.
Failure cases are likely to appear on objects whose surface regions produce nearly identical saliency responses, suggesting targeted prompt engineering as a practical extension.
Because the method never sees 3D geometry during training, it offers a route to pose estimation in domains where 3D ground truth is expensive or impossible to obtain.

Load-bearing premise

The concept vectors produced by VLM saliency maps remain sufficiently consistent across different viewpoints of the same unseen object to yield reliable 3D point matches.

What would settle it

An experiment showing that, for a set of object pairs with known ground-truth poses, the nearest-neighbor matches based on concept vectors produce pose errors larger than random guessing on more than half the pairs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.09056 by Benjamin Busam, Danda Pani Paudel, Jan-Nico Zaech, Liming Kuang, Mahdi Saleh, Yordanka Velikova.

**Figure 1.** Figure 1: From language concepts to 6D Pose: ConceptPose uses language-driven concepts to create 3D concept maps and match them across views for training-free 6D pose estimation. To overcome this rigidity, the community has increasingly focused on category-level [22, 50] and zero shot relative pose estimation [7, 23, 26, 30, 32], which aim to handle previously unseen objects. In parallel, a powerful new trend has … view at source ↗

**Figure 2.** Figure 2: Overview of the ConceptPose pipeline for zero shot relative pose estimation. Given an anchor-query RGB-D pair and category [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative results of ConceptPose’s zero shot relative pose estimation on REAL275, Toyota-Light, and YCB-Video. For each [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation study on concepts quantity on TYOL dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative visualization of performance changes across [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results (1/2), estimated query pose, ground truth anchor/query pose [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative results (2/2), estimated query pose, ground truth anchor/query pose [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

read the original abstract

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, outperforming the strongest baseline by a relative 62\% in average ADD(-S) score, including methods that utilize extensive dataset-specific training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ConceptPose gives a clean training-free route to zero-shot 6DoF pose via VLM saliency turned into 3D concept vectors, but the correspondence step still needs direct evidence on repeatability.

read the letter

The paper's main contribution is a pipeline that extracts open-vocabulary concept vectors from VLM saliency maps, lifts them to 3D, and solves relative pose through standard 3D-3D matching. It reports a 62% relative gain in ADD(-S) on zero-shot benchmarks without any object-specific training. That construction itself is new relative to the cited baselines and avoids the usual dataset collection step, which is the practical hook for robotics and AR work. The approach stays lightweight by reusing an off-the-shelf VLM and existing correspondence solvers, which keeps the method simple to implement if the numbers hold. The central weakness is exactly the one flagged in the stress-test note: there is no reported data on how stable the saliency-derived vectors are across viewpoints, what the inlier rates look like, or how the method behaves on textureless or symmetric objects where VLM attention tends to be inconsistent. Without those checks or ablations, the large benchmark jump cannot yet be pinned on the proposed mechanism rather than on the particular test sets or VLM backbone. The math and citation pattern look standard; nothing circular or invented is visible from the abstract. This is worth a reading group discussion for anyone working on reducing training data in 3D vision. It deserves peer review because the idea is straightforward, the claimed gains are large, and a referee can ask for the missing robustness numbers without the paper being incoherent on its own terms.

Referee Report

3 major / 2 minor

Summary. The paper introduces ConceptPose, a training-free zero-shot method for 6DoF object pose estimation. It uses a pretrained VLM to generate open-vocabulary 3D concept maps by tagging points with concept vectors derived from 2D saliency maps, then solves for relative pose via 3D-3D correspondences and standard solvers. The central claim is that this achieves state-of-the-art results on common zero-shot relative pose benchmarks, outperforming the strongest baseline by 62% relative in average ADD(-S), including methods that use extensive dataset-specific training.

Significance. If the central claim holds with robust evidence, the result would be significant: it would show that VLM saliency-derived concept vectors can produce sufficiently repeatable 3D correspondences for accurate pose recovery on unseen objects, eliminating the need for object- or dataset-specific training and outperforming supervised baselines. This could shift practice in robotics and vision toward purely zero-shot pipelines.

major comments (3)

[Abstract] Abstract: the claim of a 62% relative ADD(-S) improvement and SOTA performance is presented without naming the specific benchmarks, the full list of baselines (zero-shot vs. trained), absolute scores, or any ablation isolating the contribution of the concept vectors versus the correspondence solver or VLM backbone.
[Abstract] The manuscript provides no quantitative analysis of 3D-3D correspondence quality (inlier rates, repeatability across viewpoint changes, or failure modes on textureless/symmetric objects), which is load-bearing for attributing the reported pose accuracy to the proposed concept-vector mechanism rather than the external VLM or solver.
[Abstract] The lifting of 2D VLM saliency maps into 3D concept maps (including how per-point concept vectors are computed and made viewpoint-invariant) lacks algorithmic detail or pseudocode, preventing verification that the resulting correspondences are dense and stable enough for 6DoF recovery on unseen objects.

minor comments (2)

[Abstract] The term 'model-free' in the abstract is potentially misleading given the reliance on a pretrained VLM; clarify the intended meaning.
Add references to prior VLM-based correspondence or zero-shot pose work to better situate the novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the comments identify areas for improved clarity or additional evidence, we have revised the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of a 62% relative ADD(-S) improvement and SOTA performance is presented without naming the specific benchmarks, the full list of baselines (zero-shot vs. trained), absolute scores, or any ablation isolating the contribution of the concept vectors versus the correspondence solver or VLM backbone.

Authors: We agree that the abstract would be strengthened by greater specificity. In the revised manuscript we have expanded the abstract to name the evaluation benchmarks, distinguish zero-shot from trained baselines, report absolute ADD(-S) scores in addition to the relative improvement, and briefly reference the ablations that isolate the concept-vector contribution. Full ablation tables and baseline lists remain in the main text to respect abstract length constraints. revision: yes
Referee: [Abstract] The manuscript provides no quantitative analysis of 3D-3D correspondence quality (inlier rates, repeatability across viewpoint changes, or failure modes on textureless/symmetric objects), which is load-bearing for attributing the reported pose accuracy to the proposed concept-vector mechanism rather than the external VLM or solver.

Authors: We acknowledge that explicit correspondence-quality metrics would strengthen attribution of the performance gains. The revised manuscript now includes a dedicated quantitative analysis section reporting inlier rates, viewpoint-repeatability scores, and failure-mode breakdowns on textureless and symmetric objects. These results are presented alongside the end-to-end pose metrics to isolate the contribution of the concept vectors. revision: yes
Referee: [Abstract] The lifting of 2D VLM saliency maps into 3D concept maps (including how per-point concept vectors are computed and made viewpoint-invariant) lacks algorithmic detail or pseudocode, preventing verification that the resulting correspondences are dense and stable enough for 6DoF recovery on unseen objects.

Authors: We agree that additional algorithmic transparency is warranted. Although the original manuscript describes the lifting procedure in Section 3, the revised version adds explicit pseudocode for the 2D-to-3D concept-map construction and expands the text explaining per-point vector computation and viewpoint-invariance via aggregated VLM features. These changes make the density and stability of the resulting correspondences easier to verify. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents ConceptPose as a training-free method that extracts concept vectors from an external pretrained VLM's saliency maps and then applies standard 3D-3D correspondence solvers for 6DoF pose. No equations or steps reduce the reported pose accuracy to quantities that are fitted or defined inside the paper itself; the central mechanism relies on the external VLM's zero-shot properties and off-the-shelf correspondence techniques rather than any self-referential construction, fitted input renamed as prediction, or load-bearing self-citation chain. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes a pretrained VLM produces concept vectors whose 3D projections yield reliable correspondences; no free parameters are mentioned in the abstract, but the saliency-to-vector mapping and correspondence threshold are implicit choices.

axioms (1)

domain assumption VLM saliency maps can be lifted to stable 3D concept vectors that support accurate 3D-3D matching across views
Central to the 6DoF recovery step described in the abstract

pith-pipeline@v0.9.0 · 5478 in / 1247 out tokens · 23376 ms · 2026-05-16T23:40:14.530246+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
cs.CV 2026-03 unverdicted novelty 6.0

TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis

Eduardo Arnold, Omar Y . Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis. A survey on 3d object detection methods for autonomous driving applica- tions.IEEE Transactions on Intelligent Transportation Sys- tems, 20(10):3782–3795, 2019. 1

work page 2019
[2]

Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22025– 22035, 2025. 3

work page 2025
[3]

GS-Pose: Generalizable Segmentation-Based 6D Object Pose Estima- tion with 3D Gaussian Splatting

Dingding Cai, Janne Heikkila, and Esa Rahtu. GS-Pose: Generalizable Segmentation-Based 6D Object Pose Estima- tion with 3D Gaussian Splatting . In2025 International Con- ference on 3D Vision (3DV), pages 1001–1011, Los Alami- tos, CA, USA, 2025. IEEE Computer Society. 2

work page 2025
[4]

Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models

Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models. InEuropean Conference on Computer Vision (ECCV), 2024. 2

work page 2024
[5]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 1, 2

work page 2021
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Open-vocabulary object 6d pose estimation.IEEE/CVF Computer Vision and Pattern Recog- nition Conference (CVPR), 2024

Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cav- allaro, and Fabio Poiesi. Open-vocabulary object 6d pose estimation.IEEE/CVF Computer Vision and Pattern Recog- nition Conference (CVPR), 2024. 1, 2, 3, 5, 6

work page 2024
[8]

High-resolution open-vocabulary object 6d pose estimation.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025

Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Chang- jae Oh, Andrea Cavallaro, and Fabio Poiesi. High-resolution open-vocabulary object 6d pose estimation.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025. 1, 2, 3, 6

work page 2025
[9]

Open- vocabulary universal image segmentation with MaskCLIP

Zheng Ding, Jieke Wang, and Zhuowen Tu. Open- vocabulary universal image segmentation with MaskCLIP. InProceedings of the 40th International Conference on Ma- chine Learning, pages 8090–8102. PMLR, 2023. 3

work page 2023
[10]

Hugo Durchon, Marius Preda, and Titus B. Zaharia. 6d pose estimation of unseen objects for industrial augmented real- ity.2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP), pages 1– 8, 2024. 1

work page 2024
[11]

POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference. InCVPRW, pages 2692–2701, 2024. 2, 3

work page 2024
[12]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. 4

work page 1981
[13]

A survey of 6d object detection based on 3d models for in- dustrial applications.Journal of Imaging, 8(3), 2022

Felix Gorschl ¨uter, Pavel Rojtberg, and Thomas P ¨ollabauer. A survey of 6d object detection based on 3d models for in- dustrial applications.Journal of Imaging, 8(3), 2022. 1

work page 2022
[14]

Object- Match: Robust Registration using Canonical Object Corre- spondences

Can G ¨umeli, Angela Dai, and Matthias Nießner. Object- Match: Robust Registration using Canonical Object Corre- spondences. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13082– 13091, 2023. 2, 6

work page 2023
[15]

OnePose++: Keypoint-Free One- Shot Object Pose Estimation without CAD Models

Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-Free One- Shot Object Pose Estimation without CAD Models. InAd- vances in Neural Information Processing Systems, pages 35103–35115, 2022. 2

work page 2022
[16]

FS6D: Few-Shot 6D Pose Estimation of Novel Ob- jects

Yisheng He, Yao Wang, Haoqiang Fan, Qifeng Chen, and Jian Sun. FS6D: Few-Shot 6D Pose Estimation of Novel Ob- jects. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6814–6824,

work page
[17]

Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes

Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Ste- fan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes. In Computer Vision – ACCV 2012, pages 548–562. Springer,

work page 2012
[18]

BOP: Benchmark for 6D object pose esti- mation.European Conference on Computer Vision (ECCV),

Tom ´aˇs Hodaˇn, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Man- hardt, Federico Tombari, Tae-Kyun Kim, Ji ˇr´ı Matas, and Carsten Rother. BOP: Benchmark for 6D object pose esti- mation.European Conference on Computer Vision (ECCV),

work page
[19]

Bop: Benchmark for 6d object pose estima- tion

Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders GlentBuch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Man- hardt, Federico Tombari, Tae-Kyun Kim, Jiri Matas, and Carsten Rother. Bop: Benchmark for 6d object pose estima- tion. InProceedings of the European Conference on Com- puter Vision (ECCV)...

work page 2018
[20]

MatchU: Matching Un- seen Objects for 6D Pose Estimation from RGB-D Images

Junwen Huang, Hao Yu, Kuan-Ting Yu, Nassir Navab, Slo- bodan Ilic, and Benjamin Busam. MatchU: Matching Un- seen Objects for 6D Pose Estimation from RGB-D Images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10095–10105, 2024. 2

work page 2024
[21]

V o, Patrick Labatut, and Piotr Bojanowski

Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha¨el Ramamonjisoa, Maxime Oquab, Oriane Sim´eoni, Huy V . V o, Patrick Labatut, and Piotr Bojanowski. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. InProceedings of the Computer Vision and...

work page 2025
[22]

Housecat6d-a large-scale multi-modal category level 6d ob- ject perception dataset with household objects in realistic scenarios

HyunJun Jung, Shun-Cheng Wu, Patrick Ruhkamp, Guangyao Zhai, Hannah Schieber, Giulia Rizzoli, Pengyuan Wang, Hongcheng Zhao, Lorenzo Garattoni, Sven Meier, Daniel Roth, Nassir Navab, and Benjamin Busam. Housecat6d-a large-scale multi-modal category level 6d ob- ject perception dataset with household objects in realistic scenarios. InProceedings of the IEE...

work page 2024
[23]

Any6D: Model-free 6d pose estimation of novel objects

Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6D: Model-free 6d pose estimation of novel objects. InProceedings of the Com- puter Vision and Pattern Recognition Conference (CVPR),

work page
[24]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 3

work page 2022
[25]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

work page 2023
[26]

UA- Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References

Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, and Cheng-Hao Kuo. UA- Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 1180–1189, 2025. 1, 2, 7, 8

work page 2025
[27]

Decap: Decoding clip latents for zero-shot captioning via text-only training

Wei Li, Linchao Zhu, Longyin Wen, and Yi Yang. Decap: Decoding clip latents for zero-shot captioning via text-only training. InThe Eleventh International Conference on Learn- ing Representations, 2023. 1

work page 2023
[28]

Gce-pose: Global context enhancement for category-level object pose estimation

Weihang Li, Hongli Xu, Junwen Huang, Hyunjun Jung, Pe- ter KT Yu, Nassir Navab, and Benjamin Busam. Gce-pose: Global context enhancement for category-level object pose estimation. InProceedings of the Computer Vision and Pat- tern Recognition Conference (CVPR), pages 27154–27165,

work page
[29]

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9559– 9568, 2024. 2, 8

work page 2024
[30]

One2Any: One-Reference 6D Pose Estimation for Any Object

Mengya Liu, Siyuan Li, Ajad Chhatkuli, Prune Truong, Luc Van Gool, and Federico Tombari. One2Any: One-Reference 6D Pose Estimation for Any Object. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6457–6467, 2025. 1, 2, 6

work page 2025
[31]

Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection. InEuropean conference on computer vision, pages 38–55. Springer, 2024. 3

work page 2024
[32]

Unopose: Un- seen object pose estimation with an unposed rgb-d reference image

Xingyu Liu, Gu Wang, Ruida Zhang, Chenyangguang Zhang, Federico Tombari, and Xiangyang Ji. Unopose: Un- seen object pose estimation with an unposed rgb-d reference image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22023–22034, 2025. 1, 2

work page 2025
[33]

Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images

Yuan Liu, Yilin Wen, Sida Peng, Cheng Lin, Xiaoxiao Long, Taku Komura, and Wenping Wang. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In European Conference on Computer Vision, pages 298–315. Springer, 2022. 2, 8

work page 2022
[34]

David G. Lowe. Distinctive Image Features from Scale- Invariant Keypoints.International Journal of Computer Vi- sion, 60(2):91–110, 2004. 2, 6

work page 2004
[35]

NOPE: Novel Object Pose Estimation from a Sin- gle Image

Van Nguyen Nguyen, Thibault Groueix, Georgy Ponimatkin, Yinlin Hu, Renaud Marlet, Mathieu Salzmann, and Vincent Lepetit. NOPE: Novel Object Pose Estimation from a Sin- gle Image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6305– 6314, 2024. 2

work page 2024
[36]

Gigapose: Fast and robust novel object pose estimation via one correspondence

Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, and Vincent Lepetit. Gigapose: Fast and robust novel object pose estimation via one correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9903–9913, 2024. 2

work page 2024
[37]

DINOv2: Learning Robust Visual Fea- tures without Supervision.Transactions on Machine Learn- ing Research, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Huijiao Xu, Herv´e J´egou, Julien Maira...

work page 2024
[38]

Found- pose: Unseen object pose estimation with foundation fea- tures

Evin Pınar ¨Ornek, Yann Labb ´e, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, and Tomas Hodan. Found- pose: Unseen object pose estimation with foundation fea- tures. InEuropean Conference on Computer Vision, pages 163–182. Springer, 2024. 2, 3

work page 2024
[39]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 3

work page 2021
[40]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv preprint arXiv:2401.14159, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra. Grad-cam: Visual explanations from deep networks via gradient-based localization.International Journal of Com- puter Vision, 128(2), 2019. 3, 1

work page 2019
[42]

Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth´ee Darcet, Th´eo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Poulami Sinhamahapatra, Franziska Schwaiger, Shirsha Bose, Huiyu Wang, Karsten Roscher, and Stephan Guen- nemann. Finding dino: A plug-and-play framework for zero-shot detection of out-of-distribution objects using pro- totypes.2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 8474–8483, 2024. 1

work page 2025
[44]

Deep multi-state object pose estimation for augmented reality assembly

Yongzhi Su, Jason Rambach, Nareg Minaskan, Paul Lesur, Alain Pagani, and Didier Stricker. Deep multi-state object pose estimation for augmented reality assembly. In2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pages 222–227, 2019. 1

work page 2019
[45]

ZebraPose: Coarse to Fine Surface Encod- ing for 6DoF Object Pose Estimation

Yongzhi Su, Mahdi Saleh, Torben Fetzer, Jason Rambach, Nassir Navab, Benjamin Busam, Didier Stricker, and Fed- erico Tombari. ZebraPose: Coarse to Fine Surface Encod- ing for 6DoF Object Pose Estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6728–6737, 2022. 2

work page 2022
[46]

Onepose: One-shot object pose estimation without cad mod- els

Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. Onepose: One-shot object pose estimation without cad mod- els. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 6825– 6834, 2022. 2

work page 2022
[47]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H ´enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual Vision- Language Encoders with Improved Semantic Understand- ing, Localization, and Dense Featu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

A point pattern matching algorithm.Sys- tems and Computers in Japan, 20(10):95–106, 1989

Shinji Umeyama. A point pattern matching algorithm.Sys- tems and Computers in Japan, 20(10):95–106, 1989. 4

work page 1989
[49]

Densefusion: 6d object pose estimation by iterative dense fusion

Chen Wang, Danfei Xu, Yuke Zhu, Roberto Mart ´ın-Mart´ın, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. 1

work page 2019
[50]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object co- ordinate space for category-level 6d object pose and size esti- mation. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2637–2646, 2019. 1, 2, 4, 8

work page 2019
[51]

Catgrasp: Learning category-level task-relevant grasping in clutter from simulation

Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. Catgrasp: Learning category-level task-relevant grasping in clutter from simulation. In2022 Interna- tional Conference on Robotics and Automation (ICRA), page 6401–6408. IEEE Press, 2022. 1

work page 2022
[52]

Foundationpose: Unified 6d pose estimation and tracking of novel objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, 2024. 1, 2, 7, 8

work page 2024
[53]

Clip-dinoiser: Teaching clip a few dino tricks

Monika Wysocza’nska, Oriane Sim´eoni, Michael Ramamon- jisoa, Andrei Bursuc, Tomasz Trzci’nski, and Patrick P’erez. Clip-dinoiser: Teaching clip a few dino tricks. InEuropean Conference on Computer Vision, 2023. 1

work page 2023
[54]

Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes

Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. InProceedings of Robotics: Science and Systems, 2018. 1, 2, 4, 8

work page 2018
[55]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 3 ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors Supplementary Material

work page 2023
[56]

Pretrained heads - Zero-shot tasks with dino.txt

Additional ablations 6.1. Backbone Ablation To demonstrate ConceptPose’s adaptability across different vision-language architectures, we evaluate five backbones on REAL275 (Table 5): three SigLIP2 [47] variants (giant- 384, large-384, base-384), CLIP ViT-L/14@336px [39], and DINOv3-L/16 [42] with the dinotxt text grounding head [21]. Our baseline method i...

work page 2048
[57]

Concepts used for REAL275 evaluation

Performance Analysis by Object Table 8. Concepts used for REAL275 evaluation. Cat. #L Concepts bottle 20 cap, lid, spout, nozzle, pump, trigger, body, base, neck, shoulder, label, handle, grip, threads, tam- per evident ring, overcap, sleeve, collar, punt, carry loop bowl 20 body, rim, bottom, foot, handle, lid, spout, ear, pedestal, inner surface, outer ...

work page
[58]

Extended Qualitative Results Figures 6 and 7 present 10 extended qualitative visualiza- tions of ConceptPose across all four evaluation datasets (REAL275, Toyota-Light, YCB-Video, LINEMOD), with saliency maps and estimated poses. It demonstrates Con- ceptPose’s ability to generate semantically meaningful con- cept activations and accurate pose estimates a...

work page