pith. machine review for the scientific record. sign in

arxiv: 2512.09056 · v3 · submitted 2025-12-09 · 💻 cs.CV

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Pith reviewed 2026-05-16 23:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords object pose estimationzero-shot learningvision-language models6DoF posetraining-freeconcept vectorsrelative posesaliency maps
0
0 comments X

The pith

Concept vectors from vision-language model saliency maps enable training-free 6DoF relative pose estimation by matching 3D-3D correspondences across views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ConceptPose as a way to recover the 6DoF pose between two views of an object without any training on that object or on a dataset of similar objects. It extracts concept vectors at each 3D point by querying a vision-language model with saliency maps, then uses those vectors to identify matching points between the two views and solve for the rigid transformation. This produces accurate pose estimates on standard zero-shot benchmarks and outperforms the best competing approach, even when the competitor uses extensive dataset-specific training. The result matters because it shows that general-purpose language-vision models already contain enough geometric structure to support precise 3D alignment without per-task fine-tuning or 3D model supervision.

Core claim

ConceptPose builds open-vocabulary 3D concept maps in which every surface point receives a concept vector derived from VLM saliency responses; robust 3D-3D correspondences are then found by matching these vectors across two views, allowing direct recovery of the relative 6DoF pose. The method requires no object-specific training, no 3D models, and no dataset collection, yet reports a 62% relative gain in average ADD(-S) score over the strongest prior baseline on common zero-shot pose benchmarks.

What carries the argument

Concept vectors extracted from VLM saliency maps, which tag 3D points with semantic descriptors used to establish repeatable 3D-3D correspondences for pose solving.

If this is right

  • Any object whose parts can be described in natural language becomes a candidate for zero-shot pose recovery.
  • Pose estimation pipelines no longer need separate training stages or 3D mesh acquisition for each new object class.
  • Relative pose can be computed directly from two RGB images using only a frozen VLM and standard correspondence solvers.
  • Performance scales with improvements in the underlying vision-language model rather than with additional pose-specific data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same concept-vector representation may support downstream tasks such as part-based manipulation or object-centric SLAM without retraining.
  • Failure cases are likely to appear on objects whose surface regions produce nearly identical saliency responses, suggesting targeted prompt engineering as a practical extension.
  • Because the method never sees 3D geometry during training, it offers a route to pose estimation in domains where 3D ground truth is expensive or impossible to obtain.

Load-bearing premise

The concept vectors produced by VLM saliency maps remain sufficiently consistent across different viewpoints of the same unseen object to yield reliable 3D point matches.

What would settle it

An experiment showing that, for a set of object pairs with known ground-truth poses, the nearest-neighbor matches based on concept vectors produce pose errors larger than random guessing on more than half the pairs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.09056 by Benjamin Busam, Danda Pani Paudel, Jan-Nico Zaech, Liming Kuang, Mahdi Saleh, Yordanka Velikova.

Figure 1
Figure 1. Figure 1: From language concepts to 6D Pose: ConceptPose uses language-driven concepts to create 3D concept maps and match them across views for training-free 6D pose estimation. To overcome this rigidity, the community has increas￾ingly focused on category-level [22, 50] and zero shot rela￾tive pose estimation [7, 23, 26, 30, 32], which aim to handle previously unseen objects. In parallel, a powerful new trend has … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the ConceptPose pipeline for zero shot relative pose estimation. Given an anchor-query RGB-D pair and category [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results of ConceptPose’s zero shot relative pose estimation on REAL275, Toyota-Light, and YCB-Video. For each [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study on concepts quantity on TYOL dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative visualization of performance changes across [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results (1/2), estimated query pose, ground truth anchor/query pose [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results (2/2), estimated query pose, ground truth anchor/query pose [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
read the original abstract

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, outperforming the strongest baseline by a relative 62\% in average ADD(-S) score, including methods that utilize extensive dataset-specific training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ConceptPose, a training-free zero-shot method for 6DoF object pose estimation. It uses a pretrained VLM to generate open-vocabulary 3D concept maps by tagging points with concept vectors derived from 2D saliency maps, then solves for relative pose via 3D-3D correspondences and standard solvers. The central claim is that this achieves state-of-the-art results on common zero-shot relative pose benchmarks, outperforming the strongest baseline by 62% relative in average ADD(-S), including methods that use extensive dataset-specific training.

Significance. If the central claim holds with robust evidence, the result would be significant: it would show that VLM saliency-derived concept vectors can produce sufficiently repeatable 3D correspondences for accurate pose recovery on unseen objects, eliminating the need for object- or dataset-specific training and outperforming supervised baselines. This could shift practice in robotics and vision toward purely zero-shot pipelines.

major comments (3)
  1. [Abstract] Abstract: the claim of a 62% relative ADD(-S) improvement and SOTA performance is presented without naming the specific benchmarks, the full list of baselines (zero-shot vs. trained), absolute scores, or any ablation isolating the contribution of the concept vectors versus the correspondence solver or VLM backbone.
  2. [Abstract] The manuscript provides no quantitative analysis of 3D-3D correspondence quality (inlier rates, repeatability across viewpoint changes, or failure modes on textureless/symmetric objects), which is load-bearing for attributing the reported pose accuracy to the proposed concept-vector mechanism rather than the external VLM or solver.
  3. [Abstract] The lifting of 2D VLM saliency maps into 3D concept maps (including how per-point concept vectors are computed and made viewpoint-invariant) lacks algorithmic detail or pseudocode, preventing verification that the resulting correspondences are dense and stable enough for 6DoF recovery on unseen objects.
minor comments (2)
  1. [Abstract] The term 'model-free' in the abstract is potentially misleading given the reliance on a pretrained VLM; clarify the intended meaning.
  2. Add references to prior VLM-based correspondence or zero-shot pose work to better situate the novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the comments identify areas for improved clarity or additional evidence, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of a 62% relative ADD(-S) improvement and SOTA performance is presented without naming the specific benchmarks, the full list of baselines (zero-shot vs. trained), absolute scores, or any ablation isolating the contribution of the concept vectors versus the correspondence solver or VLM backbone.

    Authors: We agree that the abstract would be strengthened by greater specificity. In the revised manuscript we have expanded the abstract to name the evaluation benchmarks, distinguish zero-shot from trained baselines, report absolute ADD(-S) scores in addition to the relative improvement, and briefly reference the ablations that isolate the concept-vector contribution. Full ablation tables and baseline lists remain in the main text to respect abstract length constraints. revision: yes

  2. Referee: [Abstract] The manuscript provides no quantitative analysis of 3D-3D correspondence quality (inlier rates, repeatability across viewpoint changes, or failure modes on textureless/symmetric objects), which is load-bearing for attributing the reported pose accuracy to the proposed concept-vector mechanism rather than the external VLM or solver.

    Authors: We acknowledge that explicit correspondence-quality metrics would strengthen attribution of the performance gains. The revised manuscript now includes a dedicated quantitative analysis section reporting inlier rates, viewpoint-repeatability scores, and failure-mode breakdowns on textureless and symmetric objects. These results are presented alongside the end-to-end pose metrics to isolate the contribution of the concept vectors. revision: yes

  3. Referee: [Abstract] The lifting of 2D VLM saliency maps into 3D concept maps (including how per-point concept vectors are computed and made viewpoint-invariant) lacks algorithmic detail or pseudocode, preventing verification that the resulting correspondences are dense and stable enough for 6DoF recovery on unseen objects.

    Authors: We agree that additional algorithmic transparency is warranted. Although the original manuscript describes the lifting procedure in Section 3, the revised version adds explicit pseudocode for the 2D-to-3D concept-map construction and expands the text explaining per-point vector computation and viewpoint-invariance via aggregated VLM features. These changes make the density and stability of the resulting correspondences easier to verify. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents ConceptPose as a training-free method that extracts concept vectors from an external pretrained VLM's saliency maps and then applies standard 3D-3D correspondence solvers for 6DoF pose. No equations or steps reduce the reported pose accuracy to quantities that are fitted or defined inside the paper itself; the central mechanism relies on the external VLM's zero-shot properties and off-the-shelf correspondence techniques rather than any self-referential construction, fitted input renamed as prediction, or load-bearing self-citation chain. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes a pretrained VLM produces concept vectors whose 3D projections yield reliable correspondences; no free parameters are mentioned in the abstract, but the saliency-to-vector mapping and correspondence threshold are implicit choices.

axioms (1)
  • domain assumption VLM saliency maps can be lifted to stable 3D concept vectors that support accurate 3D-3D matching across views
    Central to the 6DoF recovery step described in the abstract

pith-pipeline@v0.9.0 · 5478 in / 1247 out tokens · 23376 ms · 2026-05-16T23:40:14.530246+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

    cs.CV 2026-03 unverdicted novelty 6.0

    TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis

    Eduardo Arnold, Omar Y . Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis. A survey on 3d object detection methods for autonomous driving applica- tions.IEEE Transactions on Intelligent Transportation Sys- tems, 20(10):3782–3795, 2019. 1

  2. [2]

    Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation

    Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22025– 22035, 2025. 3

  3. [3]

    GS-Pose: Generalizable Segmentation-Based 6D Object Pose Estima- tion with 3D Gaussian Splatting

    Dingding Cai, Janne Heikkila, and Esa Rahtu. GS-Pose: Generalizable Segmentation-Based 6D Object Pose Estima- tion with 3D Gaussian Splatting . In2025 International Con- ference on 3D Vision (3DV), pages 1001–1011, Los Alami- tos, CA, USA, 2025. IEEE Computer Society. 2

  4. [4]

    Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models

    Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models. InEuropean Conference on Computer Vision (ECCV), 2024. 2

  5. [5]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 1, 2

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4

  7. [7]

    Open-vocabulary object 6d pose estimation.IEEE/CVF Computer Vision and Pattern Recog- nition Conference (CVPR), 2024

    Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cav- allaro, and Fabio Poiesi. Open-vocabulary object 6d pose estimation.IEEE/CVF Computer Vision and Pattern Recog- nition Conference (CVPR), 2024. 1, 2, 3, 5, 6

  8. [8]

    High-resolution open-vocabulary object 6d pose estimation.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025

    Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Chang- jae Oh, Andrea Cavallaro, and Fabio Poiesi. High-resolution open-vocabulary object 6d pose estimation.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025. 1, 2, 3, 6

  9. [9]

    Open- vocabulary universal image segmentation with MaskCLIP

    Zheng Ding, Jieke Wang, and Zhuowen Tu. Open- vocabulary universal image segmentation with MaskCLIP. InProceedings of the 40th International Conference on Ma- chine Learning, pages 8090–8102. PMLR, 2023. 3

  10. [10]

    Hugo Durchon, Marius Preda, and Titus B. Zaharia. 6d pose estimation of unseen objects for industrial augmented real- ity.2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP), pages 1– 8, 2024. 1

  11. [11]

    POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

    Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference. InCVPRW, pages 2692–2701, 2024. 2, 3

  12. [12]

    Fischler and Robert C

    Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. 4

  13. [13]

    A survey of 6d object detection based on 3d models for in- dustrial applications.Journal of Imaging, 8(3), 2022

    Felix Gorschl ¨uter, Pavel Rojtberg, and Thomas P ¨ollabauer. A survey of 6d object detection based on 3d models for in- dustrial applications.Journal of Imaging, 8(3), 2022. 1

  14. [14]

    Object- Match: Robust Registration using Canonical Object Corre- spondences

    Can G ¨umeli, Angela Dai, and Matthias Nießner. Object- Match: Robust Registration using Canonical Object Corre- spondences. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13082– 13091, 2023. 2, 6

  15. [15]

    OnePose++: Keypoint-Free One- Shot Object Pose Estimation without CAD Models

    Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-Free One- Shot Object Pose Estimation without CAD Models. InAd- vances in Neural Information Processing Systems, pages 35103–35115, 2022. 2

  16. [16]

    FS6D: Few-Shot 6D Pose Estimation of Novel Ob- jects

    Yisheng He, Yao Wang, Haoqiang Fan, Qifeng Chen, and Jian Sun. FS6D: Few-Shot 6D Pose Estimation of Novel Ob- jects. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6814–6824,

  17. [17]

    Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes

    Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Ste- fan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes. In Computer Vision – ACCV 2012, pages 548–562. Springer,

  18. [18]

    BOP: Benchmark for 6D object pose esti- mation.European Conference on Computer Vision (ECCV),

    Tom ´aˇs Hodaˇn, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Man- hardt, Federico Tombari, Tae-Kyun Kim, Ji ˇr´ı Matas, and Carsten Rother. BOP: Benchmark for 6D object pose esti- mation.European Conference on Computer Vision (ECCV),

  19. [19]

    Bop: Benchmark for 6d object pose estima- tion

    Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders GlentBuch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Man- hardt, Federico Tombari, Tae-Kyun Kim, Jiri Matas, and Carsten Rother. Bop: Benchmark for 6d object pose estima- tion. InProceedings of the European Conference on Com- puter Vision (ECCV)...

  20. [20]

    MatchU: Matching Un- seen Objects for 6D Pose Estimation from RGB-D Images

    Junwen Huang, Hao Yu, Kuan-Ting Yu, Nassir Navab, Slo- bodan Ilic, and Benjamin Busam. MatchU: Matching Un- seen Objects for 6D Pose Estimation from RGB-D Images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10095–10105, 2024. 2

  21. [21]

    V o, Patrick Labatut, and Piotr Bojanowski

    Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha¨el Ramamonjisoa, Maxime Oquab, Oriane Sim´eoni, Huy V . V o, Patrick Labatut, and Piotr Bojanowski. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. InProceedings of the Computer Vision and...

  22. [22]

    Housecat6d-a large-scale multi-modal category level 6d ob- ject perception dataset with household objects in realistic scenarios

    HyunJun Jung, Shun-Cheng Wu, Patrick Ruhkamp, Guangyao Zhai, Hannah Schieber, Giulia Rizzoli, Pengyuan Wang, Hongcheng Zhao, Lorenzo Garattoni, Sven Meier, Daniel Roth, Nassir Navab, and Benjamin Busam. Housecat6d-a large-scale multi-modal category level 6d ob- ject perception dataset with household objects in realistic scenarios. InProceedings of the IEE...

  23. [23]

    Any6D: Model-free 6d pose estimation of novel objects

    Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6D: Model-free 6d pose estimation of novel objects. InProceedings of the Com- puter Vision and Pattern Recognition Conference (CVPR),

  24. [24]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 3

  25. [25]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

  26. [26]

    UA- Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References

    Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, and Cheng-Hao Kuo. UA- Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 1180–1189, 2025. 1, 2, 7, 8

  27. [27]

    Decap: Decoding clip latents for zero-shot captioning via text-only training

    Wei Li, Linchao Zhu, Longyin Wen, and Yi Yang. Decap: Decoding clip latents for zero-shot captioning via text-only training. InThe Eleventh International Conference on Learn- ing Representations, 2023. 1

  28. [28]

    Gce-pose: Global context enhancement for category-level object pose estimation

    Weihang Li, Hongli Xu, Junwen Huang, Hyunjun Jung, Pe- ter KT Yu, Nassir Navab, and Benjamin Busam. Gce-pose: Global context enhancement for category-level object pose estimation. InProceedings of the Computer Vision and Pat- tern Recognition Conference (CVPR), pages 27154–27165,

  29. [29]

    SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

    Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9559– 9568, 2024. 2, 8

  30. [30]

    One2Any: One-Reference 6D Pose Estimation for Any Object

    Mengya Liu, Siyuan Li, Ajad Chhatkuli, Prune Truong, Luc Van Gool, and Federico Tombari. One2Any: One-Reference 6D Pose Estimation for Any Object. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6457–6467, 2025. 1, 2, 6

  31. [31]

    Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection. InEuropean conference on computer vision, pages 38–55. Springer, 2024. 3

  32. [32]

    Unopose: Un- seen object pose estimation with an unposed rgb-d reference image

    Xingyu Liu, Gu Wang, Ruida Zhang, Chenyangguang Zhang, Federico Tombari, and Xiangyang Ji. Unopose: Un- seen object pose estimation with an unposed rgb-d reference image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22023–22034, 2025. 1, 2

  33. [33]

    Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images

    Yuan Liu, Yilin Wen, Sida Peng, Cheng Lin, Xiaoxiao Long, Taku Komura, and Wenping Wang. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In European Conference on Computer Vision, pages 298–315. Springer, 2022. 2, 8

  34. [34]

    David G. Lowe. Distinctive Image Features from Scale- Invariant Keypoints.International Journal of Computer Vi- sion, 60(2):91–110, 2004. 2, 6

  35. [35]

    NOPE: Novel Object Pose Estimation from a Sin- gle Image

    Van Nguyen Nguyen, Thibault Groueix, Georgy Ponimatkin, Yinlin Hu, Renaud Marlet, Mathieu Salzmann, and Vincent Lepetit. NOPE: Novel Object Pose Estimation from a Sin- gle Image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6305– 6314, 2024. 2

  36. [36]

    Gigapose: Fast and robust novel object pose estimation via one correspondence

    Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, and Vincent Lepetit. Gigapose: Fast and robust novel object pose estimation via one correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9903–9913, 2024. 2

  37. [37]

    DINOv2: Learning Robust Visual Fea- tures without Supervision.Transactions on Machine Learn- ing Research, 2024

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Huijiao Xu, Herv´e J´egou, Julien Maira...

  38. [38]

    Found- pose: Unseen object pose estimation with foundation fea- tures

    Evin Pınar ¨Ornek, Yann Labb ´e, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, and Tomas Hodan. Found- pose: Unseen object pose estimation with foundation fea- tures. InEuropean Conference on Computer Vision, pages 163–182. Springer, 2024. 2, 3

  39. [39]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 3

  40. [40]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv preprint arXiv:2401.14159, 2024. 3

  41. [41]

    Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra. Grad-cam: Visual explanations from deep networks via gradient-based localization.International Journal of Com- puter Vision, 128(2), 2019. 3, 1

  42. [42]

    Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth´ee Darcet, Th´eo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

  43. [43]

    Poulami Sinhamahapatra, Franziska Schwaiger, Shirsha Bose, Huiyu Wang, Karsten Roscher, and Stephan Guen- nemann. Finding dino: A plug-and-play framework for zero-shot detection of out-of-distribution objects using pro- totypes.2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 8474–8483, 2024. 1

  44. [44]

    Deep multi-state object pose estimation for augmented reality assembly

    Yongzhi Su, Jason Rambach, Nareg Minaskan, Paul Lesur, Alain Pagani, and Didier Stricker. Deep multi-state object pose estimation for augmented reality assembly. In2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pages 222–227, 2019. 1

  45. [45]

    ZebraPose: Coarse to Fine Surface Encod- ing for 6DoF Object Pose Estimation

    Yongzhi Su, Mahdi Saleh, Torben Fetzer, Jason Rambach, Nassir Navab, Benjamin Busam, Didier Stricker, and Fed- erico Tombari. ZebraPose: Coarse to Fine Surface Encod- ing for 6DoF Object Pose Estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6728–6737, 2022. 2

  46. [46]

    Onepose: One-shot object pose estimation without cad mod- els

    Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. Onepose: One-shot object pose estimation without cad mod- els. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 6825– 6834, 2022. 2

  47. [47]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H ´enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual Vision- Language Encoders with Improved Semantic Understand- ing, Localization, and Dense Featu...

  48. [48]

    A point pattern matching algorithm.Sys- tems and Computers in Japan, 20(10):95–106, 1989

    Shinji Umeyama. A point pattern matching algorithm.Sys- tems and Computers in Japan, 20(10):95–106, 1989. 4

  49. [49]

    Densefusion: 6d object pose estimation by iterative dense fusion

    Chen Wang, Danfei Xu, Yuke Zhu, Roberto Mart ´ın-Mart´ın, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. 1

  50. [50]

    He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object co- ordinate space for category-level 6d object pose and size esti- mation. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2637–2646, 2019. 1, 2, 4, 8

  51. [51]

    Catgrasp: Learning category-level task-relevant grasping in clutter from simulation

    Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. Catgrasp: Learning category-level task-relevant grasping in clutter from simulation. In2022 Interna- tional Conference on Robotics and Automation (ICRA), page 6401–6408. IEEE Press, 2022. 1

  52. [52]

    Foundationpose: Unified 6d pose estimation and tracking of novel objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, 2024. 1, 2, 7, 8

  53. [53]

    Clip-dinoiser: Teaching clip a few dino tricks

    Monika Wysocza’nska, Oriane Sim´eoni, Michael Ramamon- jisoa, Andrei Bursuc, Tomasz Trzci’nski, and Patrick P’erez. Clip-dinoiser: Teaching clip a few dino tricks. InEuropean Conference on Computer Vision, 2023. 1

  54. [54]

    Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes

    Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. InProceedings of Robotics: Science and Systems, 2018. 1, 2, 4, 8

  55. [55]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 3 ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors Supplementary Material

  56. [56]

    Pretrained heads - Zero-shot tasks with dino.txt

    Additional ablations 6.1. Backbone Ablation To demonstrate ConceptPose’s adaptability across different vision-language architectures, we evaluate five backbones on REAL275 (Table 5): three SigLIP2 [47] variants (giant- 384, large-384, base-384), CLIP ViT-L/14@336px [39], and DINOv3-L/16 [42] with the dinotxt text grounding head [21]. Our baseline method i...

  57. [57]

    Concepts used for REAL275 evaluation

    Performance Analysis by Object Table 8. Concepts used for REAL275 evaluation. Cat. #L Concepts bottle 20 cap, lid, spout, nozzle, pump, trigger, body, base, neck, shoulder, label, handle, grip, threads, tam- per evident ring, overcap, sleeve, collar, punt, carry loop bowl 20 body, rim, bottom, foot, handle, lid, spout, ear, pedestal, inner surface, outer ...

  58. [58]

    Extended Qualitative Results Figures 6 and 7 present 10 extended qualitative visualiza- tions of ConceptPose across all four evaluation datasets (REAL275, Toyota-Light, YCB-Video, LINEMOD), with saliency maps and estimated poses. It demonstrates Con- ceptPose’s ability to generate semantically meaningful con- cept activations and accurate pose estimates a...