ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Pith reviewed 2026-05-16 23:40 UTC · model grok-4.3
The pith
Concept vectors from vision-language model saliency maps enable training-free 6DoF relative pose estimation by matching 3D-3D correspondences across views.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConceptPose builds open-vocabulary 3D concept maps in which every surface point receives a concept vector derived from VLM saliency responses; robust 3D-3D correspondences are then found by matching these vectors across two views, allowing direct recovery of the relative 6DoF pose. The method requires no object-specific training, no 3D models, and no dataset collection, yet reports a 62% relative gain in average ADD(-S) score over the strongest prior baseline on common zero-shot pose benchmarks.
What carries the argument
Concept vectors extracted from VLM saliency maps, which tag 3D points with semantic descriptors used to establish repeatable 3D-3D correspondences for pose solving.
If this is right
- Any object whose parts can be described in natural language becomes a candidate for zero-shot pose recovery.
- Pose estimation pipelines no longer need separate training stages or 3D mesh acquisition for each new object class.
- Relative pose can be computed directly from two RGB images using only a frozen VLM and standard correspondence solvers.
- Performance scales with improvements in the underlying vision-language model rather than with additional pose-specific data collection.
Where Pith is reading between the lines
- The same concept-vector representation may support downstream tasks such as part-based manipulation or object-centric SLAM without retraining.
- Failure cases are likely to appear on objects whose surface regions produce nearly identical saliency responses, suggesting targeted prompt engineering as a practical extension.
- Because the method never sees 3D geometry during training, it offers a route to pose estimation in domains where 3D ground truth is expensive or impossible to obtain.
Load-bearing premise
The concept vectors produced by VLM saliency maps remain sufficiently consistent across different viewpoints of the same unseen object to yield reliable 3D point matches.
What would settle it
An experiment showing that, for a set of object pairs with known ground-truth poses, the nearest-neighbor matches based on concept vectors produce pose errors larger than random guessing on more than half the pairs would falsify the central claim.
Figures
read the original abstract
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, outperforming the strongest baseline by a relative 62\% in average ADD(-S) score, including methods that utilize extensive dataset-specific training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ConceptPose, a training-free zero-shot method for 6DoF object pose estimation. It uses a pretrained VLM to generate open-vocabulary 3D concept maps by tagging points with concept vectors derived from 2D saliency maps, then solves for relative pose via 3D-3D correspondences and standard solvers. The central claim is that this achieves state-of-the-art results on common zero-shot relative pose benchmarks, outperforming the strongest baseline by 62% relative in average ADD(-S), including methods that use extensive dataset-specific training.
Significance. If the central claim holds with robust evidence, the result would be significant: it would show that VLM saliency-derived concept vectors can produce sufficiently repeatable 3D correspondences for accurate pose recovery on unseen objects, eliminating the need for object- or dataset-specific training and outperforming supervised baselines. This could shift practice in robotics and vision toward purely zero-shot pipelines.
major comments (3)
- [Abstract] Abstract: the claim of a 62% relative ADD(-S) improvement and SOTA performance is presented without naming the specific benchmarks, the full list of baselines (zero-shot vs. trained), absolute scores, or any ablation isolating the contribution of the concept vectors versus the correspondence solver or VLM backbone.
- [Abstract] The manuscript provides no quantitative analysis of 3D-3D correspondence quality (inlier rates, repeatability across viewpoint changes, or failure modes on textureless/symmetric objects), which is load-bearing for attributing the reported pose accuracy to the proposed concept-vector mechanism rather than the external VLM or solver.
- [Abstract] The lifting of 2D VLM saliency maps into 3D concept maps (including how per-point concept vectors are computed and made viewpoint-invariant) lacks algorithmic detail or pseudocode, preventing verification that the resulting correspondences are dense and stable enough for 6DoF recovery on unseen objects.
minor comments (2)
- [Abstract] The term 'model-free' in the abstract is potentially misleading given the reliance on a pretrained VLM; clarify the intended meaning.
- Add references to prior VLM-based correspondence or zero-shot pose work to better situate the novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the comments identify areas for improved clarity or additional evidence, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of a 62% relative ADD(-S) improvement and SOTA performance is presented without naming the specific benchmarks, the full list of baselines (zero-shot vs. trained), absolute scores, or any ablation isolating the contribution of the concept vectors versus the correspondence solver or VLM backbone.
Authors: We agree that the abstract would be strengthened by greater specificity. In the revised manuscript we have expanded the abstract to name the evaluation benchmarks, distinguish zero-shot from trained baselines, report absolute ADD(-S) scores in addition to the relative improvement, and briefly reference the ablations that isolate the concept-vector contribution. Full ablation tables and baseline lists remain in the main text to respect abstract length constraints. revision: yes
-
Referee: [Abstract] The manuscript provides no quantitative analysis of 3D-3D correspondence quality (inlier rates, repeatability across viewpoint changes, or failure modes on textureless/symmetric objects), which is load-bearing for attributing the reported pose accuracy to the proposed concept-vector mechanism rather than the external VLM or solver.
Authors: We acknowledge that explicit correspondence-quality metrics would strengthen attribution of the performance gains. The revised manuscript now includes a dedicated quantitative analysis section reporting inlier rates, viewpoint-repeatability scores, and failure-mode breakdowns on textureless and symmetric objects. These results are presented alongside the end-to-end pose metrics to isolate the contribution of the concept vectors. revision: yes
-
Referee: [Abstract] The lifting of 2D VLM saliency maps into 3D concept maps (including how per-point concept vectors are computed and made viewpoint-invariant) lacks algorithmic detail or pseudocode, preventing verification that the resulting correspondences are dense and stable enough for 6DoF recovery on unseen objects.
Authors: We agree that additional algorithmic transparency is warranted. Although the original manuscript describes the lifting procedure in Section 3, the revised version adds explicit pseudocode for the 2D-to-3D concept-map construction and expands the text explaining per-point vector computation and viewpoint-invariance via aggregated VLM features. These changes make the density and stability of the resulting correspondences easier to verify. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper presents ConceptPose as a training-free method that extracts concept vectors from an external pretrained VLM's saliency maps and then applies standard 3D-3D correspondence solvers for 6DoF pose. No equations or steps reduce the reported pose accuracy to quantities that are fitted or defined inside the paper itself; the central mechanism relies on the external VLM's zero-shot properties and off-the-shelf correspondence techniques rather than any self-referential construction, fitted input renamed as prediction, or load-bearing self-citation chain. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption VLM saliency maps can be lifted to stable 3D concept vectors that support accurate 3D-3D matching across views
Forward citations
Cited by 1 Pith paper
-
TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.
Reference graph
Works this paper leans on
-
[1]
Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis
Eduardo Arnold, Omar Y . Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis. A survey on 3d object detection methods for autonomous driving applica- tions.IEEE Transactions on Intelligent Transportation Sys- tems, 20(10):3782–3795, 2019. 1
work page 2019
-
[2]
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22025– 22035, 2025. 3
work page 2025
-
[3]
GS-Pose: Generalizable Segmentation-Based 6D Object Pose Estima- tion with 3D Gaussian Splatting
Dingding Cai, Janne Heikkila, and Esa Rahtu. GS-Pose: Generalizable Segmentation-Based 6D Object Pose Estima- tion with 3D Gaussian Splatting . In2025 International Con- ference on 3D Vision (3DV), pages 1001–1011, Los Alami- tos, CA, USA, 2025. IEEE Computer Society. 2
work page 2025
-
[4]
Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models
Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models. InEuropean Conference on Computer Vision (ECCV), 2024. 2
work page 2024
-
[5]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 1, 2
work page 2021
-
[6]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cav- allaro, and Fabio Poiesi. Open-vocabulary object 6d pose estimation.IEEE/CVF Computer Vision and Pattern Recog- nition Conference (CVPR), 2024. 1, 2, 3, 5, 6
work page 2024
-
[8]
Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Chang- jae Oh, Andrea Cavallaro, and Fabio Poiesi. High-resolution open-vocabulary object 6d pose estimation.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025. 1, 2, 3, 6
work page 2025
-
[9]
Open- vocabulary universal image segmentation with MaskCLIP
Zheng Ding, Jieke Wang, and Zhuowen Tu. Open- vocabulary universal image segmentation with MaskCLIP. InProceedings of the 40th International Conference on Ma- chine Learning, pages 8090–8102. PMLR, 2023. 3
work page 2023
-
[10]
Hugo Durchon, Marius Preda, and Titus B. Zaharia. 6d pose estimation of unseen objects for industrial augmented real- ity.2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP), pages 1– 8, 2024. 1
work page 2024
-
[11]
POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference
Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference. InCVPRW, pages 2692–2701, 2024. 2, 3
work page 2024
-
[12]
Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. 4
work page 1981
-
[13]
Felix Gorschl ¨uter, Pavel Rojtberg, and Thomas P ¨ollabauer. A survey of 6d object detection based on 3d models for in- dustrial applications.Journal of Imaging, 8(3), 2022. 1
work page 2022
-
[14]
Object- Match: Robust Registration using Canonical Object Corre- spondences
Can G ¨umeli, Angela Dai, and Matthias Nießner. Object- Match: Robust Registration using Canonical Object Corre- spondences. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13082– 13091, 2023. 2, 6
work page 2023
-
[15]
OnePose++: Keypoint-Free One- Shot Object Pose Estimation without CAD Models
Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-Free One- Shot Object Pose Estimation without CAD Models. InAd- vances in Neural Information Processing Systems, pages 35103–35115, 2022. 2
work page 2022
-
[16]
FS6D: Few-Shot 6D Pose Estimation of Novel Ob- jects
Yisheng He, Yao Wang, Haoqiang Fan, Qifeng Chen, and Jian Sun. FS6D: Few-Shot 6D Pose Estimation of Novel Ob- jects. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6814–6824,
-
[17]
Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Ste- fan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes. In Computer Vision – ACCV 2012, pages 548–562. Springer,
work page 2012
-
[18]
BOP: Benchmark for 6D object pose esti- mation.European Conference on Computer Vision (ECCV),
Tom ´aˇs Hodaˇn, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Man- hardt, Federico Tombari, Tae-Kyun Kim, Ji ˇr´ı Matas, and Carsten Rother. BOP: Benchmark for 6D object pose esti- mation.European Conference on Computer Vision (ECCV),
-
[19]
Bop: Benchmark for 6d object pose estima- tion
Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders GlentBuch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Man- hardt, Federico Tombari, Tae-Kyun Kim, Jiri Matas, and Carsten Rother. Bop: Benchmark for 6d object pose estima- tion. InProceedings of the European Conference on Com- puter Vision (ECCV)...
work page 2018
-
[20]
MatchU: Matching Un- seen Objects for 6D Pose Estimation from RGB-D Images
Junwen Huang, Hao Yu, Kuan-Ting Yu, Nassir Navab, Slo- bodan Ilic, and Benjamin Busam. MatchU: Matching Un- seen Objects for 6D Pose Estimation from RGB-D Images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10095–10105, 2024. 2
work page 2024
-
[21]
V o, Patrick Labatut, and Piotr Bojanowski
Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha¨el Ramamonjisoa, Maxime Oquab, Oriane Sim´eoni, Huy V . V o, Patrick Labatut, and Piotr Bojanowski. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. InProceedings of the Computer Vision and...
work page 2025
-
[22]
HyunJun Jung, Shun-Cheng Wu, Patrick Ruhkamp, Guangyao Zhai, Hannah Schieber, Giulia Rizzoli, Pengyuan Wang, Hongcheng Zhao, Lorenzo Garattoni, Sven Meier, Daniel Roth, Nassir Navab, and Benjamin Busam. Housecat6d-a large-scale multi-modal category level 6d ob- ject perception dataset with household objects in realistic scenarios. InProceedings of the IEE...
work page 2024
-
[23]
Any6D: Model-free 6d pose estimation of novel objects
Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6D: Model-free 6d pose estimation of novel objects. InProceedings of the Com- puter Vision and Pattern Recognition Conference (CVPR),
-
[24]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 3
work page 2022
-
[25]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3
work page 2023
-
[26]
Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, and Cheng-Hao Kuo. UA- Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 1180–1189, 2025. 1, 2, 7, 8
work page 2025
-
[27]
Decap: Decoding clip latents for zero-shot captioning via text-only training
Wei Li, Linchao Zhu, Longyin Wen, and Yi Yang. Decap: Decoding clip latents for zero-shot captioning via text-only training. InThe Eleventh International Conference on Learn- ing Representations, 2023. 1
work page 2023
-
[28]
Gce-pose: Global context enhancement for category-level object pose estimation
Weihang Li, Hongli Xu, Junwen Huang, Hyunjun Jung, Pe- ter KT Yu, Nassir Navab, and Benjamin Busam. Gce-pose: Global context enhancement for category-level object pose estimation. InProceedings of the Computer Vision and Pat- tern Recognition Conference (CVPR), pages 27154–27165,
-
[29]
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9559– 9568, 2024. 2, 8
work page 2024
-
[30]
One2Any: One-Reference 6D Pose Estimation for Any Object
Mengya Liu, Siyuan Li, Ajad Chhatkuli, Prune Truong, Luc Van Gool, and Federico Tombari. One2Any: One-Reference 6D Pose Estimation for Any Object. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6457–6467, 2025. 1, 2, 6
work page 2025
-
[31]
Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection. InEuropean conference on computer vision, pages 38–55. Springer, 2024. 3
work page 2024
-
[32]
Unopose: Un- seen object pose estimation with an unposed rgb-d reference image
Xingyu Liu, Gu Wang, Ruida Zhang, Chenyangguang Zhang, Federico Tombari, and Xiangyang Ji. Unopose: Un- seen object pose estimation with an unposed rgb-d reference image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22023–22034, 2025. 1, 2
work page 2025
-
[33]
Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images
Yuan Liu, Yilin Wen, Sida Peng, Cheng Lin, Xiaoxiao Long, Taku Komura, and Wenping Wang. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In European Conference on Computer Vision, pages 298–315. Springer, 2022. 2, 8
work page 2022
-
[34]
David G. Lowe. Distinctive Image Features from Scale- Invariant Keypoints.International Journal of Computer Vi- sion, 60(2):91–110, 2004. 2, 6
work page 2004
-
[35]
NOPE: Novel Object Pose Estimation from a Sin- gle Image
Van Nguyen Nguyen, Thibault Groueix, Georgy Ponimatkin, Yinlin Hu, Renaud Marlet, Mathieu Salzmann, and Vincent Lepetit. NOPE: Novel Object Pose Estimation from a Sin- gle Image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6305– 6314, 2024. 2
work page 2024
-
[36]
Gigapose: Fast and robust novel object pose estimation via one correspondence
Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, and Vincent Lepetit. Gigapose: Fast and robust novel object pose estimation via one correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9903–9913, 2024. 2
work page 2024
-
[37]
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Huijiao Xu, Herv´e J´egou, Julien Maira...
work page 2024
-
[38]
Found- pose: Unseen object pose estimation with foundation fea- tures
Evin Pınar ¨Ornek, Yann Labb ´e, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, and Tomas Hodan. Found- pose: Unseen object pose estimation with foundation fea- tures. InEuropean Conference on Computer Vision, pages 163–182. Springer, 2024. 2, 3
work page 2024
-
[39]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 3
work page 2021
-
[40]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv preprint arXiv:2401.14159, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra. Grad-cam: Visual explanations from deep networks via gradient-based localization.International Journal of Com- puter Vision, 128(2), 2019. 3, 1
work page 2019
-
[42]
Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth´ee Darcet, Th´eo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Poulami Sinhamahapatra, Franziska Schwaiger, Shirsha Bose, Huiyu Wang, Karsten Roscher, and Stephan Guen- nemann. Finding dino: A plug-and-play framework for zero-shot detection of out-of-distribution objects using pro- totypes.2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 8474–8483, 2024. 1
work page 2025
-
[44]
Deep multi-state object pose estimation for augmented reality assembly
Yongzhi Su, Jason Rambach, Nareg Minaskan, Paul Lesur, Alain Pagani, and Didier Stricker. Deep multi-state object pose estimation for augmented reality assembly. In2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pages 222–227, 2019. 1
work page 2019
-
[45]
ZebraPose: Coarse to Fine Surface Encod- ing for 6DoF Object Pose Estimation
Yongzhi Su, Mahdi Saleh, Torben Fetzer, Jason Rambach, Nassir Navab, Benjamin Busam, Didier Stricker, and Fed- erico Tombari. ZebraPose: Coarse to Fine Surface Encod- ing for 6DoF Object Pose Estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6728–6737, 2022. 2
work page 2022
-
[46]
Onepose: One-shot object pose estimation without cad mod- els
Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. Onepose: One-shot object pose estimation without cad mod- els. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 6825– 6834, 2022. 2
work page 2022
-
[47]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H ´enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual Vision- Language Encoders with Improved Semantic Understand- ing, Localization, and Dense Featu...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
A point pattern matching algorithm.Sys- tems and Computers in Japan, 20(10):95–106, 1989
Shinji Umeyama. A point pattern matching algorithm.Sys- tems and Computers in Japan, 20(10):95–106, 1989. 4
work page 1989
-
[49]
Densefusion: 6d object pose estimation by iterative dense fusion
Chen Wang, Danfei Xu, Yuke Zhu, Roberto Mart ´ın-Mart´ın, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. 1
work page 2019
-
[50]
He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object co- ordinate space for category-level 6d object pose and size esti- mation. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2637–2646, 2019. 1, 2, 4, 8
work page 2019
-
[51]
Catgrasp: Learning category-level task-relevant grasping in clutter from simulation
Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. Catgrasp: Learning category-level task-relevant grasping in clutter from simulation. In2022 Interna- tional Conference on Robotics and Automation (ICRA), page 6401–6408. IEEE Press, 2022. 1
work page 2022
-
[52]
Foundationpose: Unified 6d pose estimation and tracking of novel objects
Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, 2024. 1, 2, 7, 8
work page 2024
-
[53]
Clip-dinoiser: Teaching clip a few dino tricks
Monika Wysocza’nska, Oriane Sim´eoni, Michael Ramamon- jisoa, Andrei Bursuc, Tomasz Trzci’nski, and Patrick P’erez. Clip-dinoiser: Teaching clip a few dino tricks. InEuropean Conference on Computer Vision, 2023. 1
work page 2023
-
[54]
Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes
Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. InProceedings of Robotics: Science and Systems, 2018. 1, 2, 4, 8
work page 2018
-
[55]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 3 ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors Supplementary Material
work page 2023
-
[56]
Pretrained heads - Zero-shot tasks with dino.txt
Additional ablations 6.1. Backbone Ablation To demonstrate ConceptPose’s adaptability across different vision-language architectures, we evaluate five backbones on REAL275 (Table 5): three SigLIP2 [47] variants (giant- 384, large-384, base-384), CLIP ViT-L/14@336px [39], and DINOv3-L/16 [42] with the dinotxt text grounding head [21]. Our baseline method i...
work page 2048
-
[57]
Concepts used for REAL275 evaluation
Performance Analysis by Object Table 8. Concepts used for REAL275 evaluation. Cat. #L Concepts bottle 20 cap, lid, spout, nozzle, pump, trigger, body, base, neck, shoulder, label, handle, grip, threads, tam- per evident ring, overcap, sleeve, collar, punt, carry loop bowl 20 body, rim, bottom, foot, handle, lid, spout, ear, pedestal, inner surface, outer ...
-
[58]
Extended Qualitative Results Figures 6 and 7 present 10 extended qualitative visualiza- tions of ConceptPose across all four evaluation datasets (REAL275, Toyota-Light, YCB-Video, LINEMOD), with saliency maps and estimated poses. It demonstrates Con- ceptPose’s ability to generate semantically meaningful con- cept activations and accurate pose estimates a...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.