STORM: Segment, Track, and Object Re-Localization from a Single Image

Hikaru Shindo; Jiahong Xue; Kristian Kersting; Quentin Delfosse; Teng Cao; Yu Deng

arxiv: 2511.09771 · v3 · pith:E2DKBESMnew · submitted 2025-11-12 · 💻 cs.CV

STORM: Segment, Track, and Object Re-Localization from a Single Image

Yu Deng , Teng Cao , Hikaru Shindo , Quentin Delfosse , Jiahong Xue , Kristian Kersting This is my paper

Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords 6D pose estimationobject trackingre-localizationhierarchical attentiondrift detectionreference-conditioned trackingcomputer visionpose tracking

0 comments

The pith

STORM tracks 6D object poses from one reference image by fusing features hierarchically and verifying drift for automatic re-initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STORM, a framework for reference-conditioned 6D tracking that works from a single reference image with minimal manual input. It combines Hierarchical Spatial Fusion Attention for reference-query fusion, which handles single or multi-reference cases and optional vision-language conditioning, with a BCE-trained verifier that uses a compatibility score to detect drift and trigger re-initialization. This addresses brittle performance in existing methods that rely on CAD models or per-object adaptation and fail under occlusion or fast motion. A sympathetic reader would care because it aims to make physical AI systems like robots more reliable in dynamic real-world settings without heavy labor. If correct, the approach reduces the need for extensive annotations while improving recovery from challenging conditions.

Core claim

STORM performs segmentation, tracking, and object re-localization for accurate 6D pose estimation from a single reference image. It relies on Hierarchical Spatial Fusion Attention (HSFA) as a task-driven fusion architecture that supports single-reference and multi-reference conditioning with optional vision-language semantics, paired with a BCE-trained tracking verifier whose continuous compatibility logit serves as an energy-like score to detect drift and initiate automatic re-initialization, resulting in improved annotation-free tracking accuracy on LM-O and YCB-Video while recovering reliably from severe occlusions and rapid viewpoint changes.

What carries the argument

Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports single and multi-reference conditioning, together with the BCE-trained tracking verifier that supplies a continuous compatibility logit as an energy-like score for drift detection and re-initialization.

If this is right

Enables annotation-free 6D pose tracking without CAD models or per-object adaptation.
Provides reliable recovery from severe occlusions and rapid viewpoint changes.
Supports both single-reference and multi-reference conditioning.
Allows optional vision-language conditioning to resolve instance ambiguities.
Adds minimal overhead for automatic re-initialization upon detected drift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The verifier's energy-like score could be adapted for uncertainty estimation in related vision tasks such as segmentation or detection.
The single-image conditioning might scale to video streams for continuous tracking in robotics without repeated manual references.
Integration with broader scene understanding could help handle multiple objects or cluttered environments more effectively.
Testing the fusion attention on non-rigid or deformable objects would reveal if the current design extends beyond rigid pose estimation.

Load-bearing premise

The Hierarchical Spatial Fusion Attention architecture and the BCE-trained verifier will maintain performance and generalization outside the two evaluated datasets and under the full range of real-world lighting, texture, and motion variations.

What would settle it

A demonstration on a new dataset or under unseen lighting, textures, or extreme motion conditions where STORM's tracking accuracy drops below strong baselines or fails to recover from occlusions would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2511.09771 by Hikaru Shindo, Jiahong Xue, Kristian Kersting, Quentin Delfosse, Teng Cao, Yu Deng.

**Figure 1.** Figure 1: Pose-estimation models lack robustness, exemplified with FoundationPose (Wen et al. 2024), that fails to detect a mug under camera pose variation, highlighting its sensitivity to viewpoint shifts. reliance on extensive reference information as model input, creating a bottleneck for practical deployment where systems must handle diverse, previously unseen objects. Recent advances partially address these l… view at source ↗

**Figure 2.** Figure 2: Overview of STORM, which is composed of two subsystems: the Segmenting Object Module (SOM) and the Tracking Object Module (TOM). SOM leverages reference images to generate a 3D model and, using their semantic and spatial information, integrates both intra-image and inter-image attention modules to capture spatial cues of the query frame through local and global attention blocks, producing a segmented mask.… view at source ↗

**Figure 3.** Figure 3: STORM (SOM+TOM) achieves robust pose estimation for occluded objects in complex scenes. We compare pose-estimation qualities on the LMO and YCB-V datasets, which comprise complex scenes with multiple and possibly occluded objects. As baselines, CNOS and GroundTruth are used to predict the segmentation mask, and FoundationPose was used to produce the pose estimation. The results indicate that our method pr… view at source ↗

**Figure 4.** Figure 4: Comparison of 3D models reconstructed from reference images and ground-truth 3D models. The Aligned SAM3D models almost perfectly recover the underlying object structure while producing smoother and more regular contours along object boundaries. Method ADD ADD-S AR Time(ms) STORM (TOM) 74.64 88.56 67.85 98±5 FP Tracking 52.76 66.76 50.09 84±3 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 6.** Figure 6: Under the same experimental conditions, we com [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: To provide a comprehensive comparison, we evaluated the training loss, training EMA loss, and test set AP for [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: We conducted an ablation study to compare the ef [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: An overview of our tracking dataset. The task is to [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STORM tries to deliver single-image 6D tracking plus automatic re-localization through hierarchical attention and a compatibility verifier, but the evaluation stays thin and narrow.

read the letter

STORM's main pitch is a reference-conditioned 6D tracker that starts from one image and includes a built-in way to notice when tracking has drifted and re-start. The authors package hierarchical spatial fusion attention for mixing reference and query features, plus a BCE-trained verifier whose output logit acts as a drift score to trigger re-initialization. The optional vision-language conditioning for disambiguation is a reasonable extra.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces STORM, a unified framework for reference-conditioned 6D pose tracking and re-localization that operates from a single reference image. It combines Hierarchical Spatial Fusion Attention (HSFA) for task-driven reference-query fusion (supporting single- or multi-reference and optional vision-language conditioning) with a BCE-trained tracking verifier that uses a continuous compatibility logit as an energy-like score to detect drift and trigger re-initialization. Experiments on the LM-O and YCB-Video datasets are reported to show improved annotation-free pose tracking accuracy over strong baselines together with reliable recovery from severe occlusions and rapid viewpoint changes at minimal overhead.

Significance. If the quantitative gains and re-localization behavior are substantiated by the full experimental results, the approach could meaningfully reduce dependence on CAD models and per-object manual adaptation in 6D tracking pipelines, offering a practical route toward more robust, annotation-light systems for robotics and physical AI.

major comments (1)

Experiments section: the central claim of improved annotation-free tracking accuracy and reliable drift-triggered re-localization rests on results from only LM-O and YCB-Video. Both datasets employ a narrow range of objects, lighting conditions, and motion profiles; no additional datasets, cross-domain tests, or ablation on lighting/texture/motion variations are described, leaving open whether the HSFA fusion and BCE verifier preserve the reported gains outside these collections.

minor comments (1)

Abstract: the performance claims are stated without any numerical values, error bars, or baseline deltas, which makes it difficult for readers to gauge the magnitude of improvement before reaching the full experimental tables.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point by point below.

read point-by-point responses

Referee: Experiments section: the central claim of improved annotation-free tracking accuracy and reliable drift-triggered re-localization rests on results from only LM-O and YCB-Video. Both datasets employ a narrow range of objects, lighting conditions, and motion profiles; no additional datasets, cross-domain tests, or ablation on lighting/texture/motion variations are described, leaving open whether the HSFA fusion and BCE verifier preserve the reported gains outside these collections.

Authors: We acknowledge that our primary quantitative results are reported on LM-O and YCB-Video. These remain the most widely adopted benchmarks for reference-conditioned 6D tracking, containing multiple objects, substantial occlusions, texture variation, and rapid viewpoint changes that directly exercise the HSFA fusion mechanism and the BCE verifier's drift detection. The observed gains in annotation-free accuracy and reliable re-initialization are measured against strong baselines under these conditions. We agree that broader cross-domain evaluation would further strengthen claims of generalizability. In the revised manuscript we will add a dedicated subsection discussing the method's behavior across lighting and motion subsets of the existing datasets, include qualitative results on additional real-world sequences, and explicitly note the current scope as a limitation with directions for future cross-domain testing. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation of proposed architecture

full rationale

The paper introduces STORM as a new framework combining Hierarchical Spatial Fusion Attention (HSFA) for reference-query fusion and a BCE-trained verifier for drift detection and re-initialization. Performance claims are grounded in reported experiments on LM-O and YCB-Video datasets showing improved annotation-free tracking accuracy and recovery from occlusions. No self-definitional equations, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the abstract or summary. The derivation chain consists of architectural design choices validated externally by dataset results rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework introduces two new architectural components without external derivations or independent evidence supplied in the abstract.

axioms (1)

domain assumption Standard computer-vision assumptions about feature matching and 6D pose representation hold for the evaluated datasets.
Implicit background for any 6D tracking work.

invented entities (2)

Hierarchical Spatial Fusion Attention (HSFA) no independent evidence
purpose: Task-driven reference-query fusion supporting single- and multi-reference conditioning
New attention architecture proposed to handle reference conditioning.
BCE-trained tracking verifier no independent evidence
purpose: Produce continuous compatibility logit used as energy-like score for drift detection and re-initialization
New verifier component trained with binary cross-entropy.

pith-pipeline@v0.9.0 · 5488 in / 1244 out tokens · 31514 ms · 2026-05-17T22:53:06.582086+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Hierarchical Spatial Fusion Attention (HSFA) ... fuses multi-view self-attention and cross-attention ... BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Back, S.; Lee, J.; Kim, T.; Noh, S.; Kang, R.; Bak, S.; and Lee, K. 2022. Unseen object amodal instance segmentation via hierarchical occlusion modeling. In International Conference on Robotics and Automation (ICRA)

work page 2022
[4]

Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; and Rother, C. 2014. Learning 6d object pose estimation using 3d object coordinates. In ECCV 2014: 13th European Conference

work page 2014
[5]

Cerkezi, L.; and Favaro, P. 2024. Sparse 3D Reconstruction via Object-Centric Ray Sampling. In Proceedings of the International Conference on 3D Vision (3DV)

work page 2024
[6]

Cremers, D.; and Kolev, K. 2011. Multiview Stereo and Silhouette Consistency via Convex Functionals over Convex Domains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6): 1161--1174

work page 2011
[7]

X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nie ner, M

Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nie ner, M. 2017. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5828--5839

work page 2017
[8]

???? Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration

Dai, A.; Nie ner, M.; Zollh \"o fer, M.; Izadi, S.; and Theobalt, C. ???? Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG)

work page
[9]

Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; and Fox, D. 2021. PoseRBPF: A Rao--Blackwellized particle filter for 6-D object pose tracking. IEEE Transactions on Robotics

work page 2021
[10]

Do, T.-T.; Cai, M.; Pham, T.; and Reid, I. 2018. Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint

work page 2018
[11]

Engel, J.; Sch \"o ps, T.; and Cremers, D. 2014. LSD-SLAM: Large-scale Direct Monocular SLAM. In Computer Vision -- ECCV 2014, volume 8690 of Lecture Notes in Computer Science, 834--849. Springer

work page 2014
[12]

He, K.; Gkioxari, G.; Doll \'a r, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV)

work page 2017
[13]

He, Y.; Huang, H.; Fan, H.; Chen, Q.; and Sun, J. 2021. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2021
[14]

He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; and Sun, J. 2020. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2020
[15]

Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; and Navab, N. 2012. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision

work page 2012
[16]

Hod \` a n, T.; Haluza, P.; Obdr z \' a lek, S .; Matas, J.; Lourakis, M.; and Zabulis, X. 2017. T-LESS : An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects. In IEEE Winter Conference on Applications of Computer Vision (WACV)

work page 2017
[17]

Hod \` a n, T.; Michel, F.; Brachmann, E.; Kehl, W.; Glent Buch, A.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X.; Sahin, C.; Manhardt, F.; Tombari, F.; Kim, T.; Matas, J.; and Rother, C. 2018. BOP: Benchmark for 6D Object Pose Estimation. In European Conference on Computer Vision (ECCV)

work page 2018
[18]

Hoda n , T.; Sundermeyer, M.; Drost, B.; Labb \'e , Y.; Brachmann, E.; Michel, F.; Rother, C.; and Matas, J. 2020. BOP challenge 2020 on 6D object localization. In ECCV 2020 Workshops

work page 2020
[19]

Hu, Y.; Hugonot, J.; Fua, P.; and Salzmann, M. 2019. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3385--3394

work page 2019
[20]

G.; W \"u thrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J

Kappler, D.; Meier, F.; Issac, J.; Mainprice, J.; Cifuentes, C. G.; W \"u thrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J. 2018. Real-time perception meets reactive motion generation. IEEE Robotics and Automation Letters

work page 2018
[21]

Kaskman, R.; Zakharov, S.; Shugurov, I.; and Ilic, S. 2019. HomebrewedDB: RGB-D Dataset for 6D Pose Estimation of 3D Objects. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

work page 2019
[22]

Labb \'e , Y.; Carpentier, J.; Aubry, M.; and Sivic, J. 2020. Cosypose: Consistent multi-view multi-object 6d pose estimation. In ECCV 2020: 16th European Conference. Springer

work page 2020
[23]

Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; and Fox, D. 2018. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European conference on computer vision (ECCV)

work page 2018
[24]

Lin, J.; Liu, L.; Lu, D.; and Jia, K. 2024. Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In Conference on Computer Vision and Pattern Recognition

work page 2024
[25]

Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Doll \'a r, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision

work page 2017
[26]

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll \'a r, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision

work page 2014
[27]

Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint

work page 2016
[28]

L \"u ddecke, T.; and Ecker, A. 2022. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2022
[29]

Marchand, E.; Uchiyama, H.; and Spindler, F. 2015. Pose estimation for augmented reality: a hands-on survey. IEEE transactions on visualization and computer graphics

work page 2015
[30]

M \"u ller, N.; Simonelli, A.; Porzi, L.; Rota Bul \`o , S.; Nie ner, M.; and Kontschieder, P. 2022. AutoRF: Learning 3D Object Radiance Fields from Single View Observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3961--3970

work page 2022
[31]

N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T

Nguyen, V. N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T. 2023. Cnos: A strong baseline for cad-based novel object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision

work page 2023
[32]

Nie ner, M.; Zollh \"o fer, M.; Izadi, S.; and Stamminger, M. 2013. Real-time 3D Reconstruction at Scale using Voxel Hashing. ACM Transactions on Graphics, 32(6): 169:1--169:11

work page 2013
[33]

Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint

work page 2023
[34]

Park, K.; Patten, T.; and Vincze, M. 2019. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision

work page 2019
[35]

Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning

work page 2013
[36]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning

work page 2021
[37]

Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; R \"a dle, R.; Rolland, C.; Gustafson, L.; et al. 2024. Sam 2: Segment anything in images and videos. arXiv preprint

work page 2024
[38]

E.; and De Souza, A

Rennie, C.; Shome, R.; Bekris, K. E.; and De Souza, A. F. 2016. A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters

work page 2016
[39]

E.; Hinton, G

Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. nature

work page 1986
[40]

Salehi, S. S. M.; Erdogmus, D.; and Gholipour, A. 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In International workshop on machine learning in medical imaging

work page 2017
[41]

S.; Schramowski, P.; and Kersting, K

Shindo, H.; Brack, M.; Sudhakaran, G.; Dhami, D. S.; Schramowski, P.; and Kersting, K. 2024. Deisam: Segment anything with deictic prompting. Advances in Neural Information Processing Systems

work page 2024
[42]

Su, Y.; Saleh, M.; Fetzer, T.; Rambach, J.; Navab, N.; Busam, B.; Stricker, D.; and Tombari, F. 2022. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page 2022
[43]

Team, G. 2025. Gemma 3n

work page 2025
[44]

D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K

Team, S. D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K. J.; Sax, A.; Tang, H.; Wang, W.; Guo, M.; Hardin, T.; Li, X.; Lin, A.; Liu, J.; Ma, Z.; Sagar, A.; Song, B.; Wang, X.; Yang, J.; Zhang, B.; Dollár, P.; Gkioxari, G.; Feiszli, M.; and Malik, J. 2025. SAM 3D: 3Dfy Anything in Images

work page 2025
[45]

Tjaden, H.; Schwanecke, U.; and Schomer, E. 2017. Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms. In Proceedings of the IEEE international conference on computer vision

work page 2017
[46]

H.; and Leibe, B

Voigtlaender, P.; Luiten, J.; Torr, P. H.; and Leibe, B. 2020. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2020
[47]

Wang, C.; Mart \' n-Mart \' n, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; and Zhu, Y. 2020. 6-pack: Category-level 6d pose tracker with anchor-based keypoints. In 2020 IEEE International Conference on Robotics and Automation (ICRA)

work page 2020
[48]

Wen, B.; Lian, W.; Bekris, K.; and Schaal, S. 2022. You only demonstrate once: Category-level manipulation from single visual demonstration. arXiv preprint

work page 2022
[49]

Wen, B.; Mitash, C.; Soorian, S.; Kimmel, A.; Sintov, A.; and Bekris, K. E. 2020. Robust, occlusion-aware pose estimation for objects grasped by adaptive hands. In 2020 IEEE International Conference on Robotics and Automation (ICRA)

work page 2020
[50]

Wen, B.; Yang, W.; Kautz, J.; and Birchfield, S. 2024. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page 2024
[51]

M.; Kee, V.; Le, T.; Wagner, S.; Mariottini, G.-L.; Schneider, A.; Hamilton, L.; Chipalkatty, R.; Hebert, M.; Johnson, D

Wong, J. M.; Kee, V.; Le, T.; Wagner, S.; Mariottini, G.-L.; Schneider, A.; Hamilton, L.; Chipalkatty, R.; Hebert, M.; Johnson, D. M.; et al. 2017. Segicp: Integrated deep semantic segmentation and pose estimation. In International Conference on Intelligent Robots and Systems (IROS)

work page 2017
[52]

Xiang, Y.; Schmidt, T.; Narayanan, V.; and Fox, D. 2017. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint

work page 2017
[53]

Zhang, R.; Jiang, Z.; Guo, Z.; Yan, S.; Pan, J.; Ma, X.; Dong, H.; Gao, P.; and Li, H. 2023. Personalize segment anything model with one shot. In The Twelfth International Conference on Learning Representations

work page 2023

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Back, S.; Lee, J.; Kim, T.; Noh, S.; Kang, R.; Bak, S.; and Lee, K. 2022. Unseen object amodal instance segmentation via hierarchical occlusion modeling. In International Conference on Robotics and Automation (ICRA)

work page 2022

[4] [4]

Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; and Rother, C. 2014. Learning 6d object pose estimation using 3d object coordinates. In ECCV 2014: 13th European Conference

work page 2014

[5] [5]

Cerkezi, L.; and Favaro, P. 2024. Sparse 3D Reconstruction via Object-Centric Ray Sampling. In Proceedings of the International Conference on 3D Vision (3DV)

work page 2024

[6] [6]

Cremers, D.; and Kolev, K. 2011. Multiview Stereo and Silhouette Consistency via Convex Functionals over Convex Domains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6): 1161--1174

work page 2011

[7] [7]

X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nie ner, M

Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nie ner, M. 2017. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5828--5839

work page 2017

[8] [8]

???? Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration

Dai, A.; Nie ner, M.; Zollh \"o fer, M.; Izadi, S.; and Theobalt, C. ???? Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG)

work page

[9] [9]

Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; and Fox, D. 2021. PoseRBPF: A Rao--Blackwellized particle filter for 6-D object pose tracking. IEEE Transactions on Robotics

work page 2021

[10] [10]

Do, T.-T.; Cai, M.; Pham, T.; and Reid, I. 2018. Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint

work page 2018

[11] [11]

Engel, J.; Sch \"o ps, T.; and Cremers, D. 2014. LSD-SLAM: Large-scale Direct Monocular SLAM. In Computer Vision -- ECCV 2014, volume 8690 of Lecture Notes in Computer Science, 834--849. Springer

work page 2014

[12] [12]

He, K.; Gkioxari, G.; Doll \'a r, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV)

work page 2017

[13] [13]

He, Y.; Huang, H.; Fan, H.; Chen, Q.; and Sun, J. 2021. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2021

[14] [14]

He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; and Sun, J. 2020. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2020

[15] [15]

Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; and Navab, N. 2012. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision

work page 2012

[16] [16]

Hod \` a n, T.; Haluza, P.; Obdr z \' a lek, S .; Matas, J.; Lourakis, M.; and Zabulis, X. 2017. T-LESS : An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects. In IEEE Winter Conference on Applications of Computer Vision (WACV)

work page 2017

[17] [17]

Hod \` a n, T.; Michel, F.; Brachmann, E.; Kehl, W.; Glent Buch, A.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X.; Sahin, C.; Manhardt, F.; Tombari, F.; Kim, T.; Matas, J.; and Rother, C. 2018. BOP: Benchmark for 6D Object Pose Estimation. In European Conference on Computer Vision (ECCV)

work page 2018

[18] [18]

Hoda n , T.; Sundermeyer, M.; Drost, B.; Labb \'e , Y.; Brachmann, E.; Michel, F.; Rother, C.; and Matas, J. 2020. BOP challenge 2020 on 6D object localization. In ECCV 2020 Workshops

work page 2020

[19] [19]

Hu, Y.; Hugonot, J.; Fua, P.; and Salzmann, M. 2019. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3385--3394

work page 2019

[20] [20]

G.; W \"u thrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J

Kappler, D.; Meier, F.; Issac, J.; Mainprice, J.; Cifuentes, C. G.; W \"u thrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J. 2018. Real-time perception meets reactive motion generation. IEEE Robotics and Automation Letters

work page 2018

[21] [21]

Kaskman, R.; Zakharov, S.; Shugurov, I.; and Ilic, S. 2019. HomebrewedDB: RGB-D Dataset for 6D Pose Estimation of 3D Objects. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

work page 2019

[22] [22]

Labb \'e , Y.; Carpentier, J.; Aubry, M.; and Sivic, J. 2020. Cosypose: Consistent multi-view multi-object 6d pose estimation. In ECCV 2020: 16th European Conference. Springer

work page 2020

[23] [23]

Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; and Fox, D. 2018. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European conference on computer vision (ECCV)

work page 2018

[24] [24]

Lin, J.; Liu, L.; Lu, D.; and Jia, K. 2024. Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In Conference on Computer Vision and Pattern Recognition

work page 2024

[25] [25]

Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Doll \'a r, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision

work page 2017

[26] [26]

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll \'a r, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision

work page 2014

[27] [27]

Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint

work page 2016

[28] [28]

L \"u ddecke, T.; and Ecker, A. 2022. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2022

[29] [29]

Marchand, E.; Uchiyama, H.; and Spindler, F. 2015. Pose estimation for augmented reality: a hands-on survey. IEEE transactions on visualization and computer graphics

work page 2015

[30] [30]

M \"u ller, N.; Simonelli, A.; Porzi, L.; Rota Bul \`o , S.; Nie ner, M.; and Kontschieder, P. 2022. AutoRF: Learning 3D Object Radiance Fields from Single View Observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3961--3970

work page 2022

[31] [31]

N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T

Nguyen, V. N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T. 2023. Cnos: A strong baseline for cad-based novel object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision

work page 2023

[32] [32]

Nie ner, M.; Zollh \"o fer, M.; Izadi, S.; and Stamminger, M. 2013. Real-time 3D Reconstruction at Scale using Voxel Hashing. ACM Transactions on Graphics, 32(6): 169:1--169:11

work page 2013

[33] [33]

Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint

work page 2023

[34] [34]

Park, K.; Patten, T.; and Vincze, M. 2019. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision

work page 2019

[35] [35]

Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning

work page 2013

[36] [36]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning

work page 2021

[37] [37]

Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; R \"a dle, R.; Rolland, C.; Gustafson, L.; et al. 2024. Sam 2: Segment anything in images and videos. arXiv preprint

work page 2024

[38] [38]

E.; and De Souza, A

Rennie, C.; Shome, R.; Bekris, K. E.; and De Souza, A. F. 2016. A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters

work page 2016

[39] [39]

E.; Hinton, G

Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. nature

work page 1986

[40] [40]

Salehi, S. S. M.; Erdogmus, D.; and Gholipour, A. 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In International workshop on machine learning in medical imaging

work page 2017

[41] [41]

S.; Schramowski, P.; and Kersting, K

Shindo, H.; Brack, M.; Sudhakaran, G.; Dhami, D. S.; Schramowski, P.; and Kersting, K. 2024. Deisam: Segment anything with deictic prompting. Advances in Neural Information Processing Systems

work page 2024

[42] [42]

Su, Y.; Saleh, M.; Fetzer, T.; Rambach, J.; Navab, N.; Busam, B.; Stricker, D.; and Tombari, F. 2022. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page 2022

[43] [43]

Team, G. 2025. Gemma 3n

work page 2025

[44] [44]

D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K

Team, S. D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K. J.; Sax, A.; Tang, H.; Wang, W.; Guo, M.; Hardin, T.; Li, X.; Lin, A.; Liu, J.; Ma, Z.; Sagar, A.; Song, B.; Wang, X.; Yang, J.; Zhang, B.; Dollár, P.; Gkioxari, G.; Feiszli, M.; and Malik, J. 2025. SAM 3D: 3Dfy Anything in Images

work page 2025

[45] [45]

Tjaden, H.; Schwanecke, U.; and Schomer, E. 2017. Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms. In Proceedings of the IEEE international conference on computer vision

work page 2017

[46] [46]

H.; and Leibe, B

Voigtlaender, P.; Luiten, J.; Torr, P. H.; and Leibe, B. 2020. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

work page 2020

[47] [47]

Wang, C.; Mart \' n-Mart \' n, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; and Zhu, Y. 2020. 6-pack: Category-level 6d pose tracker with anchor-based keypoints. In 2020 IEEE International Conference on Robotics and Automation (ICRA)

work page 2020

[48] [48]

Wen, B.; Lian, W.; Bekris, K.; and Schaal, S. 2022. You only demonstrate once: Category-level manipulation from single visual demonstration. arXiv preprint

work page 2022

[49] [49]

Wen, B.; Mitash, C.; Soorian, S.; Kimmel, A.; Sintov, A.; and Bekris, K. E. 2020. Robust, occlusion-aware pose estimation for objects grasped by adaptive hands. In 2020 IEEE International Conference on Robotics and Automation (ICRA)

work page 2020

[50] [50]

Wen, B.; Yang, W.; Kautz, J.; and Birchfield, S. 2024. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page 2024

[51] [51]

M.; Kee, V.; Le, T.; Wagner, S.; Mariottini, G.-L.; Schneider, A.; Hamilton, L.; Chipalkatty, R.; Hebert, M.; Johnson, D

Wong, J. M.; Kee, V.; Le, T.; Wagner, S.; Mariottini, G.-L.; Schneider, A.; Hamilton, L.; Chipalkatty, R.; Hebert, M.; Johnson, D. M.; et al. 2017. Segicp: Integrated deep semantic segmentation and pose estimation. In International Conference on Intelligent Robots and Systems (IROS)

work page 2017

[52] [52]

Xiang, Y.; Schmidt, T.; Narayanan, V.; and Fox, D. 2017. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint

work page 2017

[53] [53]

Zhang, R.; Jiang, Z.; Guo, Z.; Yan, S.; Pan, J.; Ma, X.; Dong, H.; Gao, P.; and Li, H. 2023. Personalize segment anything model with one shot. In The Twelfth International Conference on Learning Representations

work page 2023