STORM: Segment, Track, and Object Re-Localization from a Single Image
Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3
The pith
STORM tracks 6D object poses from one reference image by fusing features hierarchically and verifying drift for automatic re-initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STORM performs segmentation, tracking, and object re-localization for accurate 6D pose estimation from a single reference image. It relies on Hierarchical Spatial Fusion Attention (HSFA) as a task-driven fusion architecture that supports single-reference and multi-reference conditioning with optional vision-language semantics, paired with a BCE-trained tracking verifier whose continuous compatibility logit serves as an energy-like score to detect drift and initiate automatic re-initialization, resulting in improved annotation-free tracking accuracy on LM-O and YCB-Video while recovering reliably from severe occlusions and rapid viewpoint changes.
What carries the argument
Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports single and multi-reference conditioning, together with the BCE-trained tracking verifier that supplies a continuous compatibility logit as an energy-like score for drift detection and re-initialization.
If this is right
- Enables annotation-free 6D pose tracking without CAD models or per-object adaptation.
- Provides reliable recovery from severe occlusions and rapid viewpoint changes.
- Supports both single-reference and multi-reference conditioning.
- Allows optional vision-language conditioning to resolve instance ambiguities.
- Adds minimal overhead for automatic re-initialization upon detected drift.
Where Pith is reading between the lines
- The verifier's energy-like score could be adapted for uncertainty estimation in related vision tasks such as segmentation or detection.
- The single-image conditioning might scale to video streams for continuous tracking in robotics without repeated manual references.
- Integration with broader scene understanding could help handle multiple objects or cluttered environments more effectively.
- Testing the fusion attention on non-rigid or deformable objects would reveal if the current design extends beyond rigid pose estimation.
Load-bearing premise
The Hierarchical Spatial Fusion Attention architecture and the BCE-trained verifier will maintain performance and generalization outside the two evaluated datasets and under the full range of real-world lighting, texture, and motion variations.
What would settle it
A demonstration on a new dataset or under unseen lighting, textures, or extreme motion conditions where STORM's tracking accuracy drops below strong baselines or fails to recover from occlusions would show the central claim does not hold.
Figures
read the original abstract
Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces STORM, a unified framework for reference-conditioned 6D pose tracking and re-localization that operates from a single reference image. It combines Hierarchical Spatial Fusion Attention (HSFA) for task-driven reference-query fusion (supporting single- or multi-reference and optional vision-language conditioning) with a BCE-trained tracking verifier that uses a continuous compatibility logit as an energy-like score to detect drift and trigger re-initialization. Experiments on the LM-O and YCB-Video datasets are reported to show improved annotation-free pose tracking accuracy over strong baselines together with reliable recovery from severe occlusions and rapid viewpoint changes at minimal overhead.
Significance. If the quantitative gains and re-localization behavior are substantiated by the full experimental results, the approach could meaningfully reduce dependence on CAD models and per-object manual adaptation in 6D tracking pipelines, offering a practical route toward more robust, annotation-light systems for robotics and physical AI.
major comments (1)
- Experiments section: the central claim of improved annotation-free tracking accuracy and reliable drift-triggered re-localization rests on results from only LM-O and YCB-Video. Both datasets employ a narrow range of objects, lighting conditions, and motion profiles; no additional datasets, cross-domain tests, or ablation on lighting/texture/motion variations are described, leaving open whether the HSFA fusion and BCE verifier preserve the reported gains outside these collections.
minor comments (1)
- Abstract: the performance claims are stated without any numerical values, error bars, or baseline deltas, which makes it difficult for readers to gauge the magnitude of improvement before reaching the full experimental tables.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment point by point below.
read point-by-point responses
-
Referee: Experiments section: the central claim of improved annotation-free tracking accuracy and reliable drift-triggered re-localization rests on results from only LM-O and YCB-Video. Both datasets employ a narrow range of objects, lighting conditions, and motion profiles; no additional datasets, cross-domain tests, or ablation on lighting/texture/motion variations are described, leaving open whether the HSFA fusion and BCE verifier preserve the reported gains outside these collections.
Authors: We acknowledge that our primary quantitative results are reported on LM-O and YCB-Video. These remain the most widely adopted benchmarks for reference-conditioned 6D tracking, containing multiple objects, substantial occlusions, texture variation, and rapid viewpoint changes that directly exercise the HSFA fusion mechanism and the BCE verifier's drift detection. The observed gains in annotation-free accuracy and reliable re-initialization are measured against strong baselines under these conditions. We agree that broader cross-domain evaluation would further strengthen claims of generalizability. In the revised manuscript we will add a dedicated subsection discussing the method's behavior across lighting and motion subsets of the existing datasets, include qualitative results on additional real-world sequences, and explicitly note the current scope as a limitation with directions for future cross-domain testing. revision: partial
Circularity Check
No circularity: claims rest on empirical evaluation of proposed architecture
full rationale
The paper introduces STORM as a new framework combining Hierarchical Spatial Fusion Attention (HSFA) for reference-query fusion and a BCE-trained verifier for drift detection and re-initialization. Performance claims are grounded in reported experiments on LM-O and YCB-Video datasets showing improved annotation-free tracking accuracy and recovery from occlusions. No self-definitional equations, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the abstract or summary. The derivation chain consists of architectural design choices validated externally by dataset results rather than reducing to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard computer-vision assumptions about feature matching and 6D pose representation hold for the evaluated datasets.
invented entities (2)
-
Hierarchical Spatial Fusion Attention (HSFA)
no independent evidence
-
BCE-trained tracking verifier
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Hierarchical Spatial Fusion Attention (HSFA) ... fuses multi-view self-attention and cross-attention ... BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Back, S.; Lee, J.; Kim, T.; Noh, S.; Kang, R.; Bak, S.; and Lee, K. 2022. Unseen object amodal instance segmentation via hierarchical occlusion modeling. In International Conference on Robotics and Automation (ICRA)
work page 2022
-
[4]
Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; and Rother, C. 2014. Learning 6d object pose estimation using 3d object coordinates. In ECCV 2014: 13th European Conference
work page 2014
-
[5]
Cerkezi, L.; and Favaro, P. 2024. Sparse 3D Reconstruction via Object-Centric Ray Sampling. In Proceedings of the International Conference on 3D Vision (3DV)
work page 2024
-
[6]
Cremers, D.; and Kolev, K. 2011. Multiview Stereo and Silhouette Consistency via Convex Functionals over Convex Domains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6): 1161--1174
work page 2011
-
[7]
X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nie ner, M
Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nie ner, M. 2017. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5828--5839
work page 2017
-
[8]
Dai, A.; Nie ner, M.; Zollh \"o fer, M.; Izadi, S.; and Theobalt, C. ???? Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG)
-
[9]
Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; and Fox, D. 2021. PoseRBPF: A Rao--Blackwellized particle filter for 6-D object pose tracking. IEEE Transactions on Robotics
work page 2021
-
[10]
Do, T.-T.; Cai, M.; Pham, T.; and Reid, I. 2018. Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint
work page 2018
-
[11]
Engel, J.; Sch \"o ps, T.; and Cremers, D. 2014. LSD-SLAM: Large-scale Direct Monocular SLAM. In Computer Vision -- ECCV 2014, volume 8690 of Lecture Notes in Computer Science, 834--849. Springer
work page 2014
-
[12]
He, K.; Gkioxari, G.; Doll \'a r, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV)
work page 2017
-
[13]
He, Y.; Huang, H.; Fan, H.; Chen, Q.; and Sun, J. 2021. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
work page 2021
-
[14]
He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; and Sun, J. 2020. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
work page 2020
-
[15]
Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; and Navab, N. 2012. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision
work page 2012
-
[16]
Hod \` a n, T.; Haluza, P.; Obdr z \' a lek, S .; Matas, J.; Lourakis, M.; and Zabulis, X. 2017. T-LESS : An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects. In IEEE Winter Conference on Applications of Computer Vision (WACV)
work page 2017
-
[17]
Hod \` a n, T.; Michel, F.; Brachmann, E.; Kehl, W.; Glent Buch, A.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X.; Sahin, C.; Manhardt, F.; Tombari, F.; Kim, T.; Matas, J.; and Rother, C. 2018. BOP: Benchmark for 6D Object Pose Estimation. In European Conference on Computer Vision (ECCV)
work page 2018
-
[18]
Hoda n , T.; Sundermeyer, M.; Drost, B.; Labb \'e , Y.; Brachmann, E.; Michel, F.; Rother, C.; and Matas, J. 2020. BOP challenge 2020 on 6D object localization. In ECCV 2020 Workshops
work page 2020
-
[19]
Hu, Y.; Hugonot, J.; Fua, P.; and Salzmann, M. 2019. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3385--3394
work page 2019
-
[20]
G.; W \"u thrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J
Kappler, D.; Meier, F.; Issac, J.; Mainprice, J.; Cifuentes, C. G.; W \"u thrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J. 2018. Real-time perception meets reactive motion generation. IEEE Robotics and Automation Letters
work page 2018
-
[21]
Kaskman, R.; Zakharov, S.; Shugurov, I.; and Ilic, S. 2019. HomebrewedDB: RGB-D Dataset for 6D Pose Estimation of 3D Objects. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
work page 2019
-
[22]
Labb \'e , Y.; Carpentier, J.; Aubry, M.; and Sivic, J. 2020. Cosypose: Consistent multi-view multi-object 6d pose estimation. In ECCV 2020: 16th European Conference. Springer
work page 2020
-
[23]
Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; and Fox, D. 2018. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European conference on computer vision (ECCV)
work page 2018
-
[24]
Lin, J.; Liu, L.; Lu, D.; and Jia, K. 2024. Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In Conference on Computer Vision and Pattern Recognition
work page 2024
-
[25]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Doll \'a r, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision
work page 2017
-
[26]
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll \'a r, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision
work page 2014
-
[27]
Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint
work page 2016
-
[28]
L \"u ddecke, T.; and Ecker, A. 2022. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
work page 2022
-
[29]
Marchand, E.; Uchiyama, H.; and Spindler, F. 2015. Pose estimation for augmented reality: a hands-on survey. IEEE transactions on visualization and computer graphics
work page 2015
-
[30]
M \"u ller, N.; Simonelli, A.; Porzi, L.; Rota Bul \`o , S.; Nie ner, M.; and Kontschieder, P. 2022. AutoRF: Learning 3D Object Radiance Fields from Single View Observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3961--3970
work page 2022
-
[31]
N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T
Nguyen, V. N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T. 2023. Cnos: A strong baseline for cad-based novel object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision
work page 2023
-
[32]
Nie ner, M.; Zollh \"o fer, M.; Izadi, S.; and Stamminger, M. 2013. Real-time 3D Reconstruction at Scale using Voxel Hashing. ACM Transactions on Graphics, 32(6): 169:1--169:11
work page 2013
-
[33]
Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint
work page 2023
-
[34]
Park, K.; Patten, T.; and Vincze, M. 2019. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision
work page 2019
-
[35]
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning
work page 2013
-
[36]
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning
work page 2021
-
[37]
Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; R \"a dle, R.; Rolland, C.; Gustafson, L.; et al. 2024. Sam 2: Segment anything in images and videos. arXiv preprint
work page 2024
-
[38]
Rennie, C.; Shome, R.; Bekris, K. E.; and De Souza, A. F. 2016. A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters
work page 2016
-
[39]
Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. nature
work page 1986
-
[40]
Salehi, S. S. M.; Erdogmus, D.; and Gholipour, A. 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In International workshop on machine learning in medical imaging
work page 2017
-
[41]
S.; Schramowski, P.; and Kersting, K
Shindo, H.; Brack, M.; Sudhakaran, G.; Dhami, D. S.; Schramowski, P.; and Kersting, K. 2024. Deisam: Segment anything with deictic prompting. Advances in Neural Information Processing Systems
work page 2024
-
[42]
Su, Y.; Saleh, M.; Fetzer, T.; Rambach, J.; Navab, N.; Busam, B.; Stricker, D.; and Tombari, F. 2022. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
work page 2022
-
[43]
Team, G. 2025. Gemma 3n
work page 2025
-
[44]
D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K
Team, S. D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K. J.; Sax, A.; Tang, H.; Wang, W.; Guo, M.; Hardin, T.; Li, X.; Lin, A.; Liu, J.; Ma, Z.; Sagar, A.; Song, B.; Wang, X.; Yang, J.; Zhang, B.; Dollár, P.; Gkioxari, G.; Feiszli, M.; and Malik, J. 2025. SAM 3D: 3Dfy Anything in Images
work page 2025
-
[45]
Tjaden, H.; Schwanecke, U.; and Schomer, E. 2017. Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms. In Proceedings of the IEEE international conference on computer vision
work page 2017
-
[46]
Voigtlaender, P.; Luiten, J.; Torr, P. H.; and Leibe, B. 2020. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
work page 2020
-
[47]
Wang, C.; Mart \' n-Mart \' n, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; and Zhu, Y. 2020. 6-pack: Category-level 6d pose tracker with anchor-based keypoints. In 2020 IEEE International Conference on Robotics and Automation (ICRA)
work page 2020
-
[48]
Wen, B.; Lian, W.; Bekris, K.; and Schaal, S. 2022. You only demonstrate once: Category-level manipulation from single visual demonstration. arXiv preprint
work page 2022
-
[49]
Wen, B.; Mitash, C.; Soorian, S.; Kimmel, A.; Sintov, A.; and Bekris, K. E. 2020. Robust, occlusion-aware pose estimation for objects grasped by adaptive hands. In 2020 IEEE International Conference on Robotics and Automation (ICRA)
work page 2020
-
[50]
Wen, B.; Yang, W.; Kautz, J.; and Birchfield, S. 2024. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
work page 2024
-
[51]
Wong, J. M.; Kee, V.; Le, T.; Wagner, S.; Mariottini, G.-L.; Schneider, A.; Hamilton, L.; Chipalkatty, R.; Hebert, M.; Johnson, D. M.; et al. 2017. Segicp: Integrated deep semantic segmentation and pose estimation. In International Conference on Intelligent Robots and Systems (IROS)
work page 2017
-
[52]
Xiang, Y.; Schmidt, T.; Narayanan, V.; and Fox, D. 2017. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint
work page 2017
-
[53]
Zhang, R.; Jiang, Z.; Guo, Z.; Yan, S.; Pan, J.; Ma, X.; Dong, H.; Gao, P.; and Li, H. 2023. Personalize segment anything model with one shot. In The Twelfth International Conference on Learning Representations
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.