
arxiv: 2511.09771 · v3 · pith:E2DKBESMnew · submitted 2025-11-12 · 💻 cs.CV

STORM: Segment, Track, and Object Re-Localization from a Single Image

Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords 6D pose estimation · object tracking · re-localization · hierarchical attention · drift detection · reference-conditioned tracking · computer vision · pose tracking

The pith

STORM tracks 6D object poses from one reference image by fusing features hierarchically and verifying drift for automatic re-initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STORM, a framework for reference-conditioned 6D tracking that works from a single reference image with minimal manual input. It combines Hierarchical Spatial Fusion Attention for reference-query fusion, which handles single or multi-reference cases and optional vision-language conditioning, with a BCE-trained verifier that uses a compatibility score to detect drift and trigger re-initialization. This addresses brittle performance in existing methods that rely on CAD models or per-object adaptation and fail under occlusion or fast motion. A sympathetic reader would care because it aims to make physical AI systems like robots more reliable in dynamic real-world settings without heavy labor. If correct, the approach reduces the need for extensive annotations while improving recovery from challenging conditions.

Core claim

STORM performs segmentation, tracking, and object re-localization for accurate 6D pose estimation from a single reference image. It relies on two components. Hierarchical Spatial Fusion Attention (HSFA) is a task-driven fusion architecture that supports single-reference and multi-reference conditioning with optional vision-language semantics. A BCE-trained tracking verifier produces a continuous compatibility logit that serves as an energy-like score to detect drift and initiate automatic re-initialization. The paper reports that this combination improves annotation-free tracking accuracy on LM-O and YCB-Video and recovers reliably from severe occlusions and rapid viewpoint changes.
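
To make the idea of reference-query fusion concrete, here is a minimal sketch of single-head cross-attention in which query-frame tokens attend to tokens gathered from one or several reference images. It is not the paper's HSFA: the function name `fuse_reference_into_query`, the token shapes, the absence of learned projections, and the residual connection are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_reference_into_query(query_tokens, reference_tokens_list):
    """Hypothetical fusion step (not the paper's HSFA).

    query_tokens: array of shape (Nq, D) from the current frame.
    reference_tokens_list: list of arrays (Nr_i, D), one per reference image.
    """
    # Single- and multi-reference cases share one code path:
    # all reference tokens are concatenated into one key/value set.
    refs = np.concatenate(reference_tokens_list, axis=0)          # (Nr, D)
    d = query_tokens.shape[-1]
    attn = softmax(query_tokens @ refs.T / np.sqrt(d), axis=-1)   # (Nq, Nr)
    # Residual connection keeps the original query features.
    return query_tokens + attn @ refs

rng = np.random.default_rng(0)
query = rng.normal(size=(16, 8))
single = fuse_reference_into_query(query, [rng.normal(size=(10, 8))])
multi = fuse_reference_into_query(
    query, [rng.normal(size=(10, 8)), rng.normal(size=(12, 8))]
)
print(single.shape, multi.shape)
```

Concatenating reference tokens is only one way to support multiple references; the page does not say how HSFA combines them.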

What carries the argument

Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports single and multi-reference conditioning, together with the BCE-trained tracking verifier that supplies a continuous compatibility logit as an energy-like score for drift detection and re-initialization.
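
The drift-detection logic described above can be sketched as a small state machine that watches the verifier's logit and requests re-initialization when it stays low. This is a sketch under stated assumptions, not the authors' implementation: the class name `DriftMonitor`, the threshold of 0.0, and the two-frame patience are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DriftMonitor:
    # Assumed cutoff: a logit of 0 corresponds to sigmoid probability 0.5.
    threshold: float = 0.0
    # Assumed patience: consecutive low-score frames before re-initializing.
    patience: int = 2
    _low_count: int = field(default=0, init=False)

    def update(self, logit: float) -> bool:
        """Return True when tracking should be re-initialized."""
        if logit < self.threshold:
            self._low_count += 1
        else:
            self._low_count = 0
        if self._low_count >= self.patience:
            self._low_count = 0  # start fresh after re-initialization
            return True
        return False

def run(logits, monitor):
    # Indices of frames at which re-initialization is triggered.
    return [t for t, s in enumerate(logits) if monitor.update(s)]

# Toy per-frame compatibility logits (made-up values, not paper results).
logits = [3.1, 2.7, -0.4, 1.9, -1.2, -2.0, 2.5]
reinit_frames = run(logits, DriftMonitor())
print(reinit_frames)  # [5]
```

A patience greater than one avoids re-initializing on a single noisy frame, at the cost of reacting one or more frames later; whether STORM uses any such smoothing is not stated on this page.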

If this is right

  • Enables annotation-free 6D pose tracking without CAD models or per-object adaptation.
  • Provides reliable recovery from severe occlusions and rapid viewpoint changes.
  • Supports both single-reference and multi-reference conditioning.
  • Allows optional vision-language conditioning to resolve instance ambiguities.
  • Adds minimal overhead for automatic re-initialization upon detected drift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The verifier's energy-like score could be adapted for uncertainty estimation in related vision tasks such as segmentation or detection.
  • The single-image conditioning might scale to video streams for continuous tracking in robotics without repeated manual references.
  • Integration with broader scene understanding could help handle multiple objects or cluttered environments more effectively.
  • Testing the fusion attention on non-rigid or deformable objects would reveal if the current design extends beyond rigid pose estimation.

Load-bearing premise

The Hierarchical Spatial Fusion Attention architecture and the BCE-trained verifier will maintain performance and generalization outside the two evaluated datasets and under the full range of real-world lighting, texture, and motion variations.

What would settle it

A demonstration on a new dataset or under unseen lighting, textures, or extreme motion conditions where STORM's tracking accuracy drops below strong baselines or fails to recover from occlusions would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2511.09771 by Hikaru Shindo, Jiahong Xue, Kristian Kersting, Quentin Delfosse, Teng Cao, Yu Deng.

Figure 1: Pose-estimation models lack robustness, as exemplified by FoundationPose (Wen et al. 2024), which fails to detect a mug under camera pose variation, highlighting its sensitivity to viewpoint shifts.
Figure 2: Overview of STORM, which is composed of two subsystems: the Segmenting Object Module (SOM) and the Tracking Object Module (TOM). SOM leverages reference images to generate a 3D model and, using their semantic and spatial information, integrates both intra-image and inter-image attention modules to capture spatial cues of the query frame through local and global attention blocks, producing a segmented mask. …
Figure 3: STORM (SOM+TOM) achieves robust pose estimation for occluded objects in complex scenes. We compare pose-estimation qualities on the LM-O and YCB-V datasets, which comprise complex scenes with multiple and possibly occluded objects. As baselines, CNOS and GroundTruth are used to predict the segmentation mask, and FoundationPose was used to produce the pose estimation. The results indicate that our method pr…
Figure 4: Comparison of 3D models reconstructed from reference images and ground-truth 3D models. The Aligned SAM3D models almost perfectly recover the underlying object structure while producing smoother and more regular contours along object boundaries.
Table extracted alongside Figure 4:
    Method        ADD    ADD-S  AR     Time (ms)
    STORM (TOM)   74.64  88.56  67.85  98±5
    FP Tracking   52.76  66.76  50.09  84±3
Figure 6: Under the same experimental conditions, we com…
Figure 7: To provide a comprehensive comparison, we evaluated the training loss, training EMA loss, and test set AP for…
Figure 8: We conducted an ablation study to compare the ef…
Figure 9: An overview of our tracking dataset. The task is to…
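
The table extracted alongside Figure 4 reports ADD and ADD-S. As a reminder of what these names conventionally denote in 6D pose evaluation, the sketch below computes both distances with NumPy: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S averages the distance to the closest predicted point and is used for symmetric objects. The function names and toy data are illustrative assumptions. The page does not say how the reported percentages are derived from these distances (for example recall at a threshold or area under a curve), so the sketch stops at the raw distances.

```python
import numpy as np

def transform(points, R, t):
    # Apply a rigid transform to an (N, 3) array of model points.
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    # ADD: mean distance between corresponding transformed points.
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    # ADD-S: for each ground-truth point, distance to the nearest predicted point.
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (N, N)
    return float(dists.min(axis=1).mean())

# Toy example: identity ground truth, prediction shifted by 1 cm along x.
rng = np.random.default_rng(0)
model_points = rng.uniform(-0.05, 0.05, size=(200, 3))  # metres, made-up object
I = np.eye(3)
add_value = add_metric(model_points, I, np.zeros(3), I, np.array([0.01, 0.0, 0.0]))
add_s_value = add_s_metric(model_points, I, np.zeros(3), I, np.array([0.01, 0.0, 0.0]))
print(add_value, add_s_value)
```

ADD-S can never exceed ADD for the same pose pair, because the corresponding point is always one of the candidates in the nearest-neighbour search.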
Original abstract

Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces STORM, a unified framework for reference-conditioned 6D pose tracking and re-localization that operates from a single reference image. It combines Hierarchical Spatial Fusion Attention (HSFA) for task-driven reference-query fusion (supporting single- or multi-reference and optional vision-language conditioning) with a BCE-trained tracking verifier that uses a continuous compatibility logit as an energy-like score to detect drift and trigger re-initialization. Experiments on the LM-O and YCB-Video datasets are reported to show improved annotation-free pose tracking accuracy over strong baselines together with reliable recovery from severe occlusions and rapid viewpoint changes at minimal overhead.

Significance. If the quantitative gains and re-localization behavior are substantiated by the full experimental results, the approach could meaningfully reduce dependence on CAD models and per-object manual adaptation in 6D tracking pipelines, offering a practical route toward more robust, annotation-light systems for robotics and physical AI.

major comments (1)
  1. Experiments section: the central claim of improved annotation-free tracking accuracy and reliable drift-triggered re-localization rests on results from only LM-O and YCB-Video. Both datasets employ a narrow range of objects, lighting conditions, and motion profiles; no additional datasets, cross-domain tests, or ablation on lighting/texture/motion variations are described, leaving open whether the HSFA fusion and BCE verifier preserve the reported gains outside these collections.
minor comments (1)
  1. Abstract: the performance claims are stated without any numerical values, error bars, or baseline deltas, which makes it difficult for readers to gauge the magnitude of improvement before reaching the full experimental tables.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point by point below.

  1. Referee: Experiments section: the central claim of improved annotation-free tracking accuracy and reliable drift-triggered re-localization rests on results from only LM-O and YCB-Video. Both datasets employ a narrow range of objects, lighting conditions, and motion profiles; no additional datasets, cross-domain tests, or ablation on lighting/texture/motion variations are described, leaving open whether the HSFA fusion and BCE verifier preserve the reported gains outside these collections.

    Authors: We acknowledge that our primary quantitative results are reported on LM-O and YCB-Video. These remain the most widely adopted benchmarks for reference-conditioned 6D tracking, containing multiple objects, substantial occlusions, texture variation, and rapid viewpoint changes that directly exercise the HSFA fusion mechanism and the BCE verifier's drift detection. The observed gains in annotation-free accuracy and reliable re-initialization are measured against strong baselines under these conditions. We agree that broader cross-domain evaluation would further strengthen claims of generalizability. In the revised manuscript we will add a dedicated subsection discussing the method's behavior across lighting and motion subsets of the existing datasets, include qualitative results on additional real-world sequences, and explicitly note the current scope as a limitation with directions for future cross-domain testing. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation of proposed architecture

full rationale

The paper introduces STORM as a new framework combining Hierarchical Spatial Fusion Attention (HSFA) for reference-query fusion and a BCE-trained verifier for drift detection and re-initialization. Performance claims are grounded in reported experiments on LM-O and YCB-Video datasets showing improved annotation-free tracking accuracy and recovery from occlusions. No self-definitional equations, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the abstract or summary. The derivation chain consists of architectural design choices validated externally by dataset results rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework introduces two new architectural components without external derivations or independent evidence supplied in the abstract.

axioms (1)
  • domain assumption: Standard computer-vision assumptions about feature matching and 6D pose representation hold for the evaluated datasets.
    Implicit background for any 6D tracking work.
invented entities (2)
  • Hierarchical Spatial Fusion Attention (HSFA) · no independent evidence
    purpose: Task-driven reference-query fusion supporting single- and multi-reference conditioning
    New attention architecture proposed to handle reference conditioning.
  • BCE-trained tracking verifier · no independent evidence
    purpose: Produce continuous compatibility logit used as energy-like score for drift detection and re-initialization
    New verifier component trained with binary cross-entropy.
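
For readers who want the standard form behind "trained with binary cross-entropy", the block below writes out BCE on a logit and one way to read that logit as an energy. The symbols $s$ (compatibility logit), $y \in \{0, 1\}$ (label), $\tau$ (drift threshold), and the sign convention $E = -s$ are notation introduced here for illustration, not taken from the paper.

```latex
\[
p = \sigma(s) = \frac{1}{1 + e^{-s}}, \qquad
\mathcal{L}_{\mathrm{BCE}}(s, y) = -\,y \log \sigma(s) - (1 - y)\,\log\bigl(1 - \sigma(s)\bigr)
\]
\[
E = -s, \qquad \text{flag drift when } s < \tau \;\Longleftrightarrow\; E > -\tau
\]
```

Under this convention a well-tracked frame has a high logit and low energy, and thresholding the logit directly avoids the saturation of $\sigma(s)$ near 0 and 1.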




Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

  3. [3]

    Back, S.; Lee, J.; Kim, T.; Noh, S.; Kang, R.; Bak, S.; and Lee, K. 2022. Unseen object amodal instance segmentation via hierarchical occlusion modeling. In International Conference on Robotics and Automation (ICRA)

  4. [4]

    Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; and Rother, C. 2014. Learning 6d object pose estimation using 3d object coordinates. In ECCV 2014: 13th European Conference

  5. [5]

    Cerkezi, L.; and Favaro, P. 2024. Sparse 3D Reconstruction via Object-Centric Ray Sampling. In Proceedings of the International Conference on 3D Vision (3DV)

  6. [6]

    Cremers, D.; and Kolev, K. 2011. Multiview Stereo and Silhouette Consistency via Convex Functionals over Convex Domains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6): 1161–1174

  7. [7]

    Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nießner, M. 2017. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5828–5839

  8. [8]

    Dai, A.; Nießner, M.; Zollhöfer, M.; Izadi, S.; and Theobalt, C. ???? Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG)

  9. [9]

    Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; and Fox, D. 2021. PoseRBPF: A Rao-Blackwellized particle filter for 6-D object pose tracking. IEEE Transactions on Robotics

  10. [10]

    Do, T.-T.; Cai, M.; Pham, T.; and Reid, I. 2018. Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint

  11. [11]

    Engel, J.; Schöps, T.; and Cremers, D. 2014. LSD-SLAM: Large-scale Direct Monocular SLAM. In Computer Vision – ECCV 2014, volume 8690 of Lecture Notes in Computer Science, 834–849. Springer

  12. [12]

    He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV)

  13. [13]

    He, Y.; Huang, H.; Fan, H.; Chen, Q.; and Sun, J. 2021. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

  14. [14]

    He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; and Sun, J. 2020. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

  15. [15]

    Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; and Navab, N. 2012. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision

  16. [16]

    Hodaň, T.; Haluza, P.; Obdržálek, Š.; Matas, J.; Lourakis, M.; and Zabulis, X. 2017. T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects. In IEEE Winter Conference on Applications of Computer Vision (WACV)

  17. [17]

    Hodaň, T.; Michel, F.; Brachmann, E.; Kehl, W.; Glent Buch, A.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X.; Sahin, C.; Manhardt, F.; Tombari, F.; Kim, T.; Matas, J.; and Rother, C. 2018. BOP: Benchmark for 6D Object Pose Estimation. In European Conference on Computer Vision (ECCV)

  18. [18]

    Hodaň, T.; Sundermeyer, M.; Drost, B.; Labbé, Y.; Brachmann, E.; Michel, F.; Rother, C.; and Matas, J. 2020. BOP challenge 2020 on 6D object localization. In ECCV 2020 Workshops

  19. [19]

    Hu, Y.; Hugonot, J.; Fua, P.; and Salzmann, M. 2019. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3385–3394

  20. [20]

    Kappler, D.; Meier, F.; Issac, J.; Mainprice, J.; Cifuentes, C. G.; Wüthrich, M.; Berenz, V.; Schaal, S.; Ratliff, N.; and Bohg, J. 2018. Real-time perception meets reactive motion generation. IEEE Robotics and Automation Letters

  21. [21]

    Kaskman, R.; Zakharov, S.; Shugurov, I.; and Ilic, S. 2019. HomebrewedDB: RGB-D Dataset for 6D Pose Estimation of 3D Objects. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

  22. [22]

    Labbé, Y.; Carpentier, J.; Aubry, M.; and Sivic, J. 2020. Cosypose: Consistent multi-view multi-object 6d pose estimation. In ECCV 2020: 16th European Conference. Springer

  23. [23]

    Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; and Fox, D. 2018. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European conference on computer vision (ECCV)

  24. [24]

    Lin, J.; Liu, L.; Lu, D.; and Jia, K. 2024. Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In Conference on Computer Vision and Pattern Recognition

  25. [25]

    Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision

  26. [26]

    Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision

  27. [27]

    Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint

  28. [28]

    Lüddecke, T.; and Ecker, A. 2022. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

  29. [29]

    Marchand, E.; Uchiyama, H.; and Spindler, F. 2015. Pose estimation for augmented reality: a hands-on survey. IEEE transactions on visualization and computer graphics

  30. [30]

    Müller, N.; Simonelli, A.; Porzi, L.; Rota Bulò, S.; Nießner, M.; and Kontschieder, P. 2022. AutoRF: Learning 3D Object Radiance Fields from Single View Observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3961–3970

  31. [31]

    Nguyen, V. N.; Groueix, T.; Ponimatkin, G.; Lepetit, V.; and Hodan, T. 2023. Cnos: A strong baseline for cad-based novel object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision

  32. [32]

    Nießner, M.; Zollhöfer, M.; Izadi, S.; and Stamminger, M. 2013. Real-time 3D Reconstruction at Scale using Voxel Hashing. ACM Transactions on Graphics, 32(6): 169:1–169:11

  33. [33]

    Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint

  34. [34]

    Park, K.; Patten, T.; and Vincze, M. 2019. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision

  35. [35]

    Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning

  36. [36]

    Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning

  37. [37]

    Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. 2024. Sam 2: Segment anything in images and videos. arXiv preprint

  38. [38]

    Rennie, C.; Shome, R.; Bekris, K. E.; and De Souza, A. F. 2016. A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters

  39. [39]

    Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature

  40. [40]

    Salehi, S. S. M.; Erdogmus, D.; and Gholipour, A. 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In International workshop on machine learning in medical imaging

  41. [41]

    Shindo, H.; Brack, M.; Sudhakaran, G.; Dhami, D. S.; Schramowski, P.; and Kersting, K. 2024. Deisam: Segment anything with deictic prompting. Advances in Neural Information Processing Systems

  42. [42]

    Su, Y.; Saleh, M.; Fetzer, T.; Rambach, J.; Navab, N.; Busam, B.; Stricker, D.; and Tombari, F. 2022. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  43. [43]

    Team, G. 2025. Gemma 3n

  44. [44]

    Team, S. D.; Chen, X.; Chu, F.-J.; Gleize, P.; Liang, K. J.; Sax, A.; Tang, H.; Wang, W.; Guo, M.; Hardin, T.; Li, X.; Lin, A.; Liu, J.; Ma, Z.; Sagar, A.; Song, B.; Wang, X.; Yang, J.; Zhang, B.; Dollár, P.; Gkioxari, G.; Feiszli, M.; and Malik, J. 2025. SAM 3D: 3Dfy Anything in Images

  45. [45]

    Tjaden, H.; Schwanecke, U.; and Schömer, E. 2017. Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms. In Proceedings of the IEEE international conference on computer vision

  46. [46]

    Voigtlaender, P.; Luiten, J.; Torr, P. H.; and Leibe, B. 2020. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

  47. [47]

    Wang, C.; Martín-Martín, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; and Zhu, Y. 2020. 6-pack: Category-level 6d pose tracker with anchor-based keypoints. In 2020 IEEE International Conference on Robotics and Automation (ICRA)

  48. [48]

    Wen, B.; Lian, W.; Bekris, K.; and Schaal, S. 2022. You only demonstrate once: Category-level manipulation from single visual demonstration. arXiv preprint

  49. [49]

    Wen, B.; Mitash, C.; Soorian, S.; Kimmel, A.; Sintov, A.; and Bekris, K. E. 2020. Robust, occlusion-aware pose estimation for objects grasped by adaptive hands. In 2020 IEEE International Conference on Robotics and Automation (ICRA)

  50. [50]

    Wen, B.; Yang, W.; Kautz, J.; and Birchfield, S. 2024. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  51. [51]

    Wong, J. M.; Kee, V.; Le, T.; Wagner, S.; Mariottini, G.-L.; Schneider, A.; Hamilton, L.; Chipalkatty, R.; Hebert, M.; Johnson, D. M.; et al. 2017. Segicp: Integrated deep semantic segmentation and pose estimation. In International Conference on Intelligent Robots and Systems (IROS)

  52. [52]

    Xiang, Y.; Schmidt, T.; Narayanan, V.; and Fox, D. 2017. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint

  53. [53]

    Zhang, R.; Jiang, Z.; Guo, Z.; Yan, S.; Pan, J.; Ma, X.; Dong, H.; Gao, P.; and Li, H. 2023. Personalize segment anything model with one shot. In The Twelfth International Conference on Learning Representations