pith. machine review for the scientific record. sign in

arxiv: 2603.28045 · v2 · submitted 2026-03-30 · 💻 cs.CV

Recognition: no theorem link

Event6D: Event-based Novel Object 6D Pose Tracking

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords event camera6D pose trackingnovel object trackingevent-based visiondepth reconstructionsynthetic to realreal-time trackingobject pose estimation
0
0 comments X

The pith

EventTrack6D tracks 6D poses of novel objects using event cameras by reconstructing intensity and depth from event streams at over 120 FPS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for 6D object pose tracking that uses event camera data to handle fast motions where standard cameras fail due to blur. It reconstructs dense intensity and depth information from sparse event streams using the latest depth measurement as conditioning. This allows the system to track objects it has never seen before during training, generalizing from synthetic data to real scenarios without additional fine-tuning. Such an approach matters because it enables reliable pose estimation in dynamic environments like robotics or augmented reality where speed and adaptability to new objects are essential.

Core claim

EventTrack6D is an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, the dual reconstruction recovers dense photometric and geometric cues from sparse event streams, operating at over 120 FPS while maintaining temporal consistency under rapid motion.

What carries the argument

The dual reconstruction network that recovers dense photometric and geometric cues from sparse event streams, conditioned on the most recent depth measurement.

If this is right

  • Operates in real time at over 120 FPS for fast dynamic scenes.
  • Generalizes from synthetic training data to real-world scenarios without fine-tuning.
  • Maintains accurate tracking across diverse objects and motion patterns.
  • Provides a benchmark suite including synthetic training data and real and simulated evaluation sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This reconstruction approach could potentially be adapted for other event-based vision tasks such as optical flow estimation in high-speed scenarios.
  • By avoiding object-specific training, the method opens possibilities for deploying tracking systems in environments with frequently changing object sets.
  • The reliance on recent depth measurements suggests integration with depth sensors could further improve performance in varying lighting conditions.

Load-bearing premise

The dual reconstruction network can reliably recover dense photometric and geometric cues from sparse event streams for arbitrary novel objects and rapid motions when conditioned only on the most recent depth measurement.

What would settle it

Observing a significant drop in tracking accuracy on real event data with rapid object motions or unseen object shapes compared to synthetic benchmarks would indicate the reconstruction does not generalize as claimed.

Figures

Figures reproduced from arXiv: 2603.28045 by Bowen Wen, Hoonhee Cho, Jae-Young Kang, Kuk-Jin Yoon, Minjun Kang, Taeyeop Lee, Youngho Kim.

Figure 1
Figure 1. Figure 1: Conventional RGB-D based methods often fail under highly [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our EventTrack6D. EventTrack6D consists of a dual-modal reconstruction module and a pose refinement module. It can perform 6D [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: System designed for acquiring the Event6D dataset. The event [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of 6D object tracking at 120 FPS on the Event6D dataset. Original FoundationPose(FP) [ [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative depth-reconstruction results on depth-absent intervals. The future depth [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Examples of the data used for camera calibration. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of Hand-eye calibration. We denote the OptiTrack [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of trigger signals for overall system. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: EventBlender6D samples visualized as temporal streams of RGB, event, depth, and corresponding 6D object poses. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: EventHO3D samples visualized as temporal streams of RGB, event, depth, and corresponding 6D object poses. [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Event6D test samples visualized as temporal streams of RGB, event, depth, and corresponding 6D object poses. [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative comparison on the Event6D drill object sequence. Although the event-based methods operate at intervals corresponding to 120 FPS, [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative comparison on the Event6D marker object sequence. Although the event-based methods operate at intervals corresponding to 120 FPS, [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
read the original abstract

Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents EventTrack6D, an event-based framework for 6D pose tracking of novel objects. It uses a dual reconstruction network to recover dense intensity and depth cues from sparse event streams, conditioned on the most recent depth measurement, enabling tracking at arbitrary timestamps between depth frames. Trained exclusively on synthetic data, the method claims to generalize effectively to real-world scenarios without fine-tuning, operating above 120 FPS while maintaining temporal consistency under rapid motion. The work introduces a large-scale synthetic training dataset and two evaluation sets (real and simulated events) to support this, with public code and data release.

Significance. If the synthetic-to-real generalization without fine-tuning holds under the reported conditions, the result would advance event-camera applications in fast-motion 6D tracking by eliminating object-specific training requirements. The public datasets and code provide a concrete benchmark that could facilitate follow-on work in event-based vision.

major comments (1)
  1. [§3.2] §3.2 (Dual Reconstruction Network): Conditioning the network solely on the single most recent depth measurement creates a risk that geometric priors become stale under rapid object motion or inter-frame depth changes. Because the central no-fine-tuning generalization claim depends on reliable dense cue recovery from events alone, the manuscript should include targeted ablations or error analysis on motion speed and depth variation to confirm that inference from the event stream remains accurate for novel shapes.
minor comments (2)
  1. [Abstract] The abstract states performance at 'over 120 FPS' without specifying the hardware platform, input resolution, or exact timing breakdown between reconstruction and tracking stages.
  2. [§5] Table or figure captions for the benchmark datasets should explicitly list the number of objects, motion types, and event rates to allow direct comparison with prior event-tracking work.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of EventTrack6D. We address the single major comment below by agreeing to incorporate the requested analyses, which we believe will further strengthen the evidence for our synthetic-to-real generalization claims under rapid motion.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Dual Reconstruction Network): Conditioning the network solely on the single most recent depth measurement creates a risk that geometric priors become stale under rapid object motion or inter-frame depth changes. Because the central no-fine-tuning generalization claim depends on reliable dense cue recovery from events alone, the manuscript should include targeted ablations or error analysis on motion speed and depth variation to confirm that inference from the event stream remains accurate for novel shapes.

    Authors: We thank the referee for this insightful observation on potential staleness of geometric priors. Our dual reconstruction network is explicitly trained on synthetic sequences that include diverse motion speeds and depth variations, allowing the event stream to provide continuous high-frequency updates that compensate for any outdated depth conditioning. Original experiments already demonstrate robust tracking at >120 FPS on rapid-motion real and simulated sequences without fine-tuning. To directly address the request, the revised manuscript adds a new ablation subsection that systematically varies inter-frame motion velocity and depth change magnitude, reporting both reconstruction error and final 6D pose accuracy for novel objects. These results show graceful degradation, confirming that event-based inference remains reliable even when the most recent depth measurement is stale. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical generalization presented as experimental outcome

full rationale

The paper's central claim—that EventTrack6D, trained only on synthetic data, generalizes to real novel objects without fine-tuning—is framed as an empirical result validated on introduced benchmarks rather than a quantity derived by definition or self-referential fitting. The abstract describes the dual reconstruction network as a methodological component conditioned on recent depth to recover cues from events, but does not equate the reported tracking accuracy or generalization performance to any fitted parameter or input defined from the same data. No equations, self-citations, or uniqueness theorems are invoked in the provided text to force the result; the synthetic-to-real transfer is presented as an observed outcome of the framework and datasets. This leaves the derivation self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a learned dual reconstruction can invert sparse event data into usable dense cues for any novel object; no explicit free parameters, new axioms, or invented physical entities are named in the abstract.

axioms (1)
  • domain assumption Event camera output can be treated as a reliable sparse signal of brightness changes that, when combined with occasional depth frames, suffices to reconstruct dense intensity and geometry.
    Invoked in the description of the dual reconstruction step.

pith-pipeline@v0.9.0 · 5511 in / 1273 out tokens · 38759 ms · 2026-05-14T22:04:20.260511+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

150 extracted references · 150 canonical work pages · 1 internal anchor

  1. [1]

    Mtevent: A multi-task event camera dataset for 6d pose estimation and moving object detection

    Shrutarv Awasthi, Anas Gouda, Sven Franke, J ´erˆome Ruti- nowski, Frank Hoffmann, and Moritz Roidl. Mtevent: A multi-task event camera dataset for 6d pose estimation and moving object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5102–5110, 2025. 3

  2. [2]

    Grasp- clutter6d: A large-scale real-world dataset for robust per- ception and grasping in cluttered scenes.IEEE Robotics and Automation Letters, 2025

    Seunghyeok Back, Joosoon Lee, Kangmin Kim, Heeseon Rho, Geonhyup Lee, Raeyoung Kang, Sangbeom Lee, Sangjun Noh, Youngjin Lee, Taeyeop Lee, et al. Grasp- clutter6d: A large-scale real-world dataset for robust per- ception and grasping in cluttered scenes.IEEE Robotics and Automation Letters, 2025. 2

  3. [3]

    Introducing hot3d: An egocentric dataset for 3d hand and object tracking.arXiv preprint arXiv:2406.09598,

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Fan Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, et al. Introducing hot3d: An egocentric dataset for 3d hand and object tracking.arXiv preprint arXiv:2406.09598,

  4. [4]

    Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025. 5

  5. [5]

    Simultaneous optical flow and intensity estimation from an event camera

    Patrick Bardow, Andrew J Davison, and Stefan Leuteneg- ger. Simultaneous optical flow and intensity estimation from an event camera. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 884–892, 2016. 2

  6. [6]

    Real-time image- based tracking of planes using efficient second-order min- imization

    Selim Benhimane and Ezio Malis. Real-time image- based tracking of planes using efficient second-order min- imization. In2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), pages 943–948. IEEE, 2004. 2

  7. [7]

    Method for registration of 3-d shapes

    Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. InSensor fusion IV: control paradigms and data structures, pages 586–606. Spie, 1992. 2, 6, 4

  8. [8]

    The ycb object and model set: Towards common benchmarks for manip- ulation research

    Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srini- vasa, Pieter Abbeel, and Aaron M Dollar. The ycb object and model set: Towards common benchmarks for manip- ulation research. In2015 international conference on ad- vanced robotics (ICAR), pages 510–517. IEEE, 2015. 1, 5, 3

  9. [9]

    Dexycb: A benchmark for capturing hand grasping of objects

    Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, et al. Dexycb: A benchmark for capturing hand grasping of objects. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9044–9053, 2021. 1, 5

  10. [10]

    Repurposing pre- trained video diffusion models for event-based video inter- polation

    Jingxi Chen, Brandon Y Feng, Haoming Cai, Tianfu Wang, Levi Burner, Dehao Yuan, Cornelia Fermuller, Christo- pher A Metzler, and Yiannis Aloimonos. Repurposing pre- trained video diffusion models for event-based video inter- polation. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 12456–12466, 2025. 3

  11. [11]

    Esvio: Event- based stereo visual inertial odometry.IEEE Robotics and Automation Letters, 8(6):3661–3668, 2023

    Peiyu Chen, Weipeng Guan, and Peng Lu. Esvio: Event- based stereo visual inertial odometry.IEEE Robotics and Automation Letters, 8(6):3661–3668, 2023. 3

  12. [12]

    Gre-slam: 6-dof pure event- based slam with semi-dense depth recovery assisted bundle adjustment

    Yang Chen and Lin Zhang. Gre-slam: 6-dof pure event- based slam with semi-dense depth recovery assisted bundle adjustment. InProceedings of the 2025 International Con- ference on Multimedia Retrieval, pages 90–98, 2025. 3

  13. [13]

    Tem- poral event stereo via joint learning with stereoscopic flow

    Hoonhee Cho, Jae-Young Kang, and Kuk-Jin Yoon. Tem- poral event stereo via joint learning with stereoscopic flow. InEuropean Conference on Computer Vision, pages 294–

  14. [14]

    A benchmark dataset for event-guided human pose estimation and tracking in extreme conditions.Advances in Neural Information Processing Systems, 37:134826– 134840, 2024

    Hoonhee Cho, Taewoo Kim, Yuhwan Jeong, and Kuk-Jin Yoon. A benchmark dataset for event-guided human pose estimation and tracking in extreme conditions.Advances in Neural Information Processing Systems, 37:134826– 134840, 2024. 3

  15. [15]

    Tta-evf: Test-time adaptation for event-based video frame interpolation via reliable pixel and sample estima- tion

    Hoonhee Cho, Taewoo Kim, Yuhwan Jeong, and Kuk-Jin Yoon. Tta-evf: Test-time adaptation for event-based video frame interpolation via reliable pixel and sample estima- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 25701–25711,

  16. [16]

    Ev-3dod: Pushing the temporal boundaries of 3d object detection with event cameras

    Hoonhee Cho, Jae-young Kang, Youngho Kim, and Kuk- Jin Yoon. Ev-3dod: Pushing the temporal boundaries of 3d object detection with event cameras. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27197–27210, 2025. 3

  17. [17]

    Dense hand-object (ho) graspnet with full grasping taxonomy and dynamics

    Woojin Cho, Jihyun Lee, Minjae Yi, Minje Kim, Taeyun Woo, Donghwan Kim, Taewook Ha, Hyokeun Lee, Je- Hwan Ryu, Woontack Woo, et al. Dense hand-object (ho) graspnet with full grasping taxonomy and dynamics. In European Conference on Computer Vision, pages 284–303. Springer, 2024. 5, 3

  18. [18]

    Real-time markerless tracking for augmented reality: the virtual visual servoing framework

    Andrew I Comport, Eric Marchand, Muriel Pressigout, and Francois Chaumette. Real-time markerless tracking for augmented reality: the virtual visual servoing framework. IEEE Transactions on visualization and computer graph- ics, 12(4):615–628, 2006. 2

  19. [19]

    Interacting maps for fast vi- sual interpretation

    Matthew Cook, Luca Gugelmann, Florian Jug, Christoph Krautz, and Angelika Steger. Interacting maps for fast vi- sual interpretation. InThe 2011 International Joint Con- ference on Neural Networks, pages 770–776. IEEE, 2011. 2

  20. [20]

    Robust 3d track- ing with descriptor fields

    Alberto Crivellaro and Vincent Lepetit. Robust 3d track- ing with descriptor fields. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 3414–3421, 2014. 2

  21. [21]

    Poserbpf: A rao–blackwellized particle filter for 6-d object pose tracking.IEEE Transac- tions on Robotics, 37(5):1328–1342, 2021

    Xinke Deng, Arsalan Mousavian, Yu Xiang, Fei Xia, Tim- othy Bretl, and Dieter Fox. Poserbpf: A rao–blackwellized particle filter for 6-d object pose tracking.IEEE Transac- tions on Robotics, 37(5):1328–1342, 2021. 2

  22. [22]

    Blenderproc: Reducing the reality gap with photorealistic rendering

    Maximilian Denninger, Martin Sundermeyer, Dominik Winkelbauer, Dmitry Olefir, Tomas Hodan, Youssef Zidan, Mohamad Elbadrawy, Markus Knauer, Harinandan Katam, and Ahsan Lodhi. Blenderproc: Reducing the reality gap with photorealistic rendering. In16th Robotics: Science and Systems, RSS 2020, Workshops, 2020. 5

  23. [23]

    Blenderproc2: A procedural pipeline for photorealistic ren- dering.Journal of Open Source Software, 8(82):4901,

    Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Wendelin Knauer, Klaus H Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A procedural pipeline for photorealistic ren- dering.Journal of Open Source Software, 8(82):4901,

  24. [24]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automa- tion (ICRA), pages 2553–2560. Ieee, 2022. 5, 1, 3

  25. [25]

    Real-time visual tracking of complex structures.IEEE Transactions on Pattern Analysis & Machine Intelligence, 24(07):932–946,

    Tom Drummond and Roberto Cipolla. Real-time visual tracking of complex structures.IEEE Transactions on Pattern Analysis & Machine Intelligence, 24(07):932–946,

  26. [26]

    Pizza: A powerful image-only zero-shot zero- cad approach to 6 dof tracking

    Yuming Du, Yang Xiao, Michael Ramamonjisoa, Vincent Lepetit, et al. Pizza: A powerful image-only zero-shot zero- cad approach to 6 dof tracking. In2022 International Con- ference on 3D Vision (3DV), pages 515–525. IEEE, 2022. 2

  27. [27]

    Rgb-de: Event camera calibration for fast 6-dof object tracking

    Etienne Dubeau, Mathieu Garon, Benoit Debaque, Raoul de Charette, and Jean-Franc ¸ois Lalonde. Rgb-de: Event camera calibration for fast 6-dof object tracking. In2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–135. IEEE, 2020. 1, 2, 3, 5

  28. [28]

    Real-time 6-dof pose estimation by an event-based camera using active led markers

    Gerald Ebmer, Adam Loch, Minh Nhat Vu, Roberto Mecca, Germain Haessig, Christian Hartl-Nesic, Markus Vincze, and Andreas Kugi. Real-time 6-dof pose estimation by an event-based camera using active led markers. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8137–8146, 2024. 3

  29. [29]

    Unsupervised event-based video reconstruction

    Gereon Fox, Xingang Pan, Ayush Tewari, Mohamed El- gharib, and Christian Theobalt. Unsupervised event-based video reconstruction. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 4179–4188, 2024. 3

  30. [30]

    Unified temporal and spatial calibration for multi-sensor systems

    Paul Furgale, Joern Rehder, and Roland Siegwart. Unified temporal and spatial calibration for multi-sensor systems. In2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1280–1286. IEEE, 2013. 1

  31. [31]

    Comparative anal- ysis of optitrack motion capture systems

    Joshua S Furtado, Hugh HT Liu, Gilbert Lai, Herve Lacheray, and Jason Desouza-Coelho. Comparative anal- ysis of optitrack motion capture systems. InAdvances in Motion Sensing and Control for Robotic Applications: Selected Papers from the Symposium on Mechatronics, Robotics, and Control (SMRC’18)-CSME International Congress 2018, May 27-30, 2018 Toronto, C...

  32. [32]

    Event-based vision: A survey.IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 44 (1):154–180, 2020

    Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 44 (1):154–180, 2020. 1

  33. [33]

    Deep 6-dof tracking.IEEE transactions on visualization and computer graphics, 23(11):2410–2418, 2017

    Mathieu Garon and Jean-Franc ¸ois Lalonde. Deep 6-dof tracking.IEEE transactions on visualization and computer graphics, 23(11):2410–2418, 2017. 2

  34. [34]

    Low-latency au- tomotive vision with event cameras.Nature, 629(8014): 1034–1040, 2024

    Daniel Gehrig and Davide Scaramuzza. Low-latency au- tomotive vision with event cameras.Nature, 629(8014): 1034–1040, 2024. 3

  35. [35]

    Asynchronous, photometric feature tracking using events and frames

    Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Davide Scaramuzza. Asynchronous, photometric feature tracking using events and frames. InProceedings of the European Conference on Computer Vision (ECCV), pages 750–765, 2018. 2

  36. [36]

    Video to events: Recycling video datasets for event cameras

    Daniel Gehrig, Mathias Gehrig, Javier Hidalgo-Carri ´o, and Davide Scaramuzza. Video to events: Recycling video datasets for event cameras. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3586–3595, 2020. 1

  37. [37]

    Recurrent vision transformers for object detection with event cameras

    Mathias Gehrig and Davide Scaramuzza. Recurrent vision transformers for object detection with event cameras. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 13884–13893, 2023. 6

  38. [38]

    Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021

    Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Da- vide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021. 3

  39. [39]

    E-raft: Dense optical flow from event cameras

    Mathias Gehrig, Mario Millh ¨ausler, Daniel Gehrig, and Da- vide Scaramuzza. E-raft: Dense optical flow from event cameras. In2021 International Conference on 3D Vision (3DV), pages 197–206. IEEE, 2021. 3

  40. [40]

    Edopt: Event-camera 6-dof dynamic object pose tracking

    Arren Glover, Luna Gava, Zhichao Li, and Chiara Bar- tolozzi. Edopt: Event-camera 6-dof dynamic object pose tracking. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18200–18206. IEEE, 2024. 3

  41. [41]

    Edopt: Event-camera 6-dof dynamic object pose tracking

    Arren Glover, Luna Gava, Zhichao Li, and Chiara Bar- tolozzi. Edopt: Event-camera 6-dof dynamic object pose tracking. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18200–18206. IEEE, 2024. 5

  42. [42]

    Deio: Deep event inertial odometry

    Weipeng Guan, Fuling Lin, Peiyu Chen, and Peng Lu. Deio: Deep event inertial odometry. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4606–4615, 2025. 3

  43. [43]

    Handal: A dataset of real-world manipulable object cate- gories with pose annotations, affordances, and reconstruc- tions

    Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Trem- blay, Stephen Tyree, Jeffrey Smith, and Stan Birchfield. Handal: A dataset of real-world manipulable object cate- gories with pose annotations, affordances, and reconstruc- tions. In2023 IEEE/RSJ International Conference on In- telligent Robots and Systems (IROS), pages 11428–11435. IEEE, 2023. 1

  44. [44]

    Measuring depth accuracy in rgbd cameras

    Hussein Haggag, Mohammed Hossny, Despina Filippidis, Douglas Creighton, Saeid Nahavandi, and Vinod Puri. Measuring depth accuracy in rgbd cameras. In2013, 7th international conference on signal processing and commu- nication systems (ICSPCS), pages 1–7. IEEE, 2013. 1

  45. [45]

    Etap: Event- based tracking of any point

    Friedhelm Hamann, Daniel Gehrig, Filbert Febryanto, Kostas Daniilidis, and Guillermo Gallego. Etap: Event- based tracking of any point. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 27186–27196, 2025. 6, 1, 4

  46. [46]

    Honnotate: A method for 3d annotation of hand and object poses

    Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vin- cent Lepetit. Honnotate: A method for 3d annotation of hand and object poses. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3196–3206, 2020. 2, 5, 1

  47. [47]

    Rapid-a video rate object tracker

    Chris Harris and Carl Stennett. Rapid-a video rate object tracker. InBMVC, page 3, 1990. 2

  48. [48]

    E-pose: A large scale event camera dataset for object pose estimation.Scientific data, 12(1):245, 2025

    Oussama Abdul Hay, Xiaoqian Huang, Abdulla Ayyad, Es- lam Sherif, Randa Almadhoun, Yusra Abdulrahman, Lak- mal Seneviratne, Abdulqader Abusafieh, and Yahya Zweiri. E-pose: A large scale event camera dataset for object pose estimation.Scientific data, 12(1):245, 2025. 1, 2, 3, 5

  49. [49]

    Learning monocular dense depth from events

    Javier Hidalgo-Carri ´o, Daniel Gehrig, and Davide Scara- muzza. Learning monocular dense depth from events. In 2020 International Conference on 3D Vision (3DV), pages 534–542. IEEE, 2020. 3

  50. [50]

    Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes

    Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. InAsian conference on computer vision, pages 548–562. Springer,

  51. [51]

    Bop challenge 2023 on detection segmentation and pose estimation of seen and unseen rigid objects

    Tomas Hodan, Martin Sundermeyer, Yann Labbe, Van Nguyen Nguyen, Gu Wang, Eric Brachmann, Bertram Drost, Vincent Lepetit, Carsten Rother, and Jiri Matas. Bop challenge 2023 on detection segmentation and pose estimation of seen and unseen rigid objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5610–5619, 2024. 1, 5

  52. [52]

    Ede-distill: Boosting event-based monocular depth estimation performance via knowledge distillation

    Chenming Hu, Junjie Jiang, Yaohui Li, Mingyuan Sun, and Zheng Fang. Ede-distill: Boosting event-based monocular depth estimation performance via knowledge distillation. IEEE Robotics and Automation Letters, 2025. 3

  53. [53]

    Depth-based object tracking using a robust gaussian filter

    Jan Issac, Manuel W ¨uthrich, Cristina Garcia Cifuentes, Jeannette Bohg, Sebastian Trimpe, and Stefan Schaal. Depth-based object tracking using a robust gaussian filter. In2016 IEEE international conference on robotics and au- tomation (ICRA), pages 608–615. IEEE, 2016. 2

  54. [54]

    To- wards robust event-based networks for nighttime via un- paired day-to-night event translation

    Yuhwan Jeong, Hoonhee Cho, and Kuk-Jin Yoon. To- wards robust event-based networks for nighttime via un- paired day-to-night event translation. InEuropean Confer- ence on Computer Vision, pages 286–306. Springer, 2024. 3

  55. [55]

    What mat- ters in unsupervised optical flow

    Rico Jonschkowski, Austin Stone, Jonathan T Barron, Ariel Gordon, Kurt Konolige, and Anelia Angelova. What mat- ters in unsupervised optical flow. InEuropean conference on computer vision, pages 557–572. Springer, 2020. 4

  56. [56]

    Tem- poral stereo matching from event cameras via joint learning with stereoscopic flow.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 2025

    Jae-Young Kang, Hoonhee Cho, and Kuk-Jin Yoon. Tem- poral stereo matching from event cameras via joint learning with stereoscopic flow.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 2025. 3

  57. [57]

    Un- leashing the temporal potential of stereo event cameras for continuous-time 3d object detection

    Jae-Young Kang, Hoonhee Cho, and Kuk-Jin Yoon. Un- leashing the temporal potential of stereo event cameras for continuous-time 3d object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6869–6881, 2025. 3

  58. [58]

    Intel realsense stereo- scopic depth cameras

    Leonid Keselman, John Iselin Woodfill, Anders Grunnet- Jepsen, and Achintya Bhowmik. Intel realsense stereo- scopic depth cameras. InProceedings of the IEEE con- ference on computer vision and pattern recognition work- shops, pages 1–10, 2017. 1

  59. [59]

    Simultaneous mosaicing and track- ing with an event camera.J

    Hanme Kim, Ankur Handa, Ryad Benosman, Sio-Hoi Ieng, and Andrew J Davison. Simultaneous mosaicing and track- ing with an event camera.J. Solid State Circ, 43:566–576,

  60. [60]

    Real-time 3d reconstruction and 6-dof tracking with an event camera

    Hanme Kim, Stefan Leutenegger, and Andrew J Davison. Real-time 3d reconstruction and 6-dof tracking with an event camera. InEuropean conference on computer vision, pages 349–364. Springer, 2016. 2, 3

  61. [61]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023. 2

  62. [62]

    Deep event visual odometry

    Simon Klenk, Marvin Motzet, Lukas Koestler, and Daniel Cremers. Deep event visual odometry. In2024 Inter- national conference on 3D vision (3DV), pages 739–749. IEEE, 2024. 1

  63. [63]

    Cosypose: Consistent multi-view multi-object 6d pose estimation

    Yann Labb ´e, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Cosypose: Consistent multi-view multi-object 6d pose estimation. InEuropean conference on computer vi- sion, pages 574–591. Springer, 2020. 2, 4

  64. [64]

    arXiv preprint arXiv:2212.06870 (2022)

    Yann Labb ´e, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpen- tier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Mega- pose: 6d pose estimation of novel objects via render & com- pare.arXiv preprint arXiv:2212.06870, 2022. 1, 2, 3, 4, 6, 7

  65. [65]

    Category-level metric scale object shape and pose estimation.IEEE Robotics and Automation Letters, 6(4): 8575–8582, 2021

    Taeyeop Lee, Byeong-Uk Lee, Myungchul Kim, and In So Kweon. Category-level metric scale object shape and pose estimation.IEEE Robotics and Automation Letters, 6(4): 8575–8582, 2021. 1, 2

  66. [66]

    Tta-cope: Test-time adaptation for category-level object pose estimation

    Taeyeop Lee, Jonathan Tremblay, Valts Blukis, Bowen Wen, Byeong-Uk Lee, Inkyu Shin, Stan Birchfield, In So Kweon, and Kuk-Jin Yoon. Tta-cope: Test-time adaptation for category-level object pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21285–21295, 2023. 2

  67. [67]

    Delta: Demonstration and language- guided novel transparent object manipulation.arXiv preprint arXiv:2510.05662, 2025

    Taeyeop Lee, Gyuree Kang, Bowen Wen, Youngho Kim, Seunghyeok Back, In So Kweon, David Hyunchul Shim, and Kuk-Jin Yoon. Delta: Demonstration and language- guided novel transparent object manipulation.arXiv preprint arXiv:2510.05662, 2025. 1

  68. [68]

    Any6d: Model-free 6d pose estimation of novel objects

    Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6d: Model-free 6d pose estimation of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11633–11643, 2025. 2

  69. [69]

    Ep n p: An accurate o (n) solution to the p n p problem

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Ep n p: An accurate o (n) solution to the p n p problem. International Journal of Computer Vision, 81(2):155–166,

  70. [70]

    Epnp: An accurate o (n) solution to the p n p problem.Inter- national Journal of Computer Vision, 81(2):155–166, 2009

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o (n) solution to the p n p problem.Inter- national Journal of Computer Vision, 81(2):155–166, 2009. 4

  71. [71]

    Deepim: Deep iterative matching for 6d pose estimation

    Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. InProceedings of the European conference on computer vi- sion (ECCV), pages 683–698, 2018. 2, 4

  72. [72]

    6-dof object tracking with event-based optical flow and frames

    Zhichao Li, Arren Glover, Chiara Bartolozzi, and Lorenzo Natale. 6-dof object tracking with event-based optical flow and frames. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 18880– 18887. IEEE, 2025. 3

  73. [73]

    Sam-6d: Segment anything model meets zero-shot 6d object pose es- timation

    Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. Sam-6d: Segment anything model meets zero-shot 6d object pose es- timation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27906– 27916, 2024. 1

  74. [74]

    Keypoint-based category-level object pose tracking from an rgb sequence with uncertainty estimation

    Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio A Vela, and Stan Birchfield. Keypoint-based category-level object pose tracking from an rgb sequence with uncertainty estimation. In2022 International Conference on Robotics and Automation (ICRA), pages 1258–1264. IEEE, 2022. 2

  75. [75]

    Spatiotemporal registration for event-based visual odometry

    Daqi Liu, Alvaro Parra, and Tat-Jun Chin. Spatiotemporal registration for event-based visual odometry. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4937–4946, 2021. 3

  76. [76]

    High-rate monocu- lar depth estimation via cross frame-rate collaboration of frames and events.International Journal of Computer Vi- sion, 133(10):7332–7351, 2025

    Xu Liu, Xiaopeng Fan, Jianing Li, Dianze Li, Wei Zhang, Zhengyu Ma, and Yonghong Tian. High-rate monocu- lar depth estimation via cross frame-rate collaboration of frames and events.International Journal of Computer Vi- sion, 133(10):7332–7351, 2025. 3

  77. [77]

    T-esvo: improved event-based stereo visual odome- try via adaptive time-surface and truncated signed distance function.Advanced Intelligent Systems, 5(9):2300027,

    Zhe Liu, Dianxi Shi, Ruihao Li, Yi Zhang, and Shaowu Yang. T-esvo: improved event-based stereo visual odome- try via adaptive time-surface and truncated signed distance function.Advanced Intelligent Systems, 5(9):2300027,

  78. [78]

    Optical flow-guided 6dof ob- ject pose tracking with an event camera

    Zibin Liu, Banglei Guan, Yang Shang, Shunkun Liang, Zhenbao Yu, and Qifeng Yu. Optical flow-guided 6dof ob- ject pose tracking with an event camera. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6501–6509, 2024. 3

  79. [79]

    Line-based 6-dof object pose estimation and tracking with an event camera.IEEE Transactions on Im- age Processing, 33:4765–4780, 2024

    Zibin Liu, Banglei Guan, Yang Shang, Qifeng Yu, and Lau- rent Kneip. Line-based 6-dof object pose estimation and tracking with an event camera.IEEE Transactions on Im- age Processing, 33:4765–4780, 2024. 5

  80. [80]

    Line-based 6-dof object pose estimation and tracking with an event camera.IEEE Transactions on Im- age Processing, 33:4765–4780, 2024

    Zibin Liu, Banglei Guan, Yang Shang, Qifeng Yu, and Lau- rent Kneip. Line-based 6-dof object pose estimation and tracking with an event camera.IEEE Transactions on Im- age Processing, 33:4765–4780, 2024. 3

Showing first 80 references.