OctoSense: Self-Supervised Learning for Multimodal Robot Perception
Pith reviewed 2026-06-26 05:23 UTC · model grok-4.3
The pith
A late-fusion masked autoencoder with modality-specific tokenizers produces fast multimodal representations that outperform image-only models on robot perception tasks and remain robust when sensors degrade.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a late-fusion masked autoencoder that uses separate tokenizers for each sensor modality to account for their distinct spatiotemporal properties, the model learns unified representations from the OctoSense dataset; these representations support faster inference through token caching and deliver higher performance than image-only foundation models on downstream tasks while maintaining robustness under nighttime conditions or sensor degradation.
What carries the argument
Late-fusion masked autoencoder with modality-specific tokenizers and cached token inference
If this is right
- Representations can be computed in 6.68 ms on a high-end GPU and 112 ms on an embedded Orin NX board.
- Performance exceeds image-only models on optical flow, depth, semantic segmentation, and ego-motion estimation.
- Predictions remain reliable at night and when individual sensors are degraded.
- New measurements can be incorporated by caching modality-specific tokens without recomputing the entire sequence.
Where Pith is reading between the lines
- The same token-caching mechanism could support online adaptation on a moving robot by updating only the newest modality tokens.
- The 59-hour dataset spanning day, night, and degraded conditions provides a ready benchmark for testing whether other multimodal architectures also gain robustness from late fusion.
- Because each tokenizer is trained independently before fusion, the method could be extended by adding new sensor types without retraining the entire model from scratch.
Load-bearing premise
That separate tokenizers per sensor plus late fusion inside a masked autoencoder will automatically produce representations that transfer to better performance on the listed downstream tasks than single-modality training.
What would settle it
An evaluation on the same test splits that shows the multimodal model achieving equal or lower accuracy than the best image-only baseline on optical flow, depth estimation, semantic segmentation, and ego-motion metrics.
Figures
read the original abstract
We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a "late-fusion" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OctoSense, an open-source multimodal sensor platform and 59-hour dataset of time-synchronized driving data (stereo RGB, event cameras, LiDAR, thermal, IMU, RTK-GPS, proprioception) collected across varied environments and times of day. It proposes a late-fusion masked autoencoder that employs modality-specific tokenizers to handle differing spatiotemporal characteristics and caches modality-specific tokens for efficient online inference. The central claims are that this architecture runs at 6.68 ms (NVIDIA 5090) / 112 ms (Orin NX), outperforms existing image-only foundation models on optical flow, depth, semantic segmentation, and ego-motion (translation/rotation/steering), and remains robust under nighttime or degraded-sensor conditions.
Significance. If the empirical superiority and robustness claims are substantiated with quantitative evidence, the work would supply a large-scale, real-world multimodal robotics dataset and a practical architecture for heterogeneous sensor fusion in self-supervised learning, with potential impact on robust perception for autonomous driving and legged robots.
major comments (1)
- [Abstract] Abstract: the claims that the late-fusion MAE 'performs better than existing image-only foundation models' on optical flow, depth, semantic segmentation, and ego-motion and 'predicts robustly at nighttime or in situations where sensory data is degraded' are unsupported; no metrics, baselines, error bars, training losses, data splits, or ablation results are supplied, rendering the central performance assertions unverifiable from the manuscript.
minor comments (1)
- [Abstract] The manuscript references a project page for dataset, code, and videos but does not include any quantitative results or experimental protocol in the provided text, which should be added to the main body or supplementary material.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need for verifiable support of the abstract claims. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claims that the late-fusion MAE 'performs better than existing image-only foundation models' on optical flow, depth, semantic segmentation, and ego-motion and 'predicts robustly at nighttime or in situations where sensory data is degraded' are unsupported; no metrics, baselines, error bars, training losses, data splits, or ablation results are supplied, rendering the central performance assertions unverifiable from the manuscript.
Authors: We agree that the abstract claims require direct quantitative support to be verifiable. The manuscript contains an Experiments section with the requested elements (comparisons against image-only MAE/DINO baselines on optical flow EPE, depth RMSE, segmentation mIoU, and ego-motion errors; robustness ablations under nighttime/degraded conditions; data splits; and training details). However, these were not sufficiently cross-referenced from the abstract. In the revision we will (i) insert key numerical results and error bars into the abstract, (ii) add an explicit pointer to the Experiments section and supplementary tables, and (iii) ensure all baselines, splits, and ablation results are clearly tabulated. This addresses the verifiability concern without altering the underlying claims. revision: yes
Circularity Check
No significant circularity; claims are empirical.
full rationale
The paper presents an empirical architecture (late-fusion masked autoencoder with modality-specific tokenizers and caching) and reports performance gains on downstream tasks versus image-only models. No load-bearing derivation, prediction, or uniqueness result reduces by construction to fitted inputs, self-citations, or ansatzes. All central claims rest on external experimental comparisons on the 59-hour dataset rather than any self-referential equation or parameter renaming. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Caron, H
M. Caron, H. Touvron, I. Misra, H. J´egou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision, 2021
2021
-
[2]
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, 2023
2023
-
[3]
M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. M. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. H’enaff, J. Harmsen, A. Steiner, and X.-Q. Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint 2502.14786, 2025. 8Cross-dataset generaliz...
Pith/arXiv arXiv 2025
-
[4]
Ryali, Y.-T
C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman, J. Malik, Y. Li, and C. Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. InInternational Conference on Machine Learning, 2023
2023
-
[5]
Y. Liu, S. Wang, Y. Xie, T. Xiong, and M. Wu. A review of sensing technologies for indoor autonomous mobile robots.Sensors, 24, 2024
2024
-
[6]
H. I. Christensen. Global robotics technology roadmap 2025–2035: A multi-regional, cross-domain strategic perspective for europe, asia, and the united states. Technology roadmap, University of California San Diego, April 2026. Version 1.02
2025
-
[7]
Bachmann, D
R. Bachmann, D. Mizrahi, A. Atanov, and A. Zamir. Multimae: Multi-modal multi-task masked autoencoders. InEuropean Conference on Computer Vision, 2022
2022
-
[8]
Girdhar, A
R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra. ImageBind one embedding space to bind them all.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[9]
Oquab, T
M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without superv...
2025
-
[10]
O. Sim´eoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J´egou, P. Labatut, and P. Bojanowski. DINOv3.arXiv preprint 2508.10104, 2025
Pith/arXiv arXiv 2025
-
[11]
Bolya, P.-Y
D. Bolya, P.-Y. Huang, P. Sun, J. H. Cho, A. Madotto, C. Wei, T. Ma, J. Zhi, J. Rajasegaran, H. A. Rasheed, J. Wang, M. Monteiro, H. Xu, S. Dong, N. Ravi, S.-W. Li, P. Doll’ar, and C. Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. InAdvances in Neural Information Processing Systems, 2025
2025
-
[12]
N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. K. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C. Wu, R. B. Girshick, P. Doll’ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations, 2025
2025
-
[13]
K. He, X. Chen, S. Xie, Y. Li, P. Doll’ar, and R. B. Girshick. Masked autoencoders are scalable vision learners.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[14]
Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu. SimMIM: a simple framework for masked image modeling.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[15]
Devlin, M.-W
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InNorth American Chapter of the Association for Computational Linguistics, 2019
2019
-
[16]
A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding.arXiv preprint 1807.03748, 2018
Pith/arXiv arXiv 2018
-
[17]
T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, 2020
2020
-
[18]
J. Cao, J. Xing, N. Messikommer, and D. Scaramuzza. Generative event pretraining with foundation model alignment.arXiv preprint 2603.23032, 2026
Pith/arXiv arXiv 2026
-
[19]
Klenk, D
S. Klenk, D. Bonello, L. Koestler, and D. Cremers. Masked Event Modeling: Self-supervised pretraining for event cameras.IEEE/CVF Winter Conference on Applications of Computer Vision, 2022
2022
-
[20]
Y. Yang, L. Pan, and L. Liu. Event camera data dense pre-training. InEuropean Conference on Computer Vision, 2024
2024
-
[21]
R. Das, K. Daniilidis, and P. Chaudhari. Fast feature field (F3): A predictive representation of events.arXiv preprint 2509.25146, 2025. 12
arXiv 2025
- [22]
-
[23]
Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan. Masked autoencoders for point cloud self-supervised learning. InEuropean Conference on Computer Vision, 2022
2022
-
[24]
X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu. Point-BERT: Pre-training 3d point cloud transformers with masked point modeling. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[25]
H. Wang, Q. Liu, X. Yue, J. Lasenby, and M. J. Kusner. Unsupervised point cloud pre-training via occlusion completion. InIEEE/CVF International Conference on Computer Vision, 2021
2021
-
[26]
S. Xie, J. Gu, D. Guo, C. Qi, L. J. Guibas, and O. Litany. PointContrast: Unsupervised pre-training for 3d point cloud understanding. InEuropean Conference on Computer Vision, 2020
2020
-
[27]
Munir, S
F. Munir, S. Azam, and M. Jeon. Sstn: Self-supervised domain adaptation thermal object detection for autonomous driving.IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021
2021
-
[28]
Z¨ urn.Self-supervised and Multi-modal Learning for Perception in Mobile Robots and Autonomous Driving
J. Z¨ urn.Self-supervised and Multi-modal Learning for Perception in Mobile Robots and Autonomous Driving. PhD thesis, University of Freiburg, 2024
2024
-
[29]
Narayanswamy, X
G. Narayanswamy, X. Liu, K. Ayush, Y. Yang, X. Xu, S. Liao, J. Garrison, S. Tailor, J. Sunshine, Y. Liu, T. Althoff, S. Narayanan, P. Kohli, J. Zhan, M. Malhotra, S. N. Patel, S. Abdel-Ghaffar, and D. McDuff. Scaling wearable foundation models. InInternational Conference on Learning Representations, 2025
2025
-
[30]
H. Xu, P. Zhou, R. Tan, M. Li, and G. Shen. LIMU-BERT: Unleashing the potential of unlabeled data for imu sensing applications. InACM Conference on Embedded Networked Sensor Systems, 2021
2021
-
[31]
Y. Zong, O. M. Aodha, and T. M. Hospedales. Self-supervised multimodal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:5299–5318, 2023
2023
-
[32]
Radford, J
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021
2021
-
[33]
X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li. Dense contrastive learning for self-supervised visual pre-training. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021
2021
-
[34]
Mizrahi, R
D. Mizrahi, R. Bachmann, O. F. Kar, T. Yeo, M. Gao, A. Dehghan, and A. Zamir. 4M: Massively multimodal masked modeling. InAdvances in Neural Information Processing Systems, 2023
2023
-
[35]
H. Bao, L. Dong, S. Piao, and F. Wei. BEit: BERT pre-training of image transformers. InInternational Conference on Learning Representations, 2022
2022
-
[36]
J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. InInternational Conference on Learning Representations, 2023
2023
-
[37]
J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi. Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[38]
J. Zou, T. Huang, G. Yang, Z. Guo, and W. Zuo. UniM2AE: Multi-modal masked autoencoders with unified 3d representation for 3d perception in autonomous driving. InEuropean Conference on Computer Vision, 2024
2024
-
[39]
J. Sun, H. Zheng, Q. Zhang, A. Prakash, Z. M. Mao, and C. Xiao. CALICO: Self-supervised camera-lidar contrastive pre-training for bev perception. InInternational Conference on Learning Representations, 2024
2024
-
[40]
Geiger, P
A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012
2012
-
[41]
Caesar, V
H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving.2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2019. 13
2020
-
[42]
P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. M. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset.2020 IEEE/CVF Conference on Compu...
2020
-
[43]
B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting.ArXiv, abs/2301.00493, 2023
Pith/arXiv arXiv 2023
-
[44]
W. P. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The oxford robotcar dataset.The International Journal of Robotics Research, 36:15 – 3, 2017
2017
- [45]
-
[46]
Carlevaris-Bianco, A
N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice. University of Michigan North Campus long-term vision and lidar dataset.The International Journal of Robotics Research, 35:1023 – 1035, 2016
2016
-
[47]
Triest, M
S. Triest, M. Sivaprakasam, S. J. Wang, W. Wang, A. M. Johnson, and S. A. Scherer. TartanDrive: A large-scale dataset for learning off-road dynamics models.IEEE International Conference on Robotics and Automation, 2022
2022
-
[48]
Sivaprakasam, P
M. Sivaprakasam, P. Maheshwari, M. G. Castro, S. Triest, M. Nye, S. Willits, A. Saba, W. Wang, and S. A. Scherer. TartanDrive 2.0: More modalities and better infrastructure to further self-supervised learning research in off-road driving tasks.2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12606–12606, 2024
2024
-
[49]
Diaz-Ruiz, Y
C. Diaz-Ruiz, Y. Xia, Y. You, J. Nino, J. Chen, J. Monica, X. Chen, K. Luo, Y. Wang, M. Emond, W.-L. Chao, B. Hariharan, K. Q. Weinberger, and M. E. Campbell. Ithaca365: Dataset and driving perception under repeated and challenging weather conditions.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[50]
H. Schafer, E. Santana, A. Haden, and R. Biasini. A commute in data: The comma2k19 dataset.ArXiv, abs/1812.05752, 2018
Pith/arXiv arXiv 2018
-
[51]
PhysicalAI-Autonomous-Vehicles dataset
NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles dataset. https://huggingface.co/datasets/nvidia/ PhysicalAI-Autonomous-Vehicles, 2025
2025
-
[52]
Gehrig, W
M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza. DSEC: A stereo event camera dataset for driving scenarios.IEEE Robot. and Autom. Lett., March 2021
2021
-
[53]
A. Z. Zhu, D. Thakur, T. ¨Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis. The multivehicle stereo event camera dataset: An event camera dataset for 3d perception.IEEE Robt. and Autom. Lett., 3:2032–2039, Feb. 2018
2032
-
[54]
L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip. VECtor: A versatile event-centric benchmark for multi-sensor slam.IEEE Robot. and Autom. Lett., 7(3):8217–8224, June 2022
2022
-
[55]
P. Chen, W. Guan, F. Huang, Y. Zhong, W. W. Wen, L.-T. Hsu, and P. Lu. ECMD: An event-centric multisensory driving dataset for slam.IEEE Transactions on Intelligent Vehicles, 9:407–416, 2023. URL https://api.semanticscholar.org/CorpusID:265033288
2023
-
[56]
Chaney, F
K. Chaney, F. Cladera, Z. Wang, A. Bisulco, M. A. Hsieh, C. Korpela, V. Kumar, C. J. Taylor, and K. Daniilidis. M3ED: Multi-robot, multi-sensor, multi-environment event dataset. InIEEE Conf. Comput. Vis. Pattern Recog. Workshop
-
[57]
A. J. Lee, Y. Cho, Y. sik Shin, A. Kim, and H. Myung. ViViD++ : Vision for visibility dataset.IEEE Robotics and Automation Letters, 7:6282–6289, 2022
2022
-
[58]
Perot, P
E. Perot, P. de Tournemire, D. O. Nitti, J. Masci, and A. Sironi. Learning to detect objects with a 1 megapixel event camera.Neural Information Processing Systems, 2020
2020
-
[59]
J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD17: End-to-end davis driving dataset.ArXiv, abs/1711.01458, 2017. 14
Pith/arXiv arXiv 2017
-
[60]
Y. Hu, J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD20 End-to-End Event Camera Driving Dataset: Fusing frames and events with deep learning for improved steering prediction.2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pages 1–6, 2020
2020
-
[61]
Series h: Audiovisual and multimedia systems: Infrastructure of audiovisual services - cod- ing of moving video: High efficiency video coding
ITU-T. Series h: Audiovisual and multimedia systems: Infrastructure of audiovisual services - cod- ing of moving video: High efficiency video coding. Technical Report ITU-T H.265, International Telecommunication Union, 2026. Version 01/2026
2026
-
[62]
E. Olson. AprilTag: A robust and flexible visual fiducial system.IEEE International Conference on Robotics and Automation, 2011
2011
- [63]
-
[64]
Rehder, J
J. Rehder, J. Nikolic, T. Schneider, T. Hinzmann, and R. Siegwart. Extending kalibr: Calibrating the extrinsics of multiple IMUs and of individual axes. InIEEE International Conference on Robotics and Automation
-
[65]
Furgale, J
P. Furgale, J. Rehder, and R. Siegwart. Unified temporal and spatial calibration for multi-sensor systems. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1280–1286, 2013
2013
-
[66]
W. Kabsch. A solution for the best rotation to relate two sets of vectors.Acta Crystallographica Section A, 32:922–923, 1976
1976
-
[67]
S. Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:376–380, 1991
1991
-
[68]
Levenberg
K. Levenberg. A method for the solution of certain non – linear problems in least squares.Quarterly of Applied Mathematics, 2:164–168, 1944
1944
-
[69]
Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026. Open model release
2026
-
[70]
Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint 2506.05176, 2025
Pith/arXiv arXiv 2025
-
[71]
Douze, A
M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar ´e, M. Lomeli, L. Hosseini, and H. J´egou. The Faiss library. 2024
2024
-
[72]
Malladi, T
M. Malladi, T. Guadagnino, L. Lobefaro, and C. Stachniss. A robust approach for lidar-inertial odometry without sensor-specific modeling.IEEE Robotics and Automation Letters, 11(6):7420–7427, 2026
2026
-
[73]
R. Sapkota, R. H. Cheppally, A. Sharda, and M. Karkee. YOLO26: Key architectural enhancements and performance benchmarking for real-time object detection.arXiv preprint 2509.25164, 2025
arXiv 2025
-
[74]
Kerssies, N
T. Kerssies, N. Cavagnero, A. Hermans, N. Norouzi, G. Averta, B. Leibe, G. Dubbelman, and D. de Geus. Your ViT is Secretly an Image Segmentation Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[75]
Dosovitskiy, L
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.International Conference on Learning Representations, 2021
2021
-
[76]
Hawkes and P
T. Hawkes and P. Simonpieri. Signal coding using asynchronous delta modulation.IEEE Trans. on Comm., 22(5):729–731, March 1974
1974
-
[77]
Gallego, T
G. Gallego, T. Delbr¨ uck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022
2022
-
[78]
Delbruck
T. Delbruck. Frame-free dynamic digital vision. InProceedings of the International Symposium on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, pages 21–26, 2008
2008
-
[79]
Gerstner, W
W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski.Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press, 2014. 15
2014
-
[80]
Lagorce, G
X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman. HOTS: A hierarchy of event-based time-surfaces for pattern recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 39: 1346–1359, 2017
2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.