pith. machine review for the scientific record.

arxiv: 2604.03814 · v1 · submitted 2026-04-04 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:07 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords: relative camera pose estimation · in-cabin monitoring · fisheye cameras · transformer decoder · synthetic data · extrinsic calibration · metric scale · automotive vision

The pith

InCaRPose estimates absolute metric-scale relative poses for in-cabin fisheye cameras in one inference step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InCaRPose, a Transformer-based model that predicts the relative pose between pairs of highly distorted fisheye images taken inside a vehicle cabin. It uses frozen features from a backbone such as DINOv3 processed by a Transformer decoder to recover the geometric relationship between a reference and target view. The central advance is that this produces absolute metric-scale translation values that fall inside the physically plausible adjustment range of typical in-cabin camera mounts, all in a single forward pass. Such metric accuracy matters for safety-critical perception tasks in automotive interior monitoring, where real-world distances directly affect driver or occupant detection. The model is trained exclusively on synthetic data yet generalizes to real cabin scenes without needing identical camera intrinsics, runs in real time even with a small backbone, and performs competitively on the public 7-Scenes benchmark.

Core claim

InCaRPose is a Transformer-based architecture for robust relative pose prediction between image pairs from fisheye cameras in automotive interiors. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, the model captures the geometric relationship between a reference and a target view. It achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step. Trained exclusively on synthetic data, it generalizes to real-world cabin environments without relying on the exact same camera intrinsics, and it maintains high precision in both rotation and translation even with a ViT-Small backbone, enabling real-time inference for time-critical tasks such as driver monitoring.

What carries the argument

Transformer-based decoder that processes frozen DINOv3 backbone features to predict the geometric relationship and metric-scale translation between reference and target views.
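A minimal sketch of this component, assuming a learned pose token, a small decoder depth, and a 6-DoF axis-angle-plus-translation head; the paper's actual head, losses, and token layout may differ, and the backbone features below are random placeholders standing in for frozen DINOv3 tokens.

```python
import torch
import torch.nn as nn

class RelPoseDecoder(nn.Module):
    """Sketch of a cross-attention decoder over frozen backbone features."""
    def __init__(self, dim=384, depth=4, heads=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.pose_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Assumed 6-DoF head: axis-angle rotation (3) + metric translation in metres (3).
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 6))

    def forward(self, feats_ref, feats_tgt):
        # feats_*: (B, N_patches, dim) frozen backbone tokens for each view.
        tokens = torch.cat(
            [self.pose_token.expand(feats_tgt.size(0), -1, -1), feats_tgt], dim=1)
        # Cross-attend target-view tokens (queries) to reference-view tokens (memory).
        fused = self.decoder(tgt=tokens, memory=feats_ref)
        # Read the relative pose of the target w.r.t. the reference off the pose token.
        return self.head(fused[:, 0])  # (B, 6)

# Usage with a stand-in for a frozen ViT-Small backbone (the paper uses e.g. DINOv3):
dim = 384
model = RelPoseDecoder(dim=dim)
f_ref = torch.randn(2, 196, dim)  # placeholder frozen features, reference view
f_tgt = torch.randn(2, 196, dim)  # placeholder frozen features, target view
pose = model(f_ref, f_tgt)        # metric scale would come from supervised training
```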

If this is right

  • Delivers metric-scale translation in one step, removing the need for multi-view or iterative calibration routines in in-cabin settings (a composition sketch follows this list).
  • Generalizes from synthetic training data to real distorted cabin images without requiring matched camera intrinsics.
  • Achieves competitive accuracy on the public 7-Scenes dataset while using a small backbone suitable for real-time inference.
  • Enables precise distance measurements required for safety-relevant perception tasks such as driver monitoring.
  • Provides a new public real-world test dataset of highly distorted vehicle-interior image pairs for further research.
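For concreteness, updating a camera's extrinsics from one predicted relative pose is a single rigid-transform composition. The conventions below (reference-to-target direction, row layout) and all numbers are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def to_homogeneous(Rm, t):
    """Pack a rotation (3, 3) and translation (3,) into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = Rm, t
    return T

T_ref = to_homogeneous(np.eye(3), np.array([0.2, -0.1, 1.3]))  # known mount pose
R_rel = np.eye(3)                      # predicted relative rotation (placeholder)
t_rel = np.array([0.01, 0.03, -0.02])  # predicted metric translation (m)

# Updated target extrinsics in one step, no iterative calibration loop:
T_target = T_ref @ to_homogeneous(R_rel, t_rel)
print(T_target)
```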

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-step metric output could simplify factory calibration workflows for vehicle interior cameras by removing the need for specialized rigs.
  • The same synthetic-to-real transfer pattern might extend to pose estimation in other fixed-mount environments such as aircraft cabins or industrial inspection setups.
  • If the metric scale remains reliable under small mount shifts, the approach could support periodic online recalibration during vehicle operation without stopping the car.
  • Success with limited synthetic data suggests that domain-specific geometric tasks in constrained spaces may not always require large volumes of real labeled imagery.

Load-bearing premise

Training exclusively on synthetic data produces a model that generalizes pose estimation to real fisheye images in vehicle cabins even when the camera intrinsics do not match the training setup.

What would settle it

A quantitative test on real in-cabin fisheye images with varied intrinsics and mount positions: the claim would fail if the predicted translation error exceeded the typical physical adjustment range of cabin camera mounts on a majority of samples.
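Concretely, the test reduces to a threshold check on per-pair errors. A minimal sketch with simulated placeholders and an assumed 5 cm mount adjustment range; the real threshold would come from the hardware spec, and the real errors from the In-Cabin-Pose test set.

```python
import numpy as np

MOUNT_RANGE_M = 0.05  # placeholder mount adjustment range (assumption)

def translation_errors(t_pred, t_gt):
    """Per-pair Euclidean translation error in metres; arrays of shape (N, 3)."""
    return np.linalg.norm(t_pred - t_gt, axis=1)

def rotation_errors_deg(R_pred, R_gt):
    """Geodesic rotation error in degrees; arrays of shape (N, 3, 3)."""
    tr = np.einsum('nij,nij->n', R_gt, R_pred)  # trace(R_gt^T R_pred) per pair
    return np.degrees(np.arccos(np.clip((tr - 1.0) / 2.0, -1.0, 1.0)))

rng = np.random.default_rng(0)
t_gt = rng.uniform(-0.5, 0.5, (200, 3))          # placeholder ground truth
t_pred = t_gt + rng.normal(0.0, 0.02, (200, 3))  # placeholder predictions

frac = np.mean(translation_errors(t_pred, t_gt) > MOUNT_RANGE_M)
print(f"{frac:.1%} of pairs exceed the plausible mount adjustment range")
```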

Figures

Figures reproduced from arXiv: 2604.03814 by Felix Stillger, Frederik Hasecke, Lukas Hahn, Tobias Meisen.

Figure 1. Our InCaRPose predicts the relative camera pose between …
Figure 2. Camera coordinate system of the standard view compared …
Figure 4. InCaRPose's architecture overview. Two images are encoded by a frozen ViT backbone and fused by a cross-attention Transformer …
Figure 5. Qualitative results on real-world inference. All translation …
Figure 7. COLMAP fails to estimate translation along the …
Figure 6. Comparison of preprocessing methods. (a) and (b): …
Figure 8. Rotation error in degrees versus the number of parameters.
Figure 9. ArUco (orange) vs. COLMAP (blue) camera trajectories for intervals focused on specific transformations. Each sequence …
Figure 10. Inference on frames with physically occluded ArUco markers. We also changed the reference image, in which a different object …
(Captions truncated at source; full figures available on arXiv.)
read the original abstract

Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real-world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at https://github.com/felixstillger/InCaRPose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InCaRPose, a Transformer decoder operating on frozen DINOv3 (or similar) backbone features to predict relative camera pose from fisheye image pairs captured inside vehicle cabins. Trained exclusively on synthetic data with varied intrinsics, the model is claimed to recover absolute metric-scale translations within the physically plausible range of in-cabin mount adjustments in a single forward pass, to generalize to real-world distorted cabin imagery without matching intrinsics, and to achieve competitive accuracy on the 7-Scenes benchmark while enabling real-time inference with a ViT-Small backbone. The authors also release a new real-world In-Cabin-Pose test dataset and accompanying code.

Significance. If the central claims are substantiated, the work would be a useful contribution to constrained-environment extrinsic calibration for automotive interior monitoring. Strengths include the release of a real test set and code, the use of frozen backbones for efficiency, training with varied synthetic intrinsics to encourage generalization, and the direct production of metric-scale output without post-processing or scale recovery steps. These elements address practical needs in safety-critical perception where accurate real-world distances matter.

major comments (2)
  1. [Abstract and Results] Abstract and Results sections: competitive performance is asserted on both the new in-cabin test set and 7-Scenes, yet no quantitative error bars, standard deviations across runs, or statistical tests are reported; this omission directly weakens confidence in the metric-scale translation claim and the generalization statement.
  2. [Methods and Experiments] Methods and Experiments: the claim that synthetic-only training suffices for real-world metric-scale recovery rests on the model internalizing cabin geometry and distortion statistics, but the manuscript provides insufficient ablation or cross-intrinsic validation (e.g., testing on real images whose intrinsics differ substantially from the synthetic distribution) to confirm that the scale is not an artifact of the training distribution.
minor comments (2)
  1. Figure captions and legends should explicitly state the units and reference frames used for translation error (meters) and rotation error (degrees) to avoid ambiguity when comparing to prior work; the standard definitions are written out after this list.
  2. A direct comparison table against classical methods (e.g., essential-matrix decomposition followed by scale recovery) on the released real test set would strengthen the practical advantage claim.
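For reference, the error metrics these units correspond to, written out. This is a sketch of the presumably intended standard definitions, not a quote from the paper; the paper's exact conventions (reference frame, per-pair vs. aggregated) may differ.

```latex
\[
  e_t = \lVert \hat{t} - t \rVert_2 \ \text{[m]},
  \qquad
  e_R = \arccos\!\left(\frac{\operatorname{tr}(R^{\top}\hat{R}) - 1}{2}\right)
        \cdot \frac{180}{\pi} \ \text{[deg]}
\]
```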

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, agreeing on the need for stronger statistical reporting and additional validation experiments. We commit to incorporating these changes in the revised version.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results sections: competitive performance is asserted on both the new in-cabin test set and 7-Scenes, yet no quantitative error bars, standard deviations across runs, or statistical tests are reported; this omission directly weakens confidence in the metric-scale translation claim and the generalization statement.

    Authors: We agree that the lack of error bars and statistical analysis weakens the presentation of our results. In the revised manuscript we will report standard deviations across multiple independent training runs (different random seeds) for all reported metrics on both the In-Cabin-Pose test set and 7-Scenes. We will also add paired statistical tests against the baselines to quantify the significance of the observed improvements in metric-scale translation and generalization. revision: yes
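A sketch of what such a paired test could look like. The per-pair errors below are simulated placeholders, not the paper's numbers; in practice they would come from evaluating both methods on the same test pairs across several seeds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
err_model = rng.gamma(2.0, 0.01, size=500)                    # placeholder errors (m)
err_baseline = err_model + rng.normal(0.005, 0.01, size=500)  # placeholder baseline

# Wilcoxon signed-rank test on paired per-sample errors: robust to the
# heavy-tailed, non-Gaussian error distributions typical of pose estimation.
stat, p = stats.wilcoxon(err_model, err_baseline)
print(f"median model err {np.median(err_model):.4f} m, "
      f"baseline {np.median(err_baseline):.4f} m, p = {p:.3g}")
```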

  2. Referee: [Methods and Experiments] Methods and Experiments: the claim that synthetic-only training suffices for real-world metric-scale recovery rests on the model internalizing cabin geometry and distortion statistics, but the manuscript provides insufficient ablation or cross-intrinsic validation (e.g., testing on real images whose intrinsics differ substantially from the synthetic distribution) to confirm that the scale is not an artifact of the training distribution.

    Authors: We acknowledge that the current manuscript would benefit from more explicit cross-intrinsic validation. Our training already samples a broad range of synthetic intrinsics, and the released real test set contains images whose intrinsics lie outside that exact distribution. To directly address the concern, the revision will include a dedicated ablation that evaluates the model on real images with substantially different focal lengths and distortion parameters, demonstrating that metric-scale recovery generalizes rather than being an artifact of the training distribution. revision: yes
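One way the promised ablation could be sliced, sketched with simulated placeholders: bucket per-pair translation errors by the focal length of the capturing camera and report a robust statistic per bucket (distortion coefficients could be binned the same way). Any scale drift off the training distribution would show up as bin-dependent error.

```python
import numpy as np

rng = np.random.default_rng(1)
focal_px = rng.uniform(300.0, 900.0, 400)  # placeholder per-pair focal lengths (px)
errors_m = rng.gamma(2.0, 0.01, 400)       # placeholder translation errors (m)

# Quartile bin edges over the observed focal lengths, then per-bin medians.
edges = np.quantile(focal_px, [0.0, 0.25, 0.5, 0.75, 1.0])
bin_idx = np.clip(np.digitize(focal_px, edges) - 1, 0, 3)
for b in range(4):
    sel = bin_idx == b
    print(f"focal bin {edges[b]:.0f}-{edges[b+1]:.0f} px: "
          f"median err {np.median(errors_m[sel]):.3f} m over {sel.sum()} pairs")
```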

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central pipeline trains a Transformer decoder on frozen DINOv3 features using supervised synthetic data where ground-truth relative poses (including metric-scale translations) are known by construction from the rendering process. The absolute metric output is therefore a direct consequence of this external supervision rather than a self-defined or fitted quantity. Generalization claims are evaluated on a separately released real-world test set and the public 7-Scenes benchmark, with no load-bearing self-citations, uniqueness theorems, or ansatz smuggling required to close the derivation. The approach remains self-contained against external benchmarks and does not reduce any prediction to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are identifiable from the abstract; the approach relies on standard Transformer components and a frozen public backbone.

pith-pipeline@v0.9.0 · 5566 in / 1065 out tokens · 44908 ms · 2026-05-13T17:07:28.777278+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 4 internal anchors

  1. [1]

    NetVLAD: CNN architecture for weakly supervised place recognition

    Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.

  2. [2]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022.

  3. [3]

    RelocNet: Continuous metric learning relocalisation using neural nets

    Vassileios Balntas, Shuda Li, and Victor Prisacariu. RelocNet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–767, 2018.

  4. [4]

    Multi-HMR: Multi-person whole-body human mesh recovery in a single shot

    Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas. Multi-HMR: Multi-person whole-body human mesh recovery in a single shot. In European Conference on Computer Vision, pages 202–218. Springer, 2024.

  5. [5]

    MAGSAC++, a fast, reliable and accurate robust estimator

    Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. MAGSAC++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1304–1312, 2020.

  6. [6]

    Performance of event data recorders found in Toyota airbag control modules in high severity frontal oblique offset crash tests

    William Bortles and Ryan Hostetler. Performance of event data recorders found in Toyota airbag control modules in high severity frontal oblique offset crash tests. Technical report, SAE Technical Paper, 2019.

  7. [7]

    Learning less is more: 6D camera localization via 3D surface regression

    Eric Brachmann and Carsten Rother. Learning less is more: 6D camera localization via 3D surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4654–4662, 2018.

  8. [8]

    Visual camera re-localization from RGB and RGB-D images using DSAC

    Eric Brachmann and Carsten Rother. Visual camera re-localization from RGB and RGB-D images using DSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5847–5865, 2021.

  9. [9]

    DSAC: Differentiable RANSAC for camera localization

    Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. DSAC: Differentiable RANSAC for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6684–6692, 2017.

  10. [10]

    The OpenCV Library

    G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.

  11. [11]

    Wide-baseline relative camera pose estimation with directional learning

    Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3258–3268, 2021.

  12. [12]

    DFNet: Enhance absolute pose regression with direct feature matching

    Shuai Chen, Xinghui Li, Zirui Wang, and Victor A Prisacariu. DFNet: Enhance absolute pose regression with direct feature matching. In European Conference on Computer Vision, pages 1–17. Springer, 2022.

  13. [13]

    Neural refinement for absolute pose regression with feature synthesis

    Shuai Chen, Yash Bhalgat, Xinghui Li, Jia-Wang Bian, Kejie Li, Zirui Wang, and Victor Adrian Prisacariu. Neural refinement for absolute pose regression with feature synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20987–20996, 2024.

  14. [14]

    Map-relative pose regression for visual re-localization

    Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. Map-relative pose regression for visual re-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20665–20674, 2024.

  15. [15]

    Recording automotive crash event data

    Augustus Chidester, John Hinch, Thomas C Mercer, and Keith S Schultz. Recording automotive crash event data. In Transportation Recording: 2000 and Beyond, International Symposium on Transportation Recorders, National Transportation Safety Board and International Transportation Safety Association, 1999.

  16. [16]

    Blender: a 3D modelling and rendering package

    Blender Online Community. Blender: a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.

  17. [17]

    Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16739–16752, 2025.

  18. [18]

    D2-Net: A trainable CNN for joint detection and description of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A trainable CNN for joint detection and description of local features. In CVPR 2019 - IEEE Conference on Computer Vision and Pattern Recognition, 2019.

  19. [19]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 27, 2014.

  20. [20]

    RPNet: An end-to-end network for relative camera pose estimation

    Sovann En, Alexis Lechervy, and Frédéric Jurie. RPNet: An end-to-end network for relative camera pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

  21. [21]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

  22. [22]

    Automatic generation and detection of highly reliable fiducial markers under occlusion

    Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco José Madrid-Cuevas, and Manuel Jesús Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014.

  23. [23]

    Generation of fiducial marker dictionaries using mixed integer linear programming

    Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco José Madrid-Cuevas, and Rafael Medina-Carnicer. Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recognition, 51:481–491, 2016.

  24. [24]

    Matrix computations

    Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 2013.

  25. [25]

    Multiple view geometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

  26. [26]

    Revisiting multimodal positional encoding in vision-language models

    Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision-language models. arXiv preprint arXiv:2510.23095, 2025.

  27. [27]

    Robust image retrieval-based visual localization using Kapture

    Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Vincent Leroy, Jérôme Revaud, Philippe Rerole, Noé Pion, Cesar De Souza, and Gabriela Csurka. Robust image retrieval-based visual localization using Kapture. arXiv preprint arXiv:2007.13867, 2020.

  28. [28]

    Learning to localize in unseen scenes with relative pose regressors

    Ofer Idan, Yoli Shavit, and Yosi Keller. Learning to localize in unseen scenes with relative pose regressors. arXiv preprint arXiv:2303.02717, 2023.

  29. [29]

    Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments

    Ofer Idan, Yoli Shavit, and Yosi Keller. Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments. Computer Vision and Image Understanding, page 104629, 2026.

  30. [30]

    In-cabin sensing 2024–2034: Technologies, opportunities and markets

    IDTechEx. In-cabin sensing 2024–2034: Technologies, opportunities and markets. Technical report, IDTechEx Research, 2024. Market analysis report projecting rapid growth of in-cabin sensing driven by regulation and OEM adoption.

  31. [31]

    Road vehicles — vehicle dynamics and road-holding ability — vocabulary

    International Organization for Standardization. Road vehicles — vehicle dynamics and road-holding ability — vocabulary.

  32. [32]

    Geometric loss functions for camera pose regression with deep learning

    Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5974–5983, 2017.

  33. [33]

    PoseNet: A convolutional network for real-time 6-DoF camera relocalization

    Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2938–2946, 2015.

  34. [34]

    Quaternions and rotation sequences: a primer with applications to orbits, aerospace, and virtual reality

    Jack B Kuipers. Quaternions and Rotation Sequences: A Primer with Applications to Orbits, Aerospace, and Virtual Reality. Princeton University Press, 1999.

  35. [35]

    Camera relocalization by computing pairwise relative poses using convolutional neural network

    Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, and Juho Kannala. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 929–938, 2017.

  36. [36]

    Grounding image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.

  37. [37]

    An analysis of SVD for deep rotation estimation

    Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, and Ameesh Makadia. An analysis of SVD for deep rotation estimation. Advances in Neural Information Processing Systems, 33:22554–22565, 2020.

  38. [38]

    Learning neural volumetric pose features for camera localization

    Jingyu Lin, Jiaqi Gu, Bojian Wu, Lubin Fan, Renjie Chen, Ligang Liu, and Jieping Ye. Learning neural volumetric pose features for camera localization. In European Conference on Computer Vision, pages 198–214. Springer, 2024.

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  40. [40]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

  41. [41]

    Relative camera pose estimation using convolutional neural networks

    Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Relative camera pose estimation using convolutional neural networks. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 675–…, 2017.

  42. [42]

    LENS: Localization enhanced by NeRF synthesis

    Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. LENS: Localization enhanced by NeRF synthesis. In Conference on Robot Learning, pages 1347–1356. PMLR, 2022.

  43. [43]

    Fast approximate nearest neighbors with automatic algorithm configuration

    Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision Theory and Applications, pages 331–340. Scitepress, 2009.

  44. [44]

    Deep learning-based gaze detection system for automobile drivers using a NIR camera sensor

    Rizwan Ali Naqvi, Muhammad Arsalan, Ganbayar Batchuluun, Hyo Sik Yoon, and Kang Ryoung Park. Deep learning-based gaze detection system for automobile drivers using a NIR camera sensor. Sensors, 18(2):456, 2018.

  45. [45]

    An efficient solution to the five-point relative pose problem

    David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.

  46. [46]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  47. [47]

    Automotive interior monitoring systems: A review of selected technical solutions for the recognition of fatigue symptoms in motor vehicle drivers

    Marcin Piotrowski, Lukasz Dziuda, and Paulina Baran. Automotive interior monitoring systems: A review of selected technical solutions for the recognition of fatigue symptoms in motor vehicle drivers. The Polish Journal of Aviation Medicine, Bioengineering and Psychology, 28:31–41, 2025.

  48. [48]

    Des lois géométriques qui régissent les déplacements d'un système solide dans l'espace

    Olinde Rodrigues. Des lois géométriques qui régissent les déplacements d'un système solide dans l'espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de mathématiques pures et appliquées, 5:380–440, 1840.

  49. [49]

    In-cabin sensing has automakers looking inward

    Salvatore Salamone. In-cabin sensing has automakers looking inward. RTInsights, 2025. Industry analysis of OEM adoption of in-cabin sensing driven by safety ratings and semi-autonomous driving.

  50. [50]

    DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers

    Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Thomas Lucas, Pau De Jorge, Diane Larlus, and Yannis Kalantidis. DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30084–30094, 2025.

  51. [51]

    SuperGlue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.

  52. [52]

    Understanding the limitations of CNN-based absolute camera pose regression

    Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixé. Understanding the limitations of CNN-based absolute camera pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3302–3312, 2019.

  53. [53]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  54. [54]

    Scene coordinate regression forests for camera relocalization in RGB-D images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.

  55. [55]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  56. [56]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  57. [57]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.

  58. [58]

    Active NIR illumination for improved camera view in automated driving application

    Max C Sundermeier, Hauke Dierend, Peer-Phillip Ley, Alexander Wolf, and Roland Lachmayer. Active NIR illumination for improved camera view in automated driving application. In Light-Emitting Devices, Materials, and Applications XXVI, pages 54–62. SPIE, 2022.

  59. [59]

    Validation of event data recorders in high severity full-frontal crash tests

    Ada Tsoi, John Hinch, Richard Ruth, and Hampton Gabler. Validation of event data recorders in high severity full-frontal crash tests. SAE International Journal of Transportation Safety, 1(2013-01-1265):76–99, 2013.

  60. [60]

    Visual camera re-localization using graph neural networks and relative pose supervision

    Mehmet Ozgur Turkoglu, Eric Brachmann, Konrad Schindler, Gabriel J Brostow, and Aron Monszpart. Visual camera re-localization using graph neural networks and relative pose supervision. In 2021 International Conference on 3D Vision (3DV), pages 145–155. IEEE, 2021.

  61. [61]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  62. [62]

    Absolute pose from one or two scaled and oriented features

    Jonathan Ventura, Zuzana Kukelova, Torsten Sattler, and Dániel Baráth. Absolute pose from one or two scaled and oriented features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20870–20880, 2024.

  63. [63]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  64. [64]

    Summarizing regional regulations for mandating driver monitoring systems

    Yulin Wang. Summarizing regional regulations for mandating driver monitoring systems. Automation.com, 2023. Overview of global driver monitoring regulations including EU GSR and ADDW requirements.

  65. [65]

    CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion. Advances in Neural Information Processing Systems, 35:3502–3516, 2022.

  66. [66]

    Learning to localize in new environments from synthetic training data

    Dominik Winkelbauer, Maximilian Denninger, and Rudolph Triebel. Learning to localize in new environments from synthetic training data. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 5840–5846. IEEE, 2021.

  67. [67]

    SpatialFormer: Towards generalizable vision transformers with explicit spatial understanding

    Han Xiao, Wenzhao Zheng, Sicheng Zuo, Peng Gao, Jie Zhou, and Jiwen Lu. SpatialFormer: Towards generalizable vision transformers with explicit spatial understanding. In European Conference on Computer Vision, pages 37–54. Springer, 2024.

  68. [68]

    Towards robust probabilistic modeling on SO(3) via rotation Laplace distribution

    Yingda Yin, Jiangran Lyu, Yang Wang, Haoran Liu, He Wang, and Baoquan Chen. Towards robust probabilistic modeling on SO(3) via rotation Laplace distribution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3469–3486, 2025.

  69. [69]

    Image based localization in urban environments

    Wei Zhang and Jana Kosecka. Image based localization in urban environments. In Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT'06), pages 33–40. IEEE, 2006.

  70. [70]

    Supplementary Material: Table 6 (inference runtime)

    Supplementary Material of InCaRPose, Sec. 7.1, Table 6: detailed inference runtime measurements on a single NVIDIA RTX 4090 GPU, reporting average per-frame latency (ms), frames per second (FPS), and relative speedup over the FP32 baseline at the corresponding backbone and resolution. (table truncated at source)

  71. [71]

    Rotation vector (axis-angle) parameterization

    Rotation Vector ($\mathbb{R}^3$): the rotation is represented by a compact axis-angle vector $\omega$; its direction specifies the rotation axis $u$ and its magnitude the rotation angle $\theta = \lVert \omega \rVert_2$ in radians. The mapping to a rotation matrix $R$ is given by Rodrigues' [48] formula
    $$R = I + \frac{\sin\theta}{\theta}\,[\omega]_\times + \frac{1-\cos\theta}{\theta^2}\,[\omega]_\times^2 \quad (3)$$
    where $[\omega]_\times$ is the skew-symmetric matrix of $\omega$.

  72. [72]

    Euler angles, intrinsic rotation

    Euler Angles, Intrinsic Rotation ($\mathbb{R}^3$): intrinsic rotations (moving axes) follow the standard ZYX convention. Given angles $(\alpha, \beta, \gamma)$, the rotation matrix is composed of successive rotations around the transformed axes:
    $$R_{\mathrm{int}} = R_z(\alpha)\, R_{y'}(\beta)\, R_{x''}(\gamma) \quad (4)$$
    The final output is $y = [\alpha, \beta, \gamma, t_x, t_y, t_z]^\top$.

  73. [73]

    Euler angles, extrinsic rotation

    Euler Angles, Extrinsic Rotation ($\mathbb{R}^3$): extrinsic rotations are performed around the fixed global axes $(X, Y, Z)$. For a sequence $(\gamma, \beta, \alpha)$, the resulting matrix is
    $$R_{\mathrm{ext}} = R_z(\alpha)\, R_y(\beta)\, R_x(\gamma) \quad (5)$$
    The final output is $y = [\gamma, \beta, \alpha, t_x, t_y, t_z]^\top$.

  74. [74]

    Quaternion parameterization

    Quaternions ($\mathbb{R}^4$): the rotation is represented by a unit quaternion $q = [w, x, y, z]^\top$ with $\lVert q \rVert_2 = 1$. The mapping to $R$ is
    $$R = \begin{pmatrix} 1-2(y^2+z^2) & 2(xy-wz) & 2(xz+wy) \\ 2(xy+wz) & 1-2(x^2+z^2) & 2(yz-wx) \\ 2(xz-wy) & 2(yz+wx) & 1-2(x^2+y^2) \end{pmatrix} \quad (6)$$
    The final output is $y = [q^\top, t^\top]^\top$.

  75. [75]

    Rotation matrix parameterization

    Rotation Matrix ($\mathbb{R}^9$): the rotation is represented directly by the flattened elements of $R \in \mathbb{R}^{3\times 3}$, which must satisfy the constraints of the special orthogonal group
    $$SO(3) = \{\, R \in \mathbb{R}^{3\times 3} : R^\top R = I,\ \det(R) = +1 \,\} \quad (7)$$
    The final output is the flattened nine elements of $R$ followed by $t$: $y = [r_{11}, r_{12}, \ldots, r_{33}, t_x, t_y, t_z]^\top$. (truncated at source)
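The four rotation parameterizations listed in these internal anchors all describe the same element of SO(3). A minimal consistency sketch using SciPy (not the paper's code); note SciPy stores quaternions scalar-last as [x, y, z, w], whereas the anchor above uses scalar-first [w, x, y, z].

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# One rotation, expressed in the representations from the anchors above.
r = R.from_rotvec([0.1, -0.3, 0.25])  # axis-angle omega, theta = ||omega||_2
M = r.as_matrix()                      # the flattened R9 form of eq. (7)

# Intrinsic ZYX (eq. 4) and extrinsic zyx (eq. 5) Euler round-trips:
assert np.allclose(R.from_euler('ZYX', r.as_euler('ZYX')).as_matrix(), M)
assert np.allclose(R.from_euler('zyx', r.as_euler('zyx')).as_matrix(), M)

# Quaternion round-trip (eq. 6), in SciPy's scalar-last convention:
assert np.allclose(R.from_quat(r.as_quat()).as_matrix(), M)

# SO(3) constraints of eq. (7):
assert np.allclose(M.T @ M, np.eye(3)) and np.isclose(np.linalg.det(M), 1.0)
print("all parameterizations agree")
```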