pith. machine review for the scientific record.

arxiv: 2604.03814 · v1 · submitted 2026-04-04 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:07 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords: relative camera pose estimation · in-cabin monitoring · fisheye cameras · transformer decoder · synthetic data · extrinsic calibration · metric scale · automotive vision

The pith

InCaRPose estimates absolute metric-scale relative poses for in-cabin fisheye cameras in one inference step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InCaRPose, a Transformer-based model that predicts the relative pose between pairs of highly distorted fisheye images taken inside a vehicle cabin. It uses frozen features from a backbone such as DINOv3 processed by a Transformer decoder to recover the geometric relationship between a reference and target view. The central advance is that this produces absolute metric-scale translation values that fall inside the physically plausible adjustment range of typical in-cabin camera mounts, all in a single forward pass. Such metric accuracy matters for safety-critical perception tasks in automotive interior monitoring, where real-world distances directly affect driver or occupant detection. The model is trained exclusively on synthetic data yet generalizes to real cabin scenes without needing identical camera intrinsics, runs in real time even with a small backbone, and performs competitively on the public 7-Scenes benchmark.

Core claim

InCaRPose is a Transformer-based architecture for robust relative pose prediction between image pairs from fisheye cameras in automotive interiors. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, the model captures the geometric relationship between a reference and a target view. It achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step. Trained exclusively on synthetic data, it generalizes to real-world cabin environments without relying on the exact same camera intrinsics, and it maintains high precision in both rotation and translation even with a ViT-Small backbone, enabling real-time inference for time-critical tasks such as driver monitoring.

What carries the argument

Transformer-based decoder that processes frozen DINOv3 backbone features to predict the geometric relationship and metric-scale translation between reference and target views.
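A minimal sketch of this component, assuming a learned pose token, a small decoder depth, and a 6-DoF axis-angle-plus-translation head; the paper's actual head, losses, and token layout may differ, and the backbone features below are random placeholders standing in for frozen DINOv3 tokens.

```python
import torch
import torch.nn as nn

class RelPoseDecoder(nn.Module):
    """Sketch of a cross-attention decoder over frozen backbone features."""
    def __init__(self, dim=384, depth=4, heads=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.pose_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Assumed 6-DoF head: axis-angle rotation (3) + metric translation in metres (3).
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 6))

    def forward(self, feats_ref, feats_tgt):
        # feats_*: (B, N_patches, dim) frozen backbone tokens for each view.
        tokens = torch.cat(
            [self.pose_token.expand(feats_tgt.size(0), -1, -1), feats_tgt], dim=1)
        # Cross-attend target-view tokens (queries) to reference-view tokens (memory).
        fused = self.decoder(tgt=tokens, memory=feats_ref)
        # Read the relative pose of the target w.r.t. the reference off the pose token.
        return self.head(fused[:, 0])  # (B, 6)

# Usage with a stand-in for a frozen ViT-Small backbone (the paper uses e.g. DINOv3):
dim = 384
model = RelPoseDecoder(dim=dim)
f_ref = torch.randn(2, 196, dim)  # placeholder frozen features, reference view
f_tgt = torch.randn(2, 196, dim)  # placeholder frozen features, target view
pose = model(f_ref, f_tgt)        # metric scale would come from supervised training
```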

If this is right

  • Delivers metric-scale translation in one step, removing the need for multi-view or iterative calibration routines in in-cabin settings (a composition sketch follows this list).
  • Generalizes from synthetic training data to real distorted cabin images without requiring matched camera intrinsics.
  • Achieves competitive accuracy on the public 7-Scenes dataset while using a small backbone suitable for real-time inference.
  • Enables precise distance measurements required for safety-relevant perception tasks such as driver monitoring.
  • Provides a new public real-world test dataset of highly distorted vehicle-interior image pairs for further research.
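For concreteness, updating a camera's extrinsics from one predicted relative pose is a single rigid-transform composition. The conventions below (reference-to-target direction, row layout) and all numbers are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def to_homogeneous(Rm, t):
    """Pack a rotation (3, 3) and translation (3,) into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = Rm, t
    return T

T_ref = to_homogeneous(np.eye(3), np.array([0.2, -0.1, 1.3]))  # known mount pose
R_rel = np.eye(3)                      # predicted relative rotation (placeholder)
t_rel = np.array([0.01, 0.03, -0.02])  # predicted metric translation (m)

# Updated target extrinsics in one step, no iterative calibration loop:
T_target = T_ref @ to_homogeneous(R_rel, t_rel)
print(T_target)
```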

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-step metric output could simplify factory calibration workflows for vehicle interior cameras by removing the need for specialized rigs.
  • The same synthetic-to-real transfer pattern might extend to pose estimation in other fixed-mount environments such as aircraft cabins or industrial inspection setups.
  • If the metric scale remains reliable under small mount shifts, the approach could support periodic online recalibration during vehicle operation without stopping the car.
  • Success with limited synthetic data suggests that domain-specific geometric tasks in constrained spaces may not always require large volumes of real labeled imagery.

Load-bearing premise

Training exclusively on synthetic data produces a model that generalizes pose estimation to real fisheye images in vehicle cabins even when the camera intrinsics do not match the training setup.

What would settle it

A quantitative test on real in-cabin fisheye images with varied intrinsics and mount positions: the claim would fail if the predicted translation error exceeded the typical physical adjustment range of cabin camera mounts on a majority of samples.
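Concretely, the test reduces to a threshold check on per-pair errors. A minimal sketch with simulated placeholders and an assumed 5 cm mount adjustment range; the real threshold would come from the hardware spec, and the real errors from the In-Cabin-Pose test set.

```python
import numpy as np

MOUNT_RANGE_M = 0.05  # placeholder mount adjustment range (assumption)

def translation_errors(t_pred, t_gt):
    """Per-pair Euclidean translation error in metres; arrays of shape (N, 3)."""
    return np.linalg.norm(t_pred - t_gt, axis=1)

def rotation_errors_deg(R_pred, R_gt):
    """Geodesic rotation error in degrees; arrays of shape (N, 3, 3)."""
    tr = np.einsum('nij,nij->n', R_gt, R_pred)  # trace(R_gt^T R_pred) per pair
    return np.degrees(np.arccos(np.clip((tr - 1.0) / 2.0, -1.0, 1.0)))

rng = np.random.default_rng(0)
t_gt = rng.uniform(-0.5, 0.5, (200, 3))          # placeholder ground truth
t_pred = t_gt + rng.normal(0.0, 0.02, (200, 3))  # placeholder predictions

frac = np.mean(translation_errors(t_pred, t_gt) > MOUNT_RANGE_M)
print(f"{frac:.1%} of pairs exceed the plausible mount adjustment range")
```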

Figures

Figures reproduced from arXiv: 2604.03814 by Felix Stillger, Frederik Hasecke, Lukas Hahn, Tobias Meisen.

Figure 1. Our InCaRPose predicts the relative camera pose between …
Figure 2. Camera coordinate system of the standard view compared …
Figure 4. InCaRPose's architecture overview. Two images are encoded by a frozen ViT backbone and fused by a cross-attention Transformer …
Figure 5. Qualitative results on real-world inference. All translation …
Figure 7. COLMAP fails to estimate translation along the …
Figure 6. Comparison of preprocessing methods. (a) and (b): …
Figure 8. Rotation error in degrees versus the number of parameters.
Figure 9. ArUco (orange) vs. COLMAP (blue) camera trajectories for intervals focused on specific transformations. Each sequence …
Figure 10. Inference on frames with physically occluded ArUco markers. We also changed the reference image, in which a different object …
(Captions truncated at source; full figures available on arXiv.)
read the original abstract

Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real-world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at https://github.com/felixstillger/InCaRPose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InCaRPose, a Transformer decoder operating on frozen DINOv3 (or similar) backbone features to predict relative camera pose from fisheye image pairs captured inside vehicle cabins. Trained exclusively on synthetic data with varied intrinsics, the model is claimed to recover absolute metric-scale translations within the physically plausible range of in-cabin mount adjustments in a single forward pass, to generalize to real-world distorted cabin imagery without matching intrinsics, and to achieve competitive accuracy on the 7-Scenes benchmark while enabling real-time inference with a ViT-Small backbone. The authors also release a new real-world In-Cabin-Pose test dataset and accompanying code.

Significance. If the central claims are substantiated, the work would be a useful contribution to constrained-environment extrinsic calibration for automotive interior monitoring. Strengths include the release of a real test set and code, the use of frozen backbones for efficiency, training with varied synthetic intrinsics to encourage generalization, and the direct production of metric-scale output without post-processing or scale recovery steps. These elements address practical needs in safety-critical perception where accurate real-world distances matter.

major comments (2)
  1. [Abstract and Results] Abstract and Results sections: competitive performance is asserted on both the new in-cabin test set and 7-Scenes, yet no quantitative error bars, standard deviations across runs, or statistical tests are reported; this omission directly weakens confidence in the metric-scale translation claim and the generalization statement.
  2. [Methods and Experiments] Methods and Experiments: the claim that synthetic-only training suffices for real-world metric-scale recovery rests on the model internalizing cabin geometry and distortion statistics, but the manuscript provides insufficient ablation or cross-intrinsic validation (e.g., testing on real images whose intrinsics differ substantially from the synthetic distribution) to confirm that the scale is not an artifact of the training distribution.
minor comments (2)
  1. Figure captions and legends should explicitly state the units and reference frames used for translation error (meters) and rotation error (degrees) to avoid ambiguity when comparing to prior work; the standard definitions are written out after this list.
  2. A direct comparison table against classical methods (e.g., essential-matrix decomposition followed by scale recovery) on the released real test set would strengthen the practical advantage claim.
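For reference, the error metrics these units correspond to, written out. This is a sketch of the presumably intended standard definitions, not a quote from the paper; the paper's exact conventions (reference frame, per-pair vs. aggregated) may differ.

```latex
\[
  e_t = \lVert \hat{t} - t \rVert_2 \ \text{[m]},
  \qquad
  e_R = \arccos\!\left(\frac{\operatorname{tr}(R^{\top}\hat{R}) - 1}{2}\right)
        \cdot \frac{180}{\pi} \ \text{[deg]}
\]
```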

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, agreeing on the need for stronger statistical reporting and additional validation experiments. We commit to incorporating these changes in the revised version.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results sections: competitive performance is asserted on both the new in-cabin test set and 7-Scenes, yet no quantitative error bars, standard deviations across runs, or statistical tests are reported; this omission directly weakens confidence in the metric-scale translation claim and the generalization statement.

    Authors: We agree that the lack of error bars and statistical analysis weakens the presentation of our results. In the revised manuscript we will report standard deviations across multiple independent training runs (different random seeds) for all reported metrics on both the In-Cabin-Pose test set and 7-Scenes. We will also add paired statistical tests against the baselines to quantify the significance of the observed improvements in metric-scale translation and generalization. revision: yes
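A sketch of what such a paired test could look like. The per-pair errors below are simulated placeholders, not the paper's numbers; in practice they would come from evaluating both methods on the same test pairs across several seeds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
err_model = rng.gamma(2.0, 0.01, size=500)                    # placeholder errors (m)
err_baseline = err_model + rng.normal(0.005, 0.01, size=500)  # placeholder baseline

# Wilcoxon signed-rank test on paired per-sample errors: robust to the
# heavy-tailed, non-Gaussian error distributions typical of pose estimation.
stat, p = stats.wilcoxon(err_model, err_baseline)
print(f"median model err {np.median(err_model):.4f} m, "
      f"baseline {np.median(err_baseline):.4f} m, p = {p:.3g}")
```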

  2. Referee: [Methods and Experiments] Methods and Experiments: the claim that synthetic-only training suffices for real-world metric-scale recovery rests on the model internalizing cabin geometry and distortion statistics, but the manuscript provides insufficient ablation or cross-intrinsic validation (e.g., testing on real images whose intrinsics differ substantially from the synthetic distribution) to confirm that the scale is not an artifact of the training distribution.

    Authors: We acknowledge that the current manuscript would benefit from more explicit cross-intrinsic validation. Our training already samples a broad range of synthetic intrinsics, and the released real test set contains images whose intrinsics lie outside that exact distribution. To directly address the concern, the revision will include a dedicated ablation that evaluates the model on real images with substantially different focal lengths and distortion parameters, demonstrating that metric-scale recovery generalizes rather than being an artifact of the training distribution. revision: yes
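One way the promised ablation could be sliced, sketched with simulated placeholders: bucket per-pair translation errors by the focal length of the capturing camera and report a robust statistic per bucket (distortion coefficients could be binned the same way). Any scale drift off the training distribution would show up as bin-dependent error.

```python
import numpy as np

rng = np.random.default_rng(1)
focal_px = rng.uniform(300.0, 900.0, 400)  # placeholder per-pair focal lengths (px)
errors_m = rng.gamma(2.0, 0.01, 400)       # placeholder translation errors (m)

# Quartile bin edges over the observed focal lengths, then per-bin medians.
edges = np.quantile(focal_px, [0.0, 0.25, 0.5, 0.75, 1.0])
bin_idx = np.clip(np.digitize(focal_px, edges) - 1, 0, 3)
for b in range(4):
    sel = bin_idx == b
    print(f"focal bin {edges[b]:.0f}-{edges[b+1]:.0f} px: "
          f"median err {np.median(errors_m[sel]):.3f} m over {sel.sum()} pairs")
```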

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central pipeline trains a Transformer decoder on frozen DINOv3 features using supervised synthetic data where ground-truth relative poses (including metric-scale translations) are known by construction from the rendering process. The absolute metric output is therefore a direct consequence of this external supervision rather than a self-defined or fitted quantity. Generalization claims are evaluated on a separately released real-world test set and the public 7-Scenes benchmark, with no load-bearing self-citations, uniqueness theorems, or ansatz smuggling required to close the derivation. The approach remains self-contained against external benchmarks and does not reduce any prediction to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are identifiable from the abstract; the approach relies on standard Transformer components and a frozen public backbone.

pith-pipeline@v0.9.0 · 5566 in / 1065 out tokens · 44908 ms · 2026-05-13T17:07:28.777278+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 4 internal anchors

  1. [1]

    NetVLAD: CNN architecture for weakly supervised place recognition

    Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.

  2. [2]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022.

  3. [3]

    RelocNet: Continuous metric learning relocalisation using neural nets

    Vassileios Balntas, Shuda Li, and Victor Prisacariu. RelocNet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–767, 2018.

  4. [4]

    Multi-HMR: Multi-person whole-body human mesh recovery in a single shot

    Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas. Multi-HMR: Multi-person whole-body human mesh recovery in a single shot. In European Conference on Computer Vision, pages 202–218. Springer, 2024.

  5. [5]

    MAGSAC++, a fast, reliable and accurate robust estimator

    Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. MAGSAC++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1304–1312, 2020.

  6. [6]

    Performance of event data recorders found in Toyota airbag control modules in high severity frontal oblique offset crash tests

    William Bortles and Ryan Hostetler. Performance of event data recorders found in Toyota airbag control modules in high severity frontal oblique offset crash tests. Technical report, SAE Technical Paper, 2019.

  7. [7]

    Learning less is more: 6D camera localization via 3D surface regression

    Eric Brachmann and Carsten Rother. Learning less is more: 6D camera localization via 3D surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4654–4662, 2018.

  8. [8]

    Visual camera re-localization from RGB and RGB-D images using DSAC

    Eric Brachmann and Carsten Rother. Visual camera re-localization from RGB and RGB-D images using DSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5847–5865, 2021.

  9. [9]

    DSAC: Differentiable RANSAC for camera localization

    Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. DSAC: Differentiable RANSAC for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6684–6692, 2017.

  10. [10]

    The OpenCV Library

    G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.

  11. [11]

    Wide-baseline relative camera pose estimation with directional learning

    Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3258–3268, 2021.

  12. [12]

    DFNet: Enhance absolute pose regression with direct feature matching

    Shuai Chen, Xinghui Li, Zirui Wang, and Victor A Prisacariu. DFNet: Enhance absolute pose regression with direct feature matching. In European Conference on Computer Vision, pages 1–17. Springer, 2022.

  13. [13]

    Neural refinement for absolute pose regression with feature synthesis

    Shuai Chen, Yash Bhalgat, Xinghui Li, Jia-Wang Bian, Kejie Li, Zirui Wang, and Victor Adrian Prisacariu. Neural refinement for absolute pose regression with feature synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20987–20996, 2024.

  14. [14]

    Map-relative pose regression for visual re-localization

    Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. Map-relative pose regression for visual re-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20665–20674, 2024.

  15. [15]

    Recording automotive crash event data

    Augustus Chidester, John Hinch, Thomas C Mercer, and Keith S Schultz. Recording automotive crash event data. In Transportation Recording: 2000 and Beyond, International Symposium on Transportation Recorders, National Transportation Safety Board and International Transportation Safety Association, 1999.

  16. [16]

    Blender: a 3D modelling and rendering package

    Blender Online Community. Blender: a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.

  17. [17]

    Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16739–16752, 2025.

  18. [18]

    D2-Net: A trainable CNN for joint detection and description of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A trainable CNN for joint detection and description of local features. In CVPR 2019 - IEEE Conference on Computer Vision and Pattern Recognition, 2019.

  19. [19]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 27, 2014.

  20. [20]

    RPNet: An end-to-end network for relative camera pose estimation

    Sovann En, Alexis Lechervy, and Frédéric Jurie. RPNet: An end-to-end network for relative camera pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

  21. [21]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

  22. [22]

    Automatic generation and detection of highly reliable fiducial markers under occlusion

    Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco José Madrid-Cuevas, and Manuel Jesús Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014.

  23. [23]

    Generation of fiducial marker dictionaries using mixed integer linear programming

    Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco José Madrid-Cuevas, and Rafael Medina-Carnicer. Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recognition, 51:481–491, 2016.

  24. [24]

    Matrix computations

    Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 2013.

  25. [25]

    Multiple view geometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

  26. [26]

    Revisiting multimodal positional encoding in vision-language models

    Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision-language models. arXiv preprint arXiv:2510.23095, 2025.

  27. [27]

    Robust image retrieval-based visual localization using Kapture

    Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Vincent Leroy, Jérôme Revaud, Philippe Rerole, Noé Pion, Cesar De Souza, and Gabriela Csurka. Robust image retrieval-based visual localization using Kapture. arXiv preprint arXiv:2007.13867, 2020.

  28. [28]

    Learning to localize in unseen scenes with relative pose regressors

    Ofer Idan, Yoli Shavit, and Yosi Keller. Learning to localize in unseen scenes with relative pose regressors. arXiv preprint arXiv:2303.02717, 2023.

  29. [29]

    Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments

    Ofer Idan, Yoli Shavit, and Yosi Keller. Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments. Computer Vision and Image Understanding, page 104629, 2026.

  30. [30]

    In-cabin sensing 2024–2034: Technologies, opportunities and markets

    IDTechEx. In-cabin sensing 2024–2034: Technologies, opportunities and markets. Technical report, IDTechEx Research, 2024. Market analysis report projecting rapid growth of in-cabin sensing driven by regulation and OEM adoption.

  31. [31]

    Road vehicles — vehicle dynamics and road-holding ability — vocabulary

    International Organization for Standardization. Road vehicles — vehicle dynamics and road-holding ability — vocabulary.

  32. [32]

    Geometric loss functions for camera pose regression with deep learning

    Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5974–5983, 2017.

  33. [33]

    PoseNet: A convolutional network for real-time 6-DoF camera relocalization

    Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2938–2946, 2015.

  34. [34]

    Quaternions and rotation sequences: a primer with applications to orbits, aerospace, and virtual reality

    Jack B Kuipers. Quaternions and Rotation Sequences: A Primer with Applications to Orbits, Aerospace, and Virtual Reality. Princeton University Press, 1999.

  35. [35]

    Camera relocalization by computing pairwise relative poses using convolutional neural network

    Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, and Juho Kannala. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 929–938, 2017.

  36. [36]

    Grounding image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.

  37. [37]

    An analysis of SVD for deep rotation estimation

    Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, and Ameesh Makadia. An analysis of SVD for deep rotation estimation. Advances in Neural Information Processing Systems, 33:22554–22565, 2020.

  38. [38]

    Learning neural volumetric pose features for camera localization

    Jingyu Lin, Jiaqi Gu, Bojian Wu, Lubin Fan, Renjie Chen, Ligang Liu, and Jieping Ye. Learning neural volumetric pose features for camera localization. In European Conference on Computer Vision, pages 198–214. Springer, 2024.

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  40. [40]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

  41. [41]

    Relative camera pose estimation using convolutional neural networks

    Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Relative camera pose estimation using convolutional neural networks. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 675–…, 2017.

  42. [42]

    LENS: Localization enhanced by NeRF synthesis

    Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. LENS: Localization enhanced by NeRF synthesis. In Conference on Robot Learning, pages 1347–1356. PMLR, 2022.

  43. [43]

    Fast approximate nearest neighbors with automatic algorithm configuration

    Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision Theory and Applications, pages 331–340. Scitepress, 2009.

  44. [44]

    Deep learning-based gaze detection system for automobile drivers using a NIR camera sensor

    Rizwan Ali Naqvi, Muhammad Arsalan, Ganbayar Batchuluun, Hyo Sik Yoon, and Kang Ryoung Park. Deep learning-based gaze detection system for automobile drivers using a NIR camera sensor. Sensors, 18(2):456, 2018.

  45. [45]

    An efficient solution to the five-point relative pose problem

    David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.

  46. [46]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  47. [47]

    Automotive interior monitoring systems: A review of selected technical solutions for the recognition of fatigue symptoms in motor vehicle drivers

    Marcin Piotrowski, Lukasz Dziuda, and Paulina Baran. Automotive interior monitoring systems: A review of selected technical solutions for the recognition of fatigue symptoms in motor vehicle drivers. The Polish Journal of Aviation Medicine, Bioengineering and Psychology, 28:31–41, 2025.

  48. [48]

    Des lois géométriques qui régissent les déplacements d'un système solide dans l'espace

    Olinde Rodrigues. Des lois géométriques qui régissent les déplacements d'un système solide dans l'espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de mathématiques pures et appliquées, 5:380–440, 1840.

  49. [49]

    In-cabin sensing has automakers looking inward

    Salvatore Salamone. In-cabin sensing has automakers looking inward. RTInsights, 2025. Industry analysis of OEM adoption of in-cabin sensing driven by safety ratings and semi-autonomous driving.

  50. [50]

    DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers

    Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Thomas Lucas, Pau De Jorge, Diane Larlus, and Yannis Kalantidis. DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30084–30094, 2025.

  51. [51]

    SuperGlue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.

  52. [52]

    Understanding the limitations of CNN-based absolute camera pose regression

    Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixé. Understanding the limitations of CNN-based absolute camera pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3302–3312, 2019.

  53. [53]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  54. [54]

    Scene coordinate regression forests for camera relocalization in RGB-D images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.

  55. [55]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  56. [56]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  57. [57]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.

  58. [58]

    Active NIR illumination for improved camera view in automated driving application

    Max C Sundermeier, Hauke Dierend, Peer-Phillip Ley, Alexander Wolf, and Roland Lachmayer. Active NIR illumination for improved camera view in automated driving application. In Light-Emitting Devices, Materials, and Applications XXVI, pages 54–62. SPIE, 2022.

  59. [59]

    Validation of event data recorders in high severity full-frontal crash tests

    Ada Tsoi, John Hinch, Richard Ruth, and Hampton Gabler. Validation of event data recorders in high severity full-frontal crash tests. SAE International Journal of Transportation Safety, 1(2013-01-1265):76–99, 2013.

  60. [60]

    Visual camera re-localization using graph neural networks and relative pose supervision

    Mehmet Ozgur Turkoglu, Eric Brachmann, Konrad Schindler, Gabriel J Brostow, and Aron Monszpart. Visual camera re-localization using graph neural networks and relative pose supervision. In 2021 International Conference on 3D Vision (3DV), pages 145–155. IEEE, 2021.

  61. [61]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  62. [62]

    Absolute pose from one or two scaled and oriented features

    Jonathan Ventura, Zuzana Kukelova, Torsten Sattler, and Dániel Baráth. Absolute pose from one or two scaled and oriented features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20870–20880, 2024.

  63. [63]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  64. [64]

    Summarizing regional regulations for mandating driver monitoring systems

    Yulin Wang. Summarizing regional regulations for mandating driver monitoring systems. Automation.com, 2023. Overview of global driver monitoring regulations including EU GSR and ADDW requirements.

  65. [65]

    CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion. Advances in Neural Information Processing Systems, 35:3502–3516, 2022.

  66. [66]

    Learning to localize in new environments from synthetic training data

    Dominik Winkelbauer, Maximilian Denninger, and Rudolph Triebel. Learning to localize in new environments from synthetic training data. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 5840–5846. IEEE, 2021.

  67. [67]

    SpatialFormer: Towards generalizable vision transformers with explicit spatial understanding

    Han Xiao, Wenzhao Zheng, Sicheng Zuo, Peng Gao, Jie Zhou, and Jiwen Lu. SpatialFormer: Towards generalizable vision transformers with explicit spatial understanding. In European Conference on Computer Vision, pages 37–54. Springer, 2024.

  68. [68]

    Towards robust probabilistic modeling on SO(3) via rotation Laplace distribution

    Yingda Yin, Jiangran Lyu, Yang Wang, Haoran Liu, He Wang, and Baoquan Chen. Towards robust probabilistic modeling on SO(3) via rotation Laplace distribution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3469–3486, 2025.

  69. [69]

    Image based localization in urban environments

    Wei Zhang and Jana Kosecka. Image based localization in urban environments. In Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT'06), pages 33–40. IEEE, 2006.

  70. [70]

    Supplementary Material: Table 6 (inference runtime)

    Supplementary Material of InCaRPose, Sec. 7.1, Table 6: detailed inference runtime measurements on a single NVIDIA RTX 4090 GPU, reporting average per-frame latency (ms), frames per second (FPS), and relative speedup over the FP32 baseline at the corresponding backbone and resolution. (table truncated at source)

  71. [71]

    Rotation vector (axis-angle) parameterization

    Rotation Vector ($\mathbb{R}^3$): the rotation is represented by a compact axis-angle vector $\omega$; its direction specifies the rotation axis $u$ and its magnitude the rotation angle $\theta = \lVert \omega \rVert_2$ in radians. The mapping to a rotation matrix $R$ is given by Rodrigues' [48] formula
    $$R = I + \frac{\sin\theta}{\theta}\,[\omega]_\times + \frac{1-\cos\theta}{\theta^2}\,[\omega]_\times^2 \quad (3)$$
    where $[\omega]_\times$ is the skew-symmetric matrix of $\omega$.

  72. [72]

    Euler angles, intrinsic rotation

    Euler Angles, Intrinsic Rotation ($\mathbb{R}^3$): intrinsic rotations (moving axes) follow the standard ZYX convention. Given angles $(\alpha, \beta, \gamma)$, the rotation matrix is composed of successive rotations around the transformed axes:
    $$R_{\mathrm{int}} = R_z(\alpha)\, R_{y'}(\beta)\, R_{x''}(\gamma) \quad (4)$$
    The final output is $y = [\alpha, \beta, \gamma, t_x, t_y, t_z]^\top$.

  73. [73]

    Euler angles, extrinsic rotation

    Euler Angles, Extrinsic Rotation ($\mathbb{R}^3$): extrinsic rotations are performed around the fixed global axes $(X, Y, Z)$. For a sequence $(\gamma, \beta, \alpha)$, the resulting matrix is
    $$R_{\mathrm{ext}} = R_z(\alpha)\, R_y(\beta)\, R_x(\gamma) \quad (5)$$
    The final output is $y = [\gamma, \beta, \alpha, t_x, t_y, t_z]^\top$.

  74. [74]

    Quaternion parameterization

    Quaternions ($\mathbb{R}^4$): the rotation is represented by a unit quaternion $q = [w, x, y, z]^\top$ with $\lVert q \rVert_2 = 1$. The mapping to $R$ is
    $$R = \begin{pmatrix} 1-2(y^2+z^2) & 2(xy-wz) & 2(xz+wy) \\ 2(xy+wz) & 1-2(x^2+z^2) & 2(yz-wx) \\ 2(xz-wy) & 2(yz+wx) & 1-2(x^2+y^2) \end{pmatrix} \quad (6)$$
    The final output is $y = [q^\top, t^\top]^\top$.

  75. [75]

    Rotation matrix parameterization

    Rotation Matrix ($\mathbb{R}^9$): the rotation is represented directly by the flattened elements of $R \in \mathbb{R}^{3\times 3}$, which must satisfy the constraints of the special orthogonal group
    $$SO(3) = \{\, R \in \mathbb{R}^{3\times 3} : R^\top R = I,\ \det(R) = +1 \,\} \quad (7)$$
    The final output is the flattened nine elements of $R$ followed by $t$: $y = [r_{11}, r_{12}, \ldots, r_{33}, t_x, t_y, t_z]^\top$. (truncated at source)
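The four rotation parameterizations listed in these internal anchors all describe the same element of SO(3). A minimal consistency sketch using SciPy (not the paper's code); note SciPy stores quaternions scalar-last as [x, y, z, w], whereas the anchor above uses scalar-first [w, x, y, z].

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# One rotation, expressed in the representations from the anchors above.
r = R.from_rotvec([0.1, -0.3, 0.25])  # axis-angle omega, theta = ||omega||_2
M = r.as_matrix()                      # the flattened R9 form of eq. (7)

# Intrinsic ZYX (eq. 4) and extrinsic zyx (eq. 5) Euler round-trips:
assert np.allclose(R.from_euler('ZYX', r.as_euler('ZYX')).as_matrix(), M)
assert np.allclose(R.from_euler('zyx', r.as_euler('zyx')).as_matrix(), M)

# Quaternion round-trip (eq. 6), in SciPy's scalar-last convention:
assert np.allclose(R.from_quat(r.as_quat()).as_matrix(), M)

# SO(3) constraints of eq. (7):
assert np.allclose(M.T @ M, np.eye(3)) and np.isclose(np.linalg.det(M), 1.0)
print("all parameterizations agree")
```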