VOCA: Visual Odometry with Codec Awareness

Christoph Otten genannt Hermes; Daniel Cremers; Dominik Muhle; Mateo de Mayo; Nouri Alexander Hilscher

arxiv: 2607.00189 · v1 · pith:7C3RSH6Ynew · submitted 2026-06-30 · 💻 cs.CV


Nouri Alexander Hilscher, Mateo de Mayo, Dominik Muhle, Christoph Otten genannt Hermes, Daniel Cremers

Pith reviewed 2026-07-02 19:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual odometry · codec awareness · compressed video · stereo tracking · camera pose estimation · causal VO · video compression artifacts


The pith

VOCA improves causal stereo visual odometry on compressed video by incorporating codec information to reduce artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VOCA as a method for camera pose estimation from compressed image streams rather than the raw video used in most prior visual odometry work. Lossy compression creates visual artifacts that degrade standard tracking performance, and the approach shows that codec data can be extracted and used directly inside a causal stereo pipeline to counteract those effects. On standard benchmarks this yields better relative trajectory error, absolute trajectory error, and efficiency than existing causal methods when streams are compressed.

Core claim

VOCA is a causal stereo visual-odometry method that exploits codec information to improve tracking performance. It achieves state-of-the-art performance on causal VO for relative trajectory error, efficiency, and absolute trajectory error on compressed streams.
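The two error metrics named in this claim can be made concrete with a short sketch. The Python below is an illustration only, not the paper's evaluation code. It assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame (the alignment step that ATE evaluations normally include is omitted), and it uses positions only, whereas a full relative-error evaluation would compare relative poses including rotation. The names `ate_rmse`, `rte_rmse`, and the `delta` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _sq_norm(v: Vec3) -> float:
    """Squared Euclidean norm."""
    return v[0] ** 2 + v[1] ** 2 + v[2] ** 2


def ate_rmse(est: List[Vec3], gt: List[Vec3]) -> float:
    """Absolute trajectory error: RMSE of per-pose position differences.

    Assumption: est and gt are already associated and aligned.
    """
    if len(est) != len(gt) or not est:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(_sq_norm(_sub(e, g)) for e, g in zip(est, gt))
    return math.sqrt(total / len(est))


def rte_rmse(est: List[Vec3], gt: List[Vec3], delta: int = 1) -> float:
    """Relative trajectory error (translation only): RMSE of the difference
    between estimated and true displacement over `delta` frames."""
    if len(est) != len(gt):
        raise ValueError("trajectories must have equal length")
    n = len(est) - delta
    if delta < 1 or n < 1:
        raise ValueError("delta must be >= 1 and shorter than the trajectory")
    total = 0.0
    for i in range(n):
        d_est = _sub(est[i + delta], est[i])  # estimated motion over the window
        d_gt = _sub(gt[i + delta], gt[i])     # true motion over the window
        total += _sq_norm(_sub(d_est, d_gt))
    return math.sqrt(total / n)
```

Under these simplifications, an estimate that differs from the ground truth by a constant offset has a nonzero absolute error but zero relative error, which is why the two metrics are reported separately: one reflects global consistency, the other local drift.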

What carries the argument

The VOCA pipeline, which feeds codec-derived information into a causal stereo visual-odometry estimator to mitigate compression artifacts during pose tracking.

If this is right

Real-world spatial world models that receive compressed video can maintain higher tracking accuracy without increasing bandwidth or storage.
Perception modules inside planning systems can operate directly on hardware-decoded streams rather than requiring raw frames.
Efficiency gains allow longer operation on resource-constrained platforms that already use video codecs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same codec signals could be tested on other perception tasks such as depth estimation or object tracking that currently assume uncompressed input.
Extending the approach beyond stereo to monocular or multi-camera setups would test whether codec awareness generalizes when baseline geometry changes.
Integration with learned feature detectors might further reduce reliance on hand-crafted handling of compression noise.

Load-bearing premise

That codec information can be extracted and leveraged in a causal stereo setup to measurably mitigate compression artifacts and produce superior trajectory estimates compared to existing methods on standard benchmarks.

What would settle it

Running VOCA and prior causal stereo methods on the same compressed benchmark streams while withholding all codec metadata and measuring whether the performance gap disappears.
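One way to summarize such an experiment is a paired, per-sequence comparison. The sketch below is a hypothetical helper, not part of the paper: it assumes each run has been reduced to one error value per sequence (for example an ATE in metres), keyed by sequence name, and that diverged runs have been excluded beforehand. The function name `median_paired_gap` is an assumption for illustration.

```python
from statistics import median
from typing import Dict


def median_paired_gap(err_with: Dict[str, float],
                      err_without: Dict[str, float]) -> float:
    """Median over shared sequences of (error without codec metadata)
    minus (error with codec metadata).

    A positive value means the codec-aware run was more accurate on the
    typical sequence; a value near zero means the gap has disappeared.
    """
    shared = sorted(set(err_with) & set(err_without))
    if not shared:
        raise ValueError("no sequences in common between the two runs")
    return median(err_without[name] - err_with[name] for name in shared)
```

Pairing by sequence keeps a few hard sequences from dominating the comparison, and the median is less sensitive to a single large failure than the mean.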

Figures

Figures reproduced from arXiv: 2607.00189 by Christoph Otten genannt Hermes, Daniel Cremers, Dominik Muhle, Mateo de Mayo, Nouri Alexander Hilscher.

**Figure 1.** Visual Odometry on Compressed Videos. We present VOCA, a novel Visual Odometry system that produces smoother, more stable trajectories than descriptor-based systems such as ORB-SLAM3 and OKVIS2, thanks to its codec-aware sparse optical-flow frontend. Our system enables Visual Odometry on data compressed by up to 100×. We visualize challenging segments with dashed-red markers, sampled from three different …

**Figure 2.** Video encoding can introduce artifacts that violate the photometric-constancy assumption used by most tracking algorithms. Examples from datasets used in this work: EV203 (from EuRoC [6]) shows blurred details, TR2 (from TUM-VI room [54]) exhibits reduced contrast in the thin net, and MOO02 (from MSD [39]) contains geometrically jagged edges/textures. See [35] for additional artifacts.

**Figure 3.** Encoded information. Video codecs assign motion vectors to macroblock partitions and sub-partitions. They encode local block motion as a displacement to an intensity-matched block in a reference frame. The figure shows a frame with macroblock partitions (left), its motion vectors (center), and the normalized discrete vector field they induce (right). While this field is correlated with optical flow, it is …

**Figure 4.** Proximity to the minimum. VOCA uses motion vectors as priors for optical flow, which significantly reduces the distance to the ground-truth solution, possibly improving convergence times and the likelihood that the initial state lies in the convergence basin of the non-linear problem. This figure shows the error distributions of pixel distances between the prior guesses (initialization) and the final tra…

**Figure 5.** Qualitative trajectory examples. To showcase the quality of VOCA trajectories, in addition to …

**Figure 6.** MSD overview. We indicate divergences and resets with gray ∞ and ⟳, respectively. First-place ranks for each sequence are shown in green, and non-first ranks are shown in blue. See the supplementary for the detailed numbers of each run. 4.5 Ablation Study We consider three strategies for integrating motion vectors (MV) into tracking, balancing the strength of the MV prior against the risk of propagating ou…

**Figure 7.** Median ATE/RTE vs. Bitrate. Lower bitrates lead to smaller file sizes. Numbers indicate a success rate of less than 100% for ATE. VOCA shows stable performance even at low bitrates. As a tracking-based system, Basalt [64] suffers from compression artifacts, especially on the TUM-VI dataset. ORB-SLAM3 [7] shows steady performance in ATE on TUM-VI but delivers poor performance in RTE. This indicates good g…

**Figure 5.** Symbols and shading follow the conventions of Tab. 5.

**Figure 8.** H.264 and AV1 motion-vector priors on TUM-VI Room.
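The captions of Figures 3 and 4 describe block motion vectors and their use as priors for optical flow. The sketch below illustrates one simple way such a prior could be formed; it is an illustration under stated assumptions, not VOCA's frontend. It assumes a fixed grid of square blocks (real codecs use variable partitions), a single reference frame equal to the previous frame, and motion vectors stored as the displacement from a current-frame block to its match in that reference frame. It also looks up the block at the keypoint's previous position as a proxy for the block it lands in. The names `flow_prior`, `motion_vectors`, and `block_size` are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]    # (x, y) in pixels
BlockIndex = Tuple[int, int]   # (block column, block row)


def flow_prior(keypoint: Point,
               motion_vectors: Dict[BlockIndex, Point],
               block_size: int = 16) -> Point:
    """Initial guess for where a previous-frame keypoint lies in the current frame.

    Assumptions (illustrative only): fixed square blocks, and each motion
    vector points from a current-frame block to its match in the previous
    frame, so the forward motion is the negated vector.
    """
    x, y = keypoint
    block = (int(x // block_size), int(y // block_size))
    mv = motion_vectors.get(block)
    if mv is None:
        # No vector for this block (e.g. intra-coded): fall back to zero motion.
        return keypoint
    # Negate the backward-pointing vector to obtain a forward displacement.
    return (x - mv[0], y - mv[1])
```

A tracker would then refine this guess with its usual iterative optimization. The point made by Figure 4 is that starting closer to the solution can shorten that refinement and make convergence more likely.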

Original abstract

Camera pose estimation from image streams is a critical component of spatial world models that integrate perception into planning and decision-making. Nearly all Visual Odometry (VO) and Simultaneous Localization and Mapping (V-SLAM) systems have focused on datasets containing raw, uncompressed videos. Many working systems instead use ubiquitous hardware units to efficiently compress and decode video streams, saving orders of magnitude in storage and bandwidth. However, this lossy compression introduces visual artifacts that hinder the performance of traditional tracking systems. We present VOCA, a causal stereo visual-odometry method that exploits codec information to improve tracking performance. We achieve state-of-the-art performance on causal VO for relative trajectory error, efficiency, and absolute trajectory error on compressed streams. This work highlights the potential of leveraging widely available video codec information for vision tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VOCA claims codec side information improves causal stereo VO on compressed streams, but the abstract alone gives no way to check if the SOTA results hold.


The main takeaway is that VOCA tries to make visual odometry work on the compressed video that real devices produce by feeding codec information into the tracker. The abstract positions this as a fix for artifacts that hurt standard methods, with claims of better relative trajectory error, absolute trajectory error, and efficiency under causal stereo constraints.

What stands out as new is the direct use of codec data rather than just decoding to raw frames first. The paper notes that almost all prior VO work assumes uncompressed input, which is a real mismatch with hardware pipelines that compress to save bandwidth.

It does a reasonable job framing the practical gap. Compression is ubiquitous, and artifacts are known to degrade feature tracking, so treating codec outputs as usable side information is a logical step.

The soft spot is the complete absence of supporting material: there is no description of how codec elements such as motion vectors are extracted or fused, and there are no equations, dataset details, baselines, or numbers. The SOTA assertions on compressed streams cannot be evaluated from what is here. The central assumption, that codec information can be leveraged causally to produce measurably better trajectories, remains untested in the provided text.

This would interest people building VO for embedded robotics or AR where video is already compressed. A reader focused on deployment realism might find the direction useful if the full paper shows clean experiments.

I would send it to peer review to let referees check whether the method and results actually support the claims, but the current version is too thin to judge on its own.

Referee Report

1 major / 0 minor

Summary. The manuscript presents VOCA, a causal stereo visual-odometry method that exploits codec information to improve tracking performance. It claims state-of-the-art results for relative trajectory error, efficiency, and absolute trajectory error on compressed streams, highlighting the use of widely available video codec side information for vision tasks.

Significance. If the results hold, the work addresses a practical gap by showing how codec side information can mitigate compression artifacts in causal stereo VO, which could improve robustness and efficiency in real-world systems that rely on compressed video rather than raw streams.

major comments (1)

[Abstract] The central claim of achieving SOTA performance on RTE, efficiency, and ATE for causal VO on compressed streams is unsupported, as the text provides no experimental setup, datasets, baselines, quantitative results, or validation details.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the practical relevance of exploiting codec side information in causal stereo visual odometry. We address the single major comment below.


Referee: [Abstract] The central claim of achieving SOTA performance on RTE, efficiency, and ATE for causal VO on compressed streams is unsupported, as the text provides no experimental setup, datasets, baselines, quantitative results, or validation details.

Authors: Abstracts are concise summaries and do not contain experimental details by design. The full manuscript provides the requested information: Section 4 describes the experimental setup and datasets (including compressed streams from standard benchmarks), Section 5 details the baselines and quantitative comparisons, and Tables 1-3 report the RTE, efficiency, and ATE results demonstrating state-of-the-art performance for causal VO. These sections validate the abstract claims with specific metrics and ablation studies.

Revision: no

Circularity Check

0 steps flagged

No significant circularity


The paper describes an empirical method for causal stereo visual odometry that incorporates codec side information to mitigate compression artifacts. No derivation chain, equations, fitted parameters, uniqueness theorems, or ansatzes are presented in the provided text. Claims reduce to experimental results on standard benchmarks (RTE, ATE, efficiency), which are externally falsifiable and not equivalent to inputs by construction. No self-citation load-bearing steps or renamings of known results appear. This is the expected outcome for a systems/implementation paper without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

74 extracted references · 27 canonical work pages · 3 internal anchors

[1] Arunruangsirilert, K., Katto, J.: Evaluation of hardware-based video encoders on modern gpus for uhd live-streaming. In: 2024 33rd International Conference on Computer Communications and Networks (ICCCN). pp. 1–9 (2024). https://doi.org/10.1109/ICCCN61486.2024.10637525
[2] Bahl, S., Mendonca, R., Chen, L., Jain, U., Pathak, D.: Affordances from Human Videos as a Versatile Representation for Robotics. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 01–13. IEEE, Vancouver, BC, Canada (Jun 2023). https://doi.org/10.1109/CVPR52729.2023.01324
[3] Banerjee, P., Shkodrani, S., Moulon, P., Hampali, S., Han, S., Zhang, F., Zhang, L., Fountain, J., Miller, E., Basol, S., Newcombe, R., Wang, R., Engel, J.J., Hodan, T.: HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7061–7071 (Jun 2025). https://d...
[4] Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: European conference on computer vision. pp. 404–417. Springer (2006)
[5] Bross, B., Wang, Y.K., Ye, Y., Liu, S., Chen, J., Sullivan, G.J., Ohm, J.R.: Overview of the versatile video coding (vvc) standard and its applications. IEEE Transactions on Circuits and Systems for Video Technology 31(10), 3736–3764 (2021). https://doi.org/10.1109/TCSVT.2021.3101953
[6] Burri, M., Nikolic, J., Gohl, P., Schneider, T., Rehder, J., Omari, S., Achtelik, M.W., Siegwart, R.: The euroc micro aerial vehicle datasets. The International Journal of Robotics Research 35(10), 1157–1163 (2016)
[7] Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics 37(6), 1874–1890 (2021)
[8] Carlone, L., Kim, A., Barfoot, T., Cremers, D., Dellaert, F.: Slam handbook: From localization and mapping to spatial intelligence (2025)
[9] Chen, H., Sun, B., Zhang, A., Pollefeys, M., Leutenegger, S.: VidBot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)
[10] Chen, W., Chen, L., Wang, R., Pollefeys, M.: Leap-vo: Long-term effective any point tracking for visual odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19844–19853 (2024)
[11] Chi, Y., Sommer, L., Dünkel, O., Muhle, D., Cremers, D., Theobalt, C., Kortylewski, A.: C3po: Canonicalization of 3d pose from partial views with generalizable correspondence features. In: 2026 International Conference on 3D Vision (3DV). pp. 587–597. IEEE (2026)
[12] Chng, C.K., Parra, A., Chin, T.J., Latif, Y.: Monocular rotational odometry with incremental rotation averaging and loop closure. In: 2020 Digital Image Computing: Techniques and Applications (DICTA). pp. 1–8. IEEE (2020)
[13] Cin, A.P.D., Dikov, G., Ju, J., Ghafoorian, M.: Anymap: Learning a general camera model for structure-from-motion with unknown distortion in dynamic scenes. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16674–16684 (2025)
[14] Deng, K., Ti, Z., Xu, J., Yang, J., Xie, J.: Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.16443 (2025)
[15] DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 224–236 (2018)
[16] Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence 40(3), 611–625 (2017)
[17] Engel, J., Schöps, T., Cremers, D.: Lsd-slam: Large-scale direct monocular slam. In: European conference on computer vision. pp. 834–849. Springer (2014)
[18] Geneva, P., Eckenhoff, K., Lee, W., Yang, Y., Huang, G.: Openvins: A research platform for visual-inertial estimation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 4666–4672. IEEE (2020)
[19] Gross, M., Fahmy, A., Niwattananan, D., Muhle, D., Song, R., Cremers, D., Meeß, H.: Ipformer: Visual 3d panoptic scene completion with context-adaptive instance proposals. Advances in Neural Information Processing Systems 38, 4989–5014 (2026)
[20] Han, J., Li, B., Mukherjee, D., Chiang, C.H., Grange, A., Chen, C., Su, H., Parker, S., Deng, S., Joshi, U., Chen, Y., Wang, Y., Wilkins, P., Xu, Y., Bankoski, J.: A technical overview of av1. Proceedings of the IEEE 109(9), 1435–1462 (2021). https://doi.org/10.1109/JPROC.2021.3058584
[21] Han, K., Muhle, D., Wimbauer, F., Cremers, D.: Boosting self-supervision for single-view scene completion via knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9837–9847 (2024)
[22] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2 edn. (2004). https://doi.org/10.1017/CBO9780511811685
[23] Hayler, A., Wimbauer, F., Muhle, D., Rupprecht, C., Cremers, D.: S4c: Self-supervised semantic scene completion with neural fields. In: 2024 International Conference on 3D Vision (3DV). pp. 409–420. IEEE (2024)
[24] Hofer, J., et al.: H.264 Compress-Then-Analyze Transmission in Edge-Assisted Visual SLAM. In: European Wireless 2023; 28th European Wireless Conference. pp. 130–135 (2023)
[25] Hsiao, Y.M., Lee, J.F., Chen, J.S., Chu, Y.S.: Review: H.264 video transmissions over wireless networks: Challenges and solutions. Comput. Commun. 34(14), 1661–1672 (Sep 2011). https://doi.org/10.1016/j.comcom.2011.03.016
[26] International Telecommunication Union: ITU-T Recommendation H.262: Information technology – Generic coding of moving pictures and associated audio information: Video. https://www.itu.int/rec/T-REC-H.262 (Jan 2021), accessed: 2026-02-08
[27] Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., Xue, X., Su, Q., Lyu, H., Zheng, X., Liu, J., Wang, Z., Zhang, S.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1724–1734 (2025)
[28] Kantorov, V., Laptev, I.: Efficient feature extraction, encoding, and classification for action recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. pp. 2593–2600 (2014). https://doi.org/10.1109/CVPR.2014.332
[29] Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Universal feed-forward metric 3d reconstruction; map-anything.github.io. In: 2026 International Conference on 3D Vision (3DV). pp. 499–509. IEEE (2026)
[30] Klein, G., Murray, D.: Parallel tracking and mapping for small ar workspaces. In: 2007 6th IEEE and ACM international symposium on mixed and augmented reality. pp. 225–234. IEEE (2007)
[31] Lee, S.H., Civera, J.: Rotation-only bundle adjustment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 424–433 (2021)
[32] Leutenegger, S.: Okvis2: Realtime scalable visual-inertial slam with loop closure. arXiv preprint arXiv:2202.09199 (2022)
[33] Leutenegger, S., Chli, M., Siegwart, R.Y.: Brisk: Binary robust invariant scalable keypoints. In: 2011 International conference on computer vision. pp. 2548–2555. IEEE (2011)
[34] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
[35] Lin, L., Yu, S., Zhou, L., Chen, W., Zhao, T., Wang, Z.: Pea265: Perceptual assessment of video compression artifacts. IEEE Transactions on Circuits and Systems for Video Technology 30(11), 3898–3910 (2020)
[36] Liou, M.: Overview of the p×64 kbit/s video coding standard. Commun. ACM 34(4), 59–63 (Apr 1991). https://doi.org/10.1145/103085.103091
[37] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International journal of computer vision 60(2), 91–110 (2004)
[38] Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Hayes, P.J. (ed.) Proceedings of the 7th International Joint Conference on Artificial Intelligence, IJCAI ’81, Vancouver, BC, Canada, August 24-28, 1981. pp. 674–679. William Kaufmann (1981)
[39] de Mayo, M., Cremers, D., Pire, T.: The monado slam dataset for egocentric visual-inertial tracking. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 13111–13118. IEEE (2025)
[40] Muhle, D., Koestler, L., Demmel, N., Bernard, F., Cremers, D.: The probabilistic normal epipolar constraint for frame-to-frame rotation optimization under uncertain feature positions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1819–1828 (2022)
[41] Muhle, D., Koestler, L., Jatavallabhula, K.M., Cremers, D.: Learning correspondence uncertainty via differentiable nonlinear least squares. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13102–13112 (2023)
[42] Mukherjee, D., Bankoski, J., Grange, A., Han, J., Koleszar, J., Wilkins, P., Xu, Y., Bultje, R.: The latest open-source video codec vp9 - an overview and preliminary results. In: 2013 Picture Coding Symposium (PCS). pp. 390–393 (2013). https://doi.org/10.1109/PCS.2013.6737765
[43] Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics 31(5), 1147–1163 (2015). https://doi.org/10.1109/TRO.2015.2463671
[44] Mur-Artal, R., Tardós, J.D.: Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33(5), 1255–1262 (2017). https://doi.org/10.1109/TRO.2017.2705103
[45] Murai, R., Dexheimer, E., Davison, A.J.: Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16695–16705 (2025)
[46] Peroni, L., Gorinsky, S.: An end-to-end pipeline perspective on video streaming in best-effort networks: A survey and tutorial. ACM Computing Surveys 57(12), 1–47 (Jul 2025). https://doi.org/10.1145/3742472
[47] Qian, S., Mo, K., Blukis, V., Fouhey, D.F., Fox, D., Goyal, A.: 3D-MVP: 3D Multiview Pretraining for Manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2025)
[48] Reich, C., Hahn, O., Cremers, D., Roth, S., Debnath, B.: A perspective on deep vision performance with standard image and video codecs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5712–5721 (2024)
[49] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: 2011 International Conference on Computer Vision. pp. 2564–2571 (2011). https://doi.org/10.1109/ICCV.2011.6126544
[50] Rückert, D., Stamminger, M.: Snake-slam: Efficient global visual inertial slam using decoupled nonlinear optimization. In: 2021 International conference on unmanned aircraft systems (ICUAS). pp. 219–228. IEEE (2021)
[51] Sandström, E., Zhang, G., Tateno, K., Oechsle, M., Niemeyer, M., Zhang, Y., Patel, M., Van Gool, L., Oswald, M., Tombari, F.: Splat-slam: Globally optimized rgb-only slam with 3d gaussians. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1680–1691 (2025)
[52] Schafer, R., Sikora, T.: Digital video coding standards and their role in video communications. Proceedings of the IEEE 83(6), 907–924 (1995). https://doi.org/10.1109/5.387092
[53] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)
[54] Schubert, D., Goll, T., Demmel, N., Usenko, V., Stückler, J., Cremers, D.: The tum vi benchmark for evaluating visual-inertial odometry. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1680– (2018)
[55] Seufert, M., Egger, S., Slanina, M., Zinner, T., Hoßfeld, T., Tran-Gia, P.: A survey on quality of experience of http adaptive streaming. IEEE Communications Surveys & Tutorials 17(1), 469–492 (2015). https://doi.org/10.1109/COMST.2014.2360940
[56] Shi, J., Tomasi, C.: Good features to track. In: Conference on Computer Vision and Pattern Recognition, CVPR 1994, 21-23 June, 1994, Seattle, WA, USA. pp. 593–600. IEEE (1994). https://doi.org/10.1109/CVPR.1994.323794
[57] Smith, C., Charatan, D., Tewari, A., Sitzmann, V.: Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. In: 2025 International Conference on 3D Vision (3DV). pp. 389–400. IEEE (2025)
[58] Stachniss, C., Leonard, J.J., Thrun, S.: Simultaneous localization and mapping. In: Springer handbook of robotics, pp. 1153–1176. Springer (2016)
[59] Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on Circuits and Systems for Video Technology 22(12), 1649–1668 (2012). https://doi.org/10.1109/TCSVT.2012.2221191
[60] Teed, Z., Deng, J.: Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems 34, 16558–16569 (2021)
[61] Tomasi, C., Kanade, T.: Detection and tracking of point features. Tech. rep., International Journal of Computer Vision (1991)
[62] Turner, R.N., Banerjee, N.K., Banerjee, S.: Mov-slam: Using motion vectors for real-time single-cpu visual slam. In: 2023 Seventh IEEE International Conference on Robotic Computing (IRC). pp. 51–58. IEEE (2023)
[63]

Ungureanu, D., Bogo, F., Galliani, S., Sama, P., Duan, X., Meekhof, C., Stühmer, J., Cashman, T.J., Tekin, B., Schönberger, J.L., Olszta, P., Pollefeys, M.: HoloLens 2 Research Mode as a Tool for Computer Vision Research (Aug 2020). https://doi.org/10.48550/arXiv.2008.11239, http://arxiv.org/abs/2008.11239
[64]

Usenko, V., Demmel, N., Schubert, D., Stückler, J., Cremers, D.: Visual-inertial mapping with non-linear factor recovery. IEEE Robotics Autom. Lett. 5(2), 422–429 (2020). https://doi.org/10.1109/LRA.2019.2961227
[65]

Von Stumberg, L., Cremers, D.: Dm-vio: Delayed marginalization visual-inertial odometry. IEEE Robotics and Automation Letters 7(2), 1408–1415 (2022)
[66]

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)
[67]

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)
[68]

Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13(7), 560–576 (2003). https://doi.org/10.1109/TCSVT.2003.815165
[69]

Wimbauer, F., Chen, W., Muhle, D., Rupprecht, C., Cremers, D.: Anycam: Learning to recover camera poses and intrinsics from casual videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16717–16727 (2025)
[70]

Wimbauer, F., Yang, N., Rupprecht, C., Cremers, D.: Behind the scenes: Density fields for single view reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9076–9086 (2023)
[71]

Zhou, S., Jiang, X., Tan, W., He, R., Yan, B.: Mvflow: Deep optical flow estimation of compressed videos with motion vector prior. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 1964–1974 (2023)
[72]

Zhu, Z., Akkaya, I.B., Waeijen, L., Bondarev, E., Pourtaherian, A., Moreira, O.: MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 29309–29320 (Jun 2025). https://doi.org/10.1109/CVPR52734.2025.02729, https://ieeexplore.ieee.org/document/11092745
[73]

Zouein, J., Javidnia, H., Pitié, F., Kokaram, A.: Leveraging AV1 Motion Vectors for Fast and Dense Feature Matching. In: 2025 IEEE 4th International Conference on Intelligent Reality (ICIR). pp. 1–4. https://doi.org/10.1109/ICIR68135.2025.11361611
[74]

Zouein, J., Vibhoothi, V., Kokaram, A.: AV1 Motion Vector Fidelity and Application for Efficient Optical Flow. In: 2025 Picture Coding Symposium (PCS). pp. 1–5. https://doi.org/10.1109/PCS65673.2025.11417638

A Metrics

In this paper, we utilize two commonly used metrics to evaluate the performance of Visual Odometry algor...
