pith. sign in

arxiv: 2607.00189 · v1 · pith:7C3RSH6Ynew · submitted 2026-06-30 · 💻 cs.CV

VOCA: Visual Odometry with Codec Awareness

Pith reviewed 2026-07-02 19:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual odometrycodec awarenesscompressed videostereo trackingcamera pose estimationcausal VOvideo compression artifacts
0
0 comments X

The pith

VOCA improves causal stereo visual odometry on compressed video by incorporating codec information to reduce artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VOCA as a method for camera pose estimation from compressed image streams rather than the raw video used in most prior visual odometry work. Lossy compression creates visual artifacts that degrade standard tracking performance, and the approach shows that codec data can be extracted and used directly inside a causal stereo pipeline to counteract those effects. On standard benchmarks this yields better relative trajectory error, absolute trajectory error, and efficiency than existing causal methods when streams are compressed.

Core claim

VOCA is a causal stereo visual-odometry method that exploits codec information to improve tracking performance. It achieves state-of-the-art performance on causal VO for relative trajectory error, efficiency, and absolute trajectory error on compressed streams.

What carries the argument

The VOCA pipeline, which feeds codec-derived information into a causal stereo visual-odometry estimator to mitigate compression artifacts during pose tracking.

If this is right

  • Real-world spatial world models that receive compressed video can maintain higher tracking accuracy without increasing bandwidth or storage.
  • Perception modules inside planning systems can operate directly on hardware-decoded streams rather than requiring raw frames.
  • Efficiency gains allow longer operation on resource-constrained platforms that already use video codecs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same codec signals could be tested on other perception tasks such as depth estimation or object tracking that currently assume uncompressed input.
  • Extending the approach beyond stereo to monocular or multi-camera setups would test whether codec awareness generalizes when baseline geometry changes.
  • Integration with learned feature detectors might further reduce reliance on hand-crafted handling of compression noise.

Load-bearing premise

That codec information can be extracted and leveraged in a causal stereo setup to measurably mitigate compression artifacts and produce superior trajectory estimates compared to existing methods on standard benchmarks.

What would settle it

Running VOCA and prior causal stereo methods on the same compressed benchmark streams while withholding all codec metadata and measuring whether the performance gap disappears.

Figures

Figures reproduced from arXiv: 2607.00189 by Christoph Otten genannt Hermes, Daniel Cremers, Dominik Muhle, Mateo de Mayo, Nouri Alexander Hilscher.

Figure 1
Figure 1. Figure 1: Visual Odometry on Compressed Videos. We present VOCA, a novel Visual Odometry system that produces smoother, more stable trajectories than descriptor-based systems such as ORB-SLAM3 and OKVIS2, thanks to its codec-aware sparse optical-flow frontend. Our system enables Visual Odometry on data compressed by up to 100 ×. We visualize challenging segments with dashed-red markers, sampled from three different … view at source ↗
Figure 2
Figure 2. Figure 2: Video encoding can introduce artifacts that violate the photometric-constancy assumption used by most tracking algorithms. Examples from datasets used in this work: EV203 (from EuRoC [6]) shows blurred details, TR2 (from TUM-VI room [54]) exhibits reduced contrast in the thin net, and MOO02 (from MSD [39]) contains geo￾metrically jagged edges/textures. See [35] for additional artifacts. causal estimates, m… view at source ↗
Figure 3
Figure 3. Figure 3: Encoded information. Video codecs assign motion vectors to macroblock partitions and sub-partitions. They encode local block motion as a displacement to an intensity-matched block in a reference frame. The figure shows a frame with macroblock partitions (left), its motion vectors (center), and the normalized discrete vector field they induce (right). While this field is correlated with optical flow, it is … view at source ↗
Figure 4
Figure 4. Figure 4: Proximity to the minimum. VOCA uses motion vectors as priors for opti￾cal flow, which significantly reduces the distance to the ground-truth solution, possibly improving convergence times and the likelihood that the initial state lies in the con￾vergence basin of the non-linear problem. This figure shows the error distributions of pixel distances between the prior guesses (initialization) and the final tra… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative trajectory examples. To showcase the quality of VOCA trajec￾tories, in addition to [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: MSD overview. We indicate divergences and resets with gray ∞ and ⟳, respectively. First-place ranks for each sequence are shown in green, and non-first ranks are shown in blue. See the supplementary for the detailed numbers of each run. 4.5 Ablation Study We consider three strategies for integrating motion vectors (MV) into tracking, balancing the strength of the MV prior against the risk of propagating ou… view at source ↗
Figure 7
Figure 7. Figure 7: Median ATE/RTE vs. Bitrate. Lower bitrates lead to smaller file sizes. Numbers indicate a success rate of less than 100% for ATE. VOCA shows stable per￾formance even at low bitrates. As a tracking-based system, Basalt [64] suffers from com￾pression artifacts, especially on the TUM-VI dataset. ORB-SLAM3 [7] shows steady performance in ATE on TUM-VI but delivers poor performance in RTE. This indicates good g… view at source ↗
Figure 5
Figure 5. Figure 5: Symbols and shading follow the conventions of Tab. 5. [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: H.264 and AV1 motion-vector priors on TUM-VI Room. [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
read the original abstract

Camera pose estimation from image streams is a critical component of spatial world models that integrate perception into planning and decision-making. Nearly all Visual Odometry (VO) and Simultaneous Localization and Mapping (V-SLAM) systems have focused on datasets containing raw, uncompressed videos. Many working systems instead use ubiquitous hardware units to efficiently compress and decode video streams, saving orders of magnitude in storage and bandwidth. However, this lossy compression introduces visual artifacts that hinder the performance of traditional tracking systems. We present VOCA, a causal stereo visual-odometry method that exploits codec information to improve tracking performance. We achieve state-of-the-art performance on causal VO for relative trajectory error, efficiency, and absolute trajectory error on compressed streams. This work highlights the potential of leveraging widely available video codec information for vision tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents VOCA, a causal stereo visual-odometry method that exploits codec information to improve tracking performance. It claims state-of-the-art results for relative trajectory error, efficiency, and absolute trajectory error on compressed streams, highlighting the use of widely available video codec side information for vision tasks.

Significance. If the results hold, the work addresses a practical gap by showing how codec side information can mitigate compression artifacts in causal stereo VO, which could improve robustness and efficiency in real-world systems that rely on compressed video rather than raw streams.

major comments (1)
  1. [Abstract] Abstract: the central claim of achieving SOTA performance on RTE, efficiency, and ATE for causal VO on compressed streams is unsupported, as the text provides no experimental setup, datasets, baselines, quantitative results, or validation details.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the practical relevance of exploiting codec side information in causal stereo visual odometry. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of achieving SOTA performance on RTE, efficiency, and ATE for causal VO on compressed streams is unsupported, as the text provides no experimental setup, datasets, baselines, quantitative results, or validation details.

    Authors: Abstracts are concise summaries and do not contain experimental details by design. The full manuscript provides the requested information: Section 4 describes the experimental setup and datasets (including compressed streams from standard benchmarks), Section 5 details the baselines and quantitative comparisons, and Tables 1-3 report the RTE, efficiency, and ATE results demonstrating state-of-the-art performance for causal VO. These sections validate the abstract claims with specific metrics and ablation studies. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical method for causal stereo visual odometry that incorporates codec side information to mitigate compression artifacts. No derivation chain, equations, fitted parameters, uniqueness theorems, or ansatzes are presented in the provided text. Claims reduce to experimental results on standard benchmarks (RTE, ATE, efficiency), which are externally falsifiable and not equivalent to inputs by construction. No self-citation load-bearing steps or renamings of known results appear. This is the expected outcome for a systems/implementation paper without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5672 in / 991 out tokens · 27739 ms · 2026-07-02T19:21:02.402053+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 27 canonical work pages · 3 internal anchors

  1. [1]

    In: 2024 33rd International Conference on Computer Communications and Networks (ICCCN)

    Arunruangsirilert, K., Katto, J.: Evaluation of hardware-based video encoders on modern gpus for uhd live-streaming. In: 2024 33rd International Conference on Computer Communications and Networks (ICCCN). pp. 1–9 (2024).https:// doi.org/10.1109/ICCCN61486.2024.10637525

  2. [2]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Bahl, S., Mendonca, R., Chen, L., Jain, U., Pathak, D.: Affordances from Human Videos as a Versatile Representation for Robotics. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 01–13. IEEE, Vancou- ver, BC, Canada (Jun 2023).https://doi.org/10.1109/CVPR52729.2023.01324

  3. [3]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Banerjee, P., Shkodrani, S., Moulon, P., Hampali, S., Han, S., Zhang, F., Zhang, L., Fountain, J., Miller, E., Basol, S., Newcombe, R., Wang, R., Engel, J.J., Hodan, T.: HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7061–7071 (Jun 2025).https://d...

  4. [4]

    In: Euro- pean conference on computer vision

    Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: Euro- pean conference on computer vision. pp. 404–417. Springer (2006)

  5. [5]

    IEEE Transactions on Circuits and Systems for Video Technology31(10), 3736–3764 (2021).https://doi.org/10.1109/TCSVT.2021.3101953

    Bross, B., Wang, Y.K., Ye, Y., Liu, S., Chen, J., Sullivan, G.J., Ohm, J.R.: Overview of the versatile video coding (vvc) standard and its applications. IEEE Transactions on Circuits and Systems for Video Technology31(10), 3736–3764 (2021).https://doi.org/10.1109/TCSVT.2021.3101953

  6. [6]

    The International Journal of Robotics Research35(10), 1157–1163 (2016)

    Burri, M., Nikolic, J., Gohl, P., Schneider, T., Rehder, J., Omari, S., Achtelik, M.W., Siegwart, R.: The euroc micro aerial vehicle datasets. The International Journal of Robotics Research35(10), 1157–1163 (2016)

  7. [7]

    IEEE transactions on robotics37(6), 1874–1890 (2021)

    Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics37(6), 1874–1890 (2021)

  8. [8]

    Carlone, L., Kim, A., Barfoot, T., Cremers, D., Dellaert, F.: Slam handbook: From localization and mapping to spatial intelligence (2025)

  9. [9]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)

    Chen, H., Sun, B., Zhang, A., Pollefeys, M., Leutenegger, S.: VidBot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)

  10. [10]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chen, W., Chen, L., Wang, R., Pollefeys, M.: Leap-vo: Long-term effective any point tracking for visual odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19844–19853 (2024)

  11. [11]

    In: 2026 International Conference on 3D Vision (3DV)

    Chi, Y., Sommer, L., Dünkel, O., Muhle, D., Cremers, D., Theobalt, C., Ko- rtylewski, A.: C3po: Canonicalization of 3d pose from partial views with gener- alizable correspondence features. In: 2026 International Conference on 3D Vision (3DV). pp. 587–597. IEEE (2026)

  12. [12]

    Chng, C.K., Parra, A., Chin, T.J., Latif, Y.: Monocular rotational odometry with incrementalrotationaveragingandloopclosure.In:2020DigitalImageComputing: Techniques and Applications (DICTA). pp. 1–8. IEEE (2020)

  13. [13]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Cin, A.P.D., Dikov, G., Ju, J., Ghafoorian, M.: Anymap: Learning a general cam- era model for structure-from-motion with unknown distortion in dynamic scenes. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16674–16684 (2025)

  14. [14]

    VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

    Deng, K., Ti, Z., Xu, J., Yang, J., Xie, J.: Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.16443 (2025) VOCA: Visual Odometry with Codec Awareness 17

  15. [15]

    In: Proceedings of the IEEE conference on com- puter vision and pattern recognition workshops

    DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE conference on com- puter vision and pattern recognition workshops. pp. 224–236 (2018)

  16. [16]

    IEEE transactions on pattern analysis and machine intelligence40(3), 611–625 (2017)

    Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence40(3), 611–625 (2017)

  17. [17]

    In: European conference on computer vision

    Engel, J., Schöps, T., Cremers, D.: Lsd-slam: Large-scale direct monocular slam. In: European conference on computer vision. pp. 834–849. Springer (2014)

  18. [18]

    In: 2020 IEEE International Conference on Robotics and Automation (ICRA)

    Geneva, P., Eckenhoff, K., Lee, W., Yang, Y., Huang, G.: Openvins: A research platform for visual-inertial estimation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 4666–4672. IEEE (2020)

  19. [19]

    Advances in Neural Information Processing Systems38, 4989–5014 (2026)

    Gross, M., Fahmy, A., Niwattananan, D., Muhle, D., Song, R., Cremers, D., Meeß, H.: Ipformer: Visual 3d panoptic scene completion with context-adaptive instance proposals. Advances in Neural Information Processing Systems38, 4989–5014 (2026)

  20. [20]

    Proceedings of the IEEE109(9), 1435–1462 (2021)

    Han, J., Li, B., Mukherjee, D., Chiang, C.H., Grange, A., Chen, C., Su, H., Parker, S., Deng, S., Joshi, U., Chen, Y., Wang, Y., Wilkins, P., Xu, Y., Bankoski, J.: A technical overview of av1. Proceedings of the IEEE109(9), 1435–1462 (2021). https://doi.org/10.1109/JPROC.2021.3058584

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Han, K., Muhle, D., Wimbauer, F., Cremers, D.: Boosting self-supervision for single-view scene completion via knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9837– 9847 (2024)

  22. [22]

    Cam- bridge University Press, Cambridge, 2 edn

    Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cam- bridge University Press, Cambridge, 2 edn. (2004).https://doi.org/10.1017/ CBO9780511811685

  23. [23]

    In: 2024 International Conference on 3D Vision (3DV)

    Hayler, A., Wimbauer, F., Muhle, D., Rupprecht, C., Cremers, D.: S4c: Self- supervised semantic scene completion with neural fields. In: 2024 International Conference on 3D Vision (3DV). pp. 409–420. IEEE (2024)

  24. [24]

    In: European Wireless 2023; 28th European Wireless Conference

    Hofer, J., et al.: H.264 Compress-Then-Analyze Transmission in Edge-Assisted Visual SLAM. In: European Wireless 2023; 28th European Wireless Conference. pp. 130–135 (2023)

  25. [25]

    Hsiao, Y.M., Lee, J.F., Chen, J.S., Chu, Y.S.: Review: H.264 video transmis- sions over wireless networks: Challenges and solutions. Comput. Commun.34(14), 1661–1672 (Sep 2011).https://doi.org/10.1016/j.comcom.2011.03.016

  26. [26]

    International Telecommunication Union: ITU-T Recommendation H.262: Informa- tion technology – Generic coding of moving pictures and associated audio infor- mation: Video.https://www.itu.int/rec/T- REC- H.262(Jan 2021), accessed: 2026-02-08

  27. [27]

    2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An,P.,Xue,X.,Su,Q.,Lyu,H.,Zheng,X.,Liu,J.,Wang,Z.,Zhang,S.:Robobrain: A unified brain model for robotic manipulation from abstract to concrete. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 1724–1734 (2025)

  28. [28]

    In: 2014 IEEE Conference on Computer Vision and Pattern Recognition

    Kantorov, V., Laptev, I.: Efficient feature extraction, encoding, and classification for action recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. pp. 2593–2600 (2014).https://doi.org/10.1109/CVPR.2014.332

  29. [29]

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Universal feed-forward metric 3d reconstruction; map-anything. github. io. In: 2026 Interna- tional Conference on 3D Vision (3DV). pp. 499–509. IEEE (2026) 18 N. Hilscher et al

  30. [30]

    In: 2007 6th IEEE and ACM international symposium on mixed and augmented reality

    Klein, G., Murray, D.: Parallel tracking and mapping for small ar workspaces. In: 2007 6th IEEE and ACM international symposium on mixed and augmented reality. pp. 225–234. IEEE (2007)

  31. [31]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Lee, S.H., Civera, J.: Rotation-only bundle adjustment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 424– 433 (2021)

  32. [32]

    arXiv preprint arXiv:2202.09199 (2022)

    Leutenegger, S.: Okvis2: Realtime scalable visual-inertial slam with loop closure. arXiv preprint arXiv:2202.09199 (2022)

  33. [33]

    In: 2011 International conference on computer vision

    Leutenegger, S., Chli, M., Siegwart, R.Y.: Brisk: Binary robust invariant scalable keypoints. In: 2011 International conference on computer vision. pp. 2548–2555. Ieee (2011)

  34. [34]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  35. [35]

    IEEE Transactions on Circuits and Systems for Video Technology30(11), 3898–3910 (2020)

    Lin, L., Yu, S., Zhou, L., Chen, W., Zhao, T., Wang, Z.: Pea265: Perceptual assess- ment of video compression artifacts. IEEE Transactions on Circuits and Systems for Video Technology30(11), 3898–3910 (2020)

  36. [36]

    Liou, M.: Overview of the p×64 kbit/s video coding standard. Commun. ACM 34(4), 59–63 (Apr 1991).https://doi.org/10.1145/103085.103091

  37. [37]

    Interna- tional journal of computer vision60(2), 91–110 (2004)

    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional journal of computer vision60(2), 91–110 (2004)

  38. [38]

    In: Hayes, P.J

    Lucas, B.D., Kanade, T.: An iterative image registration technique with an appli- cation to stereo vision. In: Hayes, P.J. (ed.) Proceedings of the 7th International Joint Conference on Artificial Intelligence, IJCAI ’81, Vancouver, BC, Canada, August 24-28, 1981. pp. 674–679. William Kaufmann (1981)

  39. [39]

    In: 2025 IEEE/RSJ International Conference on Intelli- gent Robots and Systems (IROS)

    de Mayo, M., Cremers, D., Pire, T.: The monado slam dataset for egocentric visual-inertial tracking. In: 2025 IEEE/RSJ International Conference on Intelli- gent Robots and Systems (IROS). pp. 13111–13118. IEEE (2025)

  40. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Muhle, D., Koestler, L., Demmel, N., Bernard, F., Cremers, D.: The probabilistic normal epipolar constraint for frame-to-frame rotation optimization under uncer- tain feature positions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1819–1828 (2022)

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Muhle, D., Koestler, L., Jatavallabhula, K.M., Cremers, D.: Learning correspon- dence uncertainty via differentiable nonlinear least squares. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13102– 13112 (2023)

  42. [42]

    In: 2013 Picture Coding Symposium (PCS)

    Mukherjee, D., Bankoski, J., Grange, A., Han, J., Koleszar, J., Wilkins, P., Xu, Y., Bultje, R.: The latest open-source video codec vp9 - an overview and preliminary results. In: 2013 Picture Coding Symposium (PCS). pp. 390–393 (2013).https: //doi.org/10.1109/PCS.2013.6737765

  43. [43]

    Transactions on Robotics (T-RO)31(5), 1147–1163 (2015).https://doi.org/10.1109/TRO.2015.2463671

    Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics31(5), 1147–1163 (2015). https://doi.org/10.1109/TRO.2015.2463671

  44. [44]

    IEEE Transactions on Robotics33(5), 1255–1262 (2017).https://doi.org/10.1109/TRO.2017.2705103

    Mur-Artal, R., Tardós, J.D.: Orb-slam2: An open-source slam system for monoc- ular, stereo, and rgb-d cameras. IEEE Transactions on Robotics33(5), 1255–1262 (2017).https://doi.org/10.1109/TRO.2017.2705103

  45. [45]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Murai, R., Dexheimer, E., Davison, A.J.: Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16695–16705 (2025) VOCA: Visual Odometry with Codec Awareness 19

  46. [46]

    ACM Computing Surveys57(12), 1–47 (Jul 2025).https://doi.org/10.1145/3742472,http://dx.doi.org/10

    Peroni, L., Gorinsky, S.: An end-to-end pipeline perspective on video streaming in best-effort networks: A survey and tutorial. ACM Computing Surveys57(12), 1–47 (Jul 2025).https://doi.org/10.1145/3742472,http://dx.doi.org/10. 1145/3742472

  47. [47]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2025)

    Qian, S., Mo, K., Blukis, V., Fouhey, D.F., Fox, D., Goyal, A.: 3D-MVP: 3D Mul- tiview Pretraining for Manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2025)

  48. [48]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Reich, C., Hahn, O., Cremers, D., Roth, S., Debnath, B.: A perspective on deep vision performance with standard image and video codecs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5712– 5721 (2024)

  49. [49]

    In: 2011 International Conference on Computer Vision

    Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: 2011 International Conference on Computer Vision. pp. 2564–2571 (2011).https://doi.org/10.1109/ICCV.2011.6126544

  50. [50]

    In: 2021 International conference on unmanned aircraft systems (ICUAS)

    Rückert,D.,Stamminger,M.:Snake-slam:Efficientglobalvisualinertialslamusing decoupled nonlinear optimization. In: 2021 International conference on unmanned aircraft systems (ICUAS). pp. 219–228. IEEE (2021)

  51. [51]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Sandström,E.,Zhang,G.,Tateno,K.,Oechsle,M.,Niemeyer,M.,Zhang,Y.,Patel, M., Van Gool, L., Oswald, M., Tombari, F.: Splat-slam: Globally optimized rgb- only slam with 3d gaussians. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1680–1691 (2025)

  52. [52]

    Proceedings of the IEEE83(6), 907–924 (1995).https://doi

    Schafer, R., Sikora, T.: Digital video coding standards and their role in video communications. Proceedings of the IEEE83(6), 907–924 (1995).https://doi. org/10.1109/5.387092

  53. [53]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)

  54. [54]

    In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Schubert, D., Goll, T., Demmel, N., Usenko, V., Stückler, J., Cremers, D.: The tum vi benchmark for evaluating visual-inertial odometry. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1680–

  55. [55]

    IEEE Communications Sur- veys & Tutorials17(1), 469–492 (2015).https://doi.org/10.1109/COMST.2014

    Seufert, M., Egger, S., Slanina, M., Zinner, T., Hoßfeld, T., Tran-Gia, P.: A survey on quality of experience of http adaptive streaming. IEEE Communications Sur- veys & Tutorials17(1), 469–492 (2015).https://doi.org/10.1109/COMST.2014. 2360940

  56. [56]

    In: Conference on Computer Vision and Pattern Recognition, CVPR 1994, 21-23 June, 1994, Seattle, WA, USA

    Shi, J., Tomasi, C.: Good features to track. In: Conference on Computer Vision and Pattern Recognition, CVPR 1994, 21-23 June, 1994, Seattle, WA, USA. pp. 593–600. IEEE (1994).https://doi.org/10.1109/CVPR.1994.323794,https: //doi.org/10.1109/CVPR.1994.323794

  57. [57]

    In: 2025 International Conference on 3D Vision (3DV)

    Smith, C., Charatan, D., Tewari, A., Sitzmann, V.: Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. In: 2025 International Conference on 3D Vision (3DV). pp. 389–400. IEEE (2025)

  58. [58]

    In: Springer handbook of robotics, pp

    Stachniss, C., Leonard, J.J., Thrun, S.: Simultaneous localization and mapping. In: Springer handbook of robotics, pp. 1153–1176. Springer (2016)

  59. [59]

    IEEE Transactions on Circuits and Systems for Video Technology22(12), 1649–1668 (2012).https://doi.org/10.1109/TCSVT

    Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on Circuits and Systems for Video Technology22(12), 1649–1668 (2012).https://doi.org/10.1109/TCSVT. 2012.2221191

  60. [60]

    Advances in neural information processing systems34, 16558–16569 (2021) 20 N

    Teed, Z., Deng, J.: Droid-slam: Deep visual slam for monocular, stereo, and rgb- d cameras. Advances in neural information processing systems34, 16558–16569 (2021) 20 N. Hilscher et al

  61. [61]

    Tomasi, C., Kanade, T.: Detection and tracking of point features. Tech. rep., In- ternational Journal of Computer Vision (1991)

  62. [62]

    In: 2023 Seventh IEEE International Conference on Robotic Computing (IRC)

    Turner, R.N., Banerjee, N.K., Banerjee, S.: Mov-slam: Using motion vectors for real-time single-cpu visual slam. In: 2023 Seventh IEEE International Conference on Robotic Computing (IRC). pp. 51–58. IEEE (2023)

  63. [63]

    Ungureanu, D., Bogo, F., Galliani, S., Sama, P., Duan, X., Meekhof, C., Stühmer, J., Cashman, T.J., Tekin, B., Schönberger, J.L., Olszta, P., Pollefeys, M.: HoloLens 2 Research Mode as a Tool for Computer Vision Research (Aug 2020).https: //doi.org/10.48550/arXiv.2008.11239,http://arxiv.org/abs/2008.11239

  64. [64]

    IEEE Robotics Autom

    Usenko, V., Demmel, N., Schubert, D., Stückler, J., Cremers, D.: Visual-inertial mapping with non-linear factor recovery. IEEE Robotics Autom. Lett.5(2), 422– 429 (2020).https://doi.org/10.1109/LRA.2019.2961227,https://doi.org/ 10.1109/LRA.2019.2961227

  65. [65]

    IEEE Robotics and Automation Letters7(2), 1408–1415 (2022)

    Von Stumberg, L., Cremers, D.: Dm-vio: Delayed marginalization visual-inertial odometry. IEEE Robotics and Automation Letters7(2), 1408–1415 (2022)

  66. [66]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  67. [67]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.:π 3: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)

  68. [68]

    IEEE Transactions on Circuits and Systems for Video Tech- nology13(7), 560–576 (2003).https://doi.org/10.1109/TCSVT.2003.815165

    Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A.: Overview of the h.264/avc video coding standard. IEEE Transactions on Circuits and Systems for Video Tech- nology13(7), 560–576 (2003).https://doi.org/10.1109/TCSVT.2003.815165

  69. [69]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wimbauer, F., Chen, W., Muhle, D., Rupprecht, C., Cremers, D.: Anycam: Learn- ing to recover camera poses and intrinsics from casual videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16717–16727 (2025)

  70. [70]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wimbauer, F., Yang, N., Rupprecht, C., Cremers, D.: Behind the scenes: Density fields for single view reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9076–9086 (2023)

  71. [71]

    In: Proceedings of the 31st ACM International Conference on Multimedia

    Zhou, S., Jiang, X., Tan, W., He, R., Yan, B.: Mvflow: Deep optical flow estimation of compressed videos with motion vector prior. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 1964–1974 (2023)

  72. [72]

    In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Zhu, Z., Akkaya, I.B., Waeijen, L., Bondarev, E., Pourtaherian, A., Moreira, O.: MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 29309–29320 (Jun 2025).https://doi.org/10.1109/CVPR52734. 2025.02729,https://ieeexplore.ieee.org/document/11092745

  73. [73]

    In: 2025 IEEE 4th International Conference on Intelligent Reality (ICIR)

    Zouein, J., Javidnia, H., Pitié, F., Kokaram, A.: Leveraging AV1 Motion Vectors for Fast and Dense Feature Matching. In: 2025 IEEE 4th International Conference on Intelligent Reality (ICIR). pp. 1–4.https://doi.org/10.1109/ICIR68135. 2025.11361611

  74. [74]

    fast" -tune

    Zouein, J., Vibhoothi, V., Kokaram, A.: AV1 Motion Vector Fidelity and Applica- tionforEfficientOpticalFlow.In:2025PictureCodingSymposium(PCS).pp.1–5. https://doi.org/10.1109/PCS65673.2025.11417638 VOCA: Visual Odometry with Codec Awareness 1 A Metrics In this paper, we utilize two commonly used metrics to evaluate the performance of Visual Odometry algor...