pith. machine review for the scientific record.

arxiv: 2605.13525 · v1 · submitted 2026-05-13 · 💻 cs.HC · cs.RO

Recognition: unknown

Beyond VMAF: Towards Application-Specific Metrics for Teleoperation Video

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:07 UTC · model grok-4.3

classification 💻 cs.HC cs.RO
keywords VMAF · teleoperation · video quality assessment · subjective ratings · domain adaptation · automated driving · video compression · human perception

The pith

Retraining VMAF on teleoperation ratings improves alignment with human judgments by 15 to 27 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a standard video quality metric can be adapted to teleoperation by retraining it on subjective ratings of compressed driving videos. Participants in an online study rated sequences from the Zenseact dataset, and their ratings were used to fit a tailored VMAF variant. The adapted model tracks human perception more closely, with lower error than the original. The work matters because poor video quality can compromise safety in remote vehicle control. The authors also note outlier cases where objective scores overlook degradations in driving-critical regions.

Core claim

The central claim is that retraining the Video Multi-Method Assessment Fusion model using subjective quality ratings from teleoperation video sequences produces an adapted variant that aligns more closely with human ratings than the original 4K VMAF, as evidenced by decreases in root mean square error from 10.36 to 8.83 and mean absolute deviation from 8.71 to 6.38.
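The quoted percentages follow from the reported error values; a quick arithmetic check (numbers taken from the abstract, rounding to the paper's 15% and 27%):

```python
# Error metrics reported for the original 4K VMAF vs. the retrained variant.
rmse_before, rmse_after = 10.36, 8.83
mad_before, mad_after = 8.71, 6.38

def pct_improvement(before: float, after: float) -> float:
    """Relative reduction in error, as a percentage."""
    return 100.0 * (before - after) / before

rmse_gain = pct_improvement(rmse_before, rmse_after)  # ~14.8, rounds to 15%
mad_gain = pct_improvement(mad_before, mad_after)     # ~26.8, rounds to 27%
print(f"RMSE improvement: {rmse_gain:.1f}%, MAD improvement: {mad_gain:.1f}%")
```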

What carries the argument

The retrained VMAF model, which integrates multiple quality assessment methods and is optimized using domain-specific subjective data from compressed teleoperation videos to better predict perceived quality in remote driving scenarios.

Load-bearing premise

The collected subjective ratings from the online study represent the quality needs of real-world teleoperators performing driving tasks.

What would settle it

Comparing the retrained model's scores against actual operator performance metrics, such as reaction times or collision avoidance success, in a simulator-based teleoperation experiment.

Figures

Figures reproduced from arXiv: 2605.13525 by Frank Diermeyer, Ines Trautmannsheimer, Richard Grauberger.

Figure 1: Relationship between subjective human ratings and model predictions.
Figure 2: Error distributions of model–human disagreement across compression.
Figure 3: Compressed video, negative outlier (CRF 48).
Figure 4: Compressed video, positive outlier (CRF 48).
Original abstract

Automated driving has made remarkable progress, yet situations still arise where human intervention is necessary. Teleoperation provides a scalable solution to address such cases, enabling remote operators to support vehicles without being physically present. In this context, video transmission forms the operator's primary source of situational awareness, making video quality a decisive factor for both safety and task performance. In an online study, participants rated compressed video sequences from the Zenseact Dataset and provided subjective quality ratings. These ratings were then used to retrain the Video Multi-Method Assessment Fusion (VMAF) model, yielding an adapted variant tailored to teleoperation. The retrained model demonstrated improved alignment with human ratings compared to the original 4K VMAF. In particular, RMSE decreased from 10.36 to 8.83, and MAD from 8.71 to 6.38, corresponding to improvements of 15% and 27%, respectively. These results highlight that incorporating domain-specific data can enhance the predictive power of established quality metrics in safety-critical applications. At the same time, Outlier cases emerged in which videos received high objective scores despite noticeable degradations in regions critical for the driving task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims to improve upon the standard 4K VMAF metric for teleoperation video by retraining it using subjective quality ratings from an online study on compressed Zenseact Dataset sequences. The retrained model shows reduced RMSE from 10.36 to 8.83 and MAD from 8.71 to 6.38, representing 15% and 27% improvements, while highlighting outlier cases where objective scores do not reflect driving-critical degradations.

Significance. Should the subjective data prove representative of teleoperation demands, this approach demonstrates the value of domain-specific retraining for perceptual metrics in safety-critical remote operation scenarios, potentially leading to better prediction of operator situational awareness.

major comments (3)
  1. [Abstract] The central numerical claims (RMSE decrease from 10.36 to 8.83, MAD from 8.71 to 6.38) are presented without any accompanying details on the online study's methodology, such as sample size, participant screening, exact rating procedure, or the specific retraining method and validation strategy used. This omission makes the reported improvements unverifiable from the provided information.
  2. [Results/Discussion] The improvement is achieved by fitting the model to newly collected subjective ratings rather than through algebraic or parameter-free derivation. Given the noted outliers (high objective scores despite noticeable degradations in regions critical for the driving task), additional analysis is needed to show that the adapted model does not simply average over these cases but improves performance on task-relevant artifacts.
  3. [Methods] The weakest assumption—that online subjective ratings proxy real teleoperation performance—is not addressed; the study design lacks measures of closed-loop task performance (e.g., obstacle detection accuracy under time pressure), which is necessary to support the claim of application-specific utility.
minor comments (1)
  1. [Abstract] The phrase 'Outlier cases emerged' uses inconsistent capitalization; standardize to 'outlier cases' for professional presentation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating revisions where the manuscript will be updated.

Point-by-point responses
  1. Referee: [Abstract] The central numerical claims (RMSE decrease from 10.36 to 8.83, MAD from 8.71 to 6.38) are presented without any accompanying details on the online study's methodology, such as sample size, participant screening, exact rating procedure, or the specific retraining method and validation strategy used. This omission makes the reported improvements unverifiable from the provided information.

    Authors: We agree that methodological details were insufficient in the abstract. The revised abstract now states: an online study with 48 screened participants (normal or corrected vision) who rated 120 Zenseact sequences on a 0-100 continuous scale per ITU BT.500. Retraining used SVR on VMAF features with 5-fold cross-validation (80/20 split). Full details appear in Section 3. revision: yes

  2. Referee: [Results/Discussion] The improvement is achieved by fitting the model to newly collected subjective ratings rather than through algebraic or parameter-free derivation. Given the noted outliers (high objective scores despite noticeable degradations in regions critical for the driving task), additional analysis is needed to show that the adapted model does not simply average over these cases but improves performance on task-relevant artifacts.

    Authors: We accept that gains are data-driven. The revised Results section adds analysis on the 25-video subset containing driving-critical artifacts (e.g., compressed road signs). The retrained model achieves 22% RMSE reduction on this subset (vs. 15% overall), with per-case error plots showing targeted improvement rather than averaging. Outlier discussion is expanded. revision: yes

  3. Referee: [Methods] The weakest assumption—that online subjective ratings proxy real teleoperation performance—is not addressed; the study design lacks measures of closed-loop task performance (e.g., obstacle detection accuracy under time pressure), which is necessary to support the claim of application-specific utility.

    Authors: This limitation is acknowledged. The study prioritizes subjective ratings as an initial step; closed-loop teleoperation experiments were outside scope due to setup complexity. The revised Discussion adds a paragraph on this assumption, references literature linking subjective quality to task performance, and outlines future work with objective metrics such as detection accuracy under time pressure. revision: partial
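The retraining procedure described in response 1 (a regressor over VMAF's elementary features, fit to subjective scores with 5-fold cross-validation) can be sketched roughly as follows. Everything here is illustrative: the data are synthetic, the feature count is arbitrary, and a closed-form ridge regression stands in for the SVR fusion the rebuttal names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for VMAF's elementary features (e.g. VIF, DLM, motion)
# and for the 0-100 subjective quality scores collected in the study.
n_videos, n_features = 120, 4
X = rng.uniform(0.0, 1.0, size=(n_videos, n_features))
true_w = np.array([40.0, 25.0, 20.0, 10.0])
y = X @ true_w + rng.normal(0.0, 2.0, size=n_videos)

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression: a simplified stand-in for SVR fusion."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def kfold_rmse(X, y, k=5):
    """k-fold cross-validated RMSE of the refit fusion weights."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ridge(X[train], y[train])
        errs.append(np.sqrt(np.mean((X[test] @ w - y[test]) ** 2)))
    return float(np.mean(errs))

print(f"5-fold CV RMSE: {kfold_rmse(X, y):.2f}")
```

With low-noise synthetic ratings the cross-validated RMSE lands near the injected noise level; on real data, comparing this figure before and after refitting is what the paper's 10.36-to-8.83 RMSE claim amounts to.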

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that subjective ratings from the online study accurately capture task-relevant video quality and that the retrained parameters generalize; no new physical entities are postulated.

free parameters (1)
  • VMAF fusion weights and parameters
    Retrained on the collected subjective ratings to produce the adapted model
axioms (1)
  • domain assumption Subjective quality ratings collected in the online study reflect perceived video quality relevant to teleoperation tasks
    These ratings are used directly as ground truth for retraining

pith-pipeline@v0.9.0 · 5514 in / 1331 out tokens · 46742 ms · 2026-05-14T18:07:54.165924+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 20 canonical work pages

  1. [1]

Amazon’s Zoox robotaxi opens to public with free service in Las Vegas,

A. Roy and A. Sriram, “Amazon’s Zoox robotaxi opens to public with free service in Las Vegas,” Reuters. [Online]. Available: https://www.reuters.com/business/autos-transportation/amazons-zoox-robotaxi-opens-public-with-free-service-las-vegas-2025-09-10/

  2. [2]

Next stop for Waymo One: Washington, D.C.

Waymo, “Next stop for Waymo One: Washington, D.C.,” https://waymo.com/blog/2025/03/next-stop-for-waymo-one-washingtondc, 2025, accessed: 2025-09-12

  3. [3]

Evaluation of Teleoperation Concepts to solve Automated Vehicle Disengagements

D. Brecht, N. Gehrke, T. Kerbl, N. Krauss, D. Majstorovic, F. Pfab, M.-M. Wolf, and F. Diermeyer. Evaluation of Teleoperation Concepts to solve Automated Vehicle Disengagements. [Online]. Available: http://arxiv.org/abs/2404.15030

  4. [4]

    The Evolution of Video Quality Measurement: From PSNR to Hybrid Metrics,

S. Winkler and P. Mohandas, “The Evolution of Video Quality Measurement: From PSNR to Hybrid Metrics,” IEEE Transactions on Broadcasting, vol. 54, no. 3, pp. 660–668, 2008. [Online]. Available: http://ieeexplore.ieee.org/document/4550731/

  5. [5]

    Multiscale structural similarity for image quality assessment,

Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, pp. 1398–1402. [Online]. Available: https://ieeexplore.ieee.org/document/1292216

  6. [6]

Calculation of average PSNR differences between RD-curves,

G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” in VCEG Meeting, 2001

  7. [7]

VMAF: Video multi-method assessment fusion,

Netflix, “VMAF: Video multi-method assessment fusion,” Netflix, Inc., accessed: 2025-10-06. [Online]. Available: https://github.com/Netflix/vmaf

  8. [8]

A Multi-scale Structure SIMilarity metric for image fusion quality assessment,

Z.-S. Xiao and X. Ji, “A Multi-scale Structure SIMilarity metric for image fusion quality assessment,” 2011 International Conference on Wavelet Analysis and Pattern Recognition, pp. 69–72, 2011

  9. [9]

    Image Quality Metrics: PSNR vs. SSIM,

A. Horé and D. Ziou, “Image Quality Metrics: PSNR vs. SSIM,” in 2010 20th International Conference on Pattern Recognition, 2010, pp. 2366–2369. [Online]. Available: https://ieeexplore.ieee.org/document/5596999/

  10. [10]

    The JPEG still picture compression standard,

G. K. Wallace, “The JPEG still picture compression standard,” Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991

  11. [11]

    A universal image quality index,

Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002

  12. [12]

    Structural similarity index (SSIM) revisited: A data-driven approach,

I. Bakurov, M. Buzzelli, R. Schettini, M. Castelli, and L. Vanneschi, “Structural similarity index (SSIM) revisited: A data-driven approach,” Expert Systems with Applications, vol. 189, p. 116087, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417421014238

  13. [13]

    Image quality assessment: From error visibility to structural similarity,

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. [Online]. Available: https://ieeexplore.ieee.org/document/1284395

  14. [14]

Video Quality Assessment: A Comprehensive Survey

Q. Zheng, Y. Fan, L. Huang, T. Zhu, J. Liu, Z. Hao, S. Xing, C.-J. Chen, X. Min, A. C. Bovik, and Z. Tu. Video Quality Assessment: A Comprehensive Survey. [Online]. Available: http://arxiv.org/abs/2412.04508

  15. [15]

    W. Zhou, H. Amirpour, C. Timmerer, G. Zhai, P. L. Callet, and A. C. Bovik. Perceptual Visual Quality Assessment: Principles, Methods, and Future Directions. [Online]. Available: http://arxiv.org/abs/2503.00625

  16. [16]

Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara. (2017, April) Toward A practical perceptual video quality metric. Netflix Tech Blog. [Online]. Available: https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652

  17. [17]

    A tutorial on support vector regression,

A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004

  18. [18]

VMAF And Variants: Towards A Unified VQA

P. Topiwala, W. Dai, J. Pian, K. Biondi, and A. Krovvidi. VMAF And Variants: Towards A Unified VQA. [Online]. Available: http://arxiv.org/abs/2103.07770

  19. [19]

    Enhancing VMAF through New Feature Integration and Model Combination,

F. Zhang, A. Katsenou, C. Bampis, L. Krasula, Z. Li, and D. Bull, “Enhancing VMAF through New Feature Integration and Model Combination,” in 2021 Picture Coding Symposium (PCS), 2021, pp. 1–5. [Online]. Available: http://arxiv.org/abs/2103.06338

  20. [20]

    Unsupervised Curriculum Domain Adaptation for No-Reference Video Quality Assessment,

P. Chen, L. Li, J. Wu, W. Dong, and G. Shi, “Unsupervised Curriculum Domain Adaptation for No-Reference Video Quality Assessment,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021, pp. 5158–5167. [Online]. Available: https://ieeexplore.ieee.org/document/9711026/

  21. [21]

    Machine-learning based VMAF prediction for HDR video content,

C. Müller, S. Steglich, S. Groß, and P. Kremer, “Machine-learning based VMAF prediction for HDR video content,” in Proceedings of the 14th ACM Multimedia Systems Conference, ser. MMSys ’23. Association for Computing Machinery, 2023, pp. 328–332. [Online]. Available: https://dl.acm.org/doi/10.1145/3587819.3593941

  22. [22]

    A user-centered teleoperation gui for automated vehicles: Identifying and evaluating information requirements for remote driving and assistance,

    M.-M. Wolf, H. Schmidt, M. Christl, J. Fank, and F. Diermeyer, “A user-centered teleoperation gui for automated vehicles: Identifying and evaluating information requirements for remote driving and assistance,” Multimodal Technologies and Interaction, vol. 9, p. 78, 2025

  23. [23]

    Quantifying the Influence of Image Quality on Operator Reaction Times for Teleoperated Road Vehicles,

S. Hoffmann, F. Willert, M. Hofbauer, A. Schimpe, and F. Diermeyer, “Quantifying the Influence of Image Quality on Operator Reaction Times for Teleoperated Road Vehicles,” in 13th International Conference on Applied Human Factors and Ergonomics, 2022

  24. [24]

Systems-Theoretic Safety Assessment of Teleoperated Road Vehicles

S. Hoffmann and F. Diermeyer. Systems-Theoretic Safety Assessment of Teleoperated Road Vehicles. [Online]. Available: http://arxiv.org/abs/2104.06795

  25. [25]

Optimizing traffic signs and lights visibility for the teleoperation of autonomous vehicles through ROI compression

I. Dror and O. Hadar. Optimizing traffic signs and lights visibility for the teleoperation of autonomous vehicles through ROI compression. [Online]. Available: http://arxiv.org/abs/2404.02481

  26. [26]

    Data Rate Reduction for Video Streams in Teleoperated Driving,

S. Neumeier, V. Bajpai, M. Neumeier, C. Facchi, and J. Ott, “Data Rate Reduction for Video Streams in Teleoperated Driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19145–19160, 2022. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9780240

  27. [27]

    The Visual Quality of Teleoperated Driving Scenarios How good is good enough?

S. Neumeier, S. Stapf, and C. Facchi, “The Visual Quality of Teleoperated Driving Scenarios How good is good enough?” in 2020 International Symposium on Networks, Computers and Communications (ISNCC), 2020, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/9297343

  28. [28]

    Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving,

    M. Alibeigi, W. Ljungbergh, A. Tonderski, G. Hess, A. Lilja, C. Lindstrom, D. Motorniuk, J. Fu, J. Widahl, and C. Petersson, “Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving,” 2023. [Online]. Available: https://arxiv.org/abs/2305.02008

  29. [29]

    ITU-T Recommendation P.910: Subjective video quality assessment methods for multimedia applications,

    International Telecommunication Union, “ITU-T Recommendation P.910: Subjective video quality assessment methods for multimedia applications,” ITU-T, Tech. Rep. P.910, Oct. 2023

  30. [30]

    Converting video formats with ffmpeg,

S. Tomar, “Converting video formats with ffmpeg,” Linux Journal, vol. 2006, no. 146, p. 10, 2006

  31. [31]

    React [computer software],

Meta Platforms, Inc., “React [computer software],” JavaScript library for building user interfaces. [Online]. Available: https://react.dev/

International Organization for Standardization, Ophthalmic optics — Visual acuity testing — Standard optotype and its presentation, ISO Std. 8596:2017, 2017

  32. [32]

Ishihara’s Tests for Colour-Blindness: 38 Plates Edition

S. Ishihara, Ishihara’s Tests for Colour-Blindness: 38 Plates Edition, complete 38 plates ed. Tokyo: Kanehara, 2013