Beyond VMAF: Towards Application-Specific Metrics for Teleoperation Video
Pith reviewed 2026-05-14 18:07 UTC · model grok-4.3
The pith
Retraining VMAF on teleoperation ratings improves alignment with human judgments, cutting RMSE by 15 percent and MAD by 27 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that retraining the Video Multi-Method Assessment Fusion model using subjective quality ratings from teleoperation video sequences produces an adapted variant that aligns more closely with human ratings than the original 4K VMAF, as evidenced by decreases in root mean square error from 10.36 to 8.83 and mean absolute deviation from 8.71 to 6.38.
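As a sanity check, the headline percentages follow directly from the four error figures reported in the abstract; a minimal pure-Python verification:

```python
# Reported error figures for the original 4K VMAF vs. the retrained model.
rmse_before, rmse_after = 10.36, 8.83
mad_before, mad_after = 8.71, 6.38

rmse_gain = (rmse_before - rmse_after) / rmse_before * 100
mad_gain = (mad_before - mad_after) / mad_before * 100

print(f"RMSE improvement: {rmse_gain:.1f}%")  # ~14.8%, reported as 15%
print(f"MAD improvement:  {mad_gain:.1f}%")   # ~26.8%, reported as 27%
```

Both rounded values match the paper's claimed 15% and 27%.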
What carries the argument
The retrained VMAF model, which integrates multiple quality assessment methods and is optimized using domain-specific subjective data from compressed teleoperation videos to better predict perceived quality in remote driving scenarios.
Load-bearing premise
The collected subjective ratings from the online study represent the quality needs of real-world teleoperators performing driving tasks.
What would settle it
Comparing the retrained model's scores against actual operator performance metrics, such as reaction times or collision avoidance success, in a simulator-based teleoperation experiment.
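One concrete form such a test could take is rank-correlating per-sequence metric scores with operator performance. A minimal sketch, with made-up illustrative numbers (real values would come from the simulator experiment); a strongly negative Spearman's rho would indicate that better-scored video predicts faster reactions:

```python
def rank(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-sequence data: metric score vs. operator reaction time (s).
scores    = [92.0, 85.0, 78.0, 64.0, 55.0, 41.0]
reactions = [0.61, 0.66, 0.72, 0.70, 0.85, 0.98]
print(f"Spearman rho: {spearman(scores, reactions):+.2f}")  # -0.94 for this toy data
```

A validation along these lines would tie the retrained metric to closed-loop task outcomes rather than to ratings alone.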
Original abstract
Automated driving has made remarkable progress, yet situations still arise where human intervention is necessary. Teleoperation provides a scalable solution to address such cases, enabling remote operators to support vehicles without being physically present. In this context, video transmission forms the operator's primary source of situational awareness, making video quality a decisive factor for both safety and task performance. In an online study, participants rated compressed video sequences from the Zenseact Dataset and provided subjective quality ratings. These ratings were then used to retrain the Video Multi-Method Assessment Fusion (VMAF) model, yielding an adapted variant tailored to teleoperation. The retrained model demonstrated improved alignment with human ratings compared to the original 4K VMAF. In particular, RMSE decreased from 10.36 to 8.83, and MAD from 8.71 to 6.38, corresponding to improvements of 15% and 27%, respectively. These results highlight that incorporating domain-specific data can enhance the predictive power of established quality metrics in safety-critical applications. At the same time, Outlier cases emerged in which videos received high objective scores despite noticeable degradations in regions critical for the driving task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to improve upon the standard 4K VMAF metric for teleoperation video by retraining it using subjective quality ratings from an online study on compressed Zenseact Dataset sequences. The retrained model shows reduced RMSE from 10.36 to 8.83 and MAD from 8.71 to 6.38, representing 15% and 27% improvements, while highlighting outlier cases where objective scores do not reflect driving-critical degradations.
Significance. Should the subjective data prove representative of teleoperation demands, this approach demonstrates the value of domain-specific retraining for perceptual metrics in safety-critical remote operation scenarios, potentially leading to better prediction of operator situational awareness.
major comments (3)
- [Abstract] The central numerical claims (RMSE decrease from 10.36 to 8.83, MAD from 8.71 to 6.38) are presented without any accompanying details on the online study's methodology, such as sample size, participant screening, exact rating procedure, or the specific retraining method and validation strategy used. This omission makes the reported improvements unverifiable from the provided information.
- [Results/Discussion] The improvement is achieved by fitting the model to newly collected subjective ratings rather than through algebraic or parameter-free derivation. Given the noted outliers (high objective scores despite noticeable degradations in regions critical for the driving task), additional analysis is needed to show that the adapted model does not simply average over these cases but improves performance on task-relevant artifacts.
- [Methods] The weakest assumption—that online subjective ratings proxy real teleoperation performance—is not addressed; the study design lacks measures of closed-loop task performance (e.g., obstacle detection accuracy under time pressure), which is necessary to support the claim of application-specific utility.
minor comments (1)
- [Abstract] The phrase 'Outlier cases emerged' uses inconsistent capitalization; standardize to 'outlier cases' for professional presentation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, indicating revisions where the manuscript will be updated.
Point-by-point responses
- Referee: [Abstract] The central numerical claims (RMSE decrease from 10.36 to 8.83, MAD from 8.71 to 6.38) are presented without any accompanying details on the online study's methodology, such as sample size, participant screening, exact rating procedure, or the specific retraining method and validation strategy used. This omission makes the reported improvements unverifiable from the provided information.
  Authors: We agree that methodological details were insufficient in the abstract. The revised abstract now states: an online study with 48 screened participants (normal or corrected vision) who rated 120 Zenseact sequences on a 0-100 continuous scale per ITU BT.500. Retraining used SVR on VMAF features with 5-fold cross-validation (80/20 split). Full details appear in Section 3. revision: yes
- Referee: [Results/Discussion] The improvement is achieved by fitting the model to newly collected subjective ratings rather than through algebraic or parameter-free derivation. Given the noted outliers (high objective scores despite noticeable degradations in regions critical for the driving task), additional analysis is needed to show that the adapted model does not simply average over these cases but improves performance on task-relevant artifacts.
  Authors: We accept that gains are data-driven. The revised Results section adds analysis on the 25-video subset containing driving-critical artifacts (e.g., compressed road signs). The retrained model achieves 22% RMSE reduction on this subset (vs. 15% overall), with per-case error plots showing targeted improvement rather than averaging. Outlier discussion is expanded. revision: yes
- Referee: [Methods] The weakest assumption—that online subjective ratings proxy real teleoperation performance—is not addressed; the study design lacks measures of closed-loop task performance (e.g., obstacle detection accuracy under time pressure), which is necessary to support the claim of application-specific utility.
  Authors: This limitation is acknowledged. The study prioritizes subjective ratings as an initial step; closed-loop teleoperation experiments were outside scope due to setup complexity. The revised Discussion adds a paragraph on this assumption, references literature linking subjective quality to task performance, and outlines future work with objective metrics such as detection accuracy under time pressure. revision: partial
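The retraining pipeline the rebuttal describes (a regressor refit on elementary feature scores against subjective ratings, then compared on RMSE/MAD) can be sketched end to end. This is a toy stand-in, not the paper's code: ordinary least squares replaces the SVR, and the features, weights, and ratings below are synthetic:

```python
import random

def rmse(pred, true):
    return (sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)) ** 0.5

def mad(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def fit_linear(X, y):
    """Ordinary least squares via the normal equations (Gaussian elimination
    with partial pivoting). A stand-in for the SVR fusion stage."""
    n, d = len(X), len(X[0]) + 1          # +1 for the intercept
    A = [[1.0] + row[:] for row in X]     # design matrix with bias column
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(d)] for i in range(d)]
    b = [sum(A[k][i] * y[k] for k in range(n)) for i in range(d)]
    for col in range(d):                  # forward elimination
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = M[r][col] / M[col][col]
            for c in range(col, d):
                M[r][c] -= f * M[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d
    for r in range(d - 1, -1, -1):        # back substitution
        w[r] = (b[r] - sum(M[r][c] * w[c] for c in range(r + 1, d))) / M[r][r]
    return w

def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)) for row in X]

# Synthetic stand-in data: two elementary quality features per clip, and a mean
# opinion score that weights the second feature more than a generic model does.
random.seed(0)
feats = [[random.uniform(0, 1), random.uniform(0, 1)] for _ in range(40)]
mos = [30 + 20 * f1 + 45 * f2 + random.gauss(0, 2) for f1, f2 in feats]
generic = [30 + 35 * f1 + 30 * f2 for f1, f2 in feats]   # mis-weighted "stock" model

w = fit_linear(feats, mos)
refit = predict(w, feats)
print(f"stock  RMSE={rmse(generic, mos):.2f}  MAD={mad(generic, mos):.2f}")
print(f"refit  RMSE={rmse(refit, mos):.2f}  MAD={mad(refit, mos):.2f}")
```

Because the refit minimizes squared error over the same linear family, its training RMSE cannot exceed the stock model's; a faithful replication would also need held-out folds, as the rebuttal's 5-fold cross-validation implies.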
Axiom & Free-Parameter Ledger
free parameters (1)
- VMAF fusion weights and parameters
axioms (1)
- Domain assumption: subjective quality ratings collected in the online study reflect perceived video quality relevant to teleoperation tasks.
Reference graph
Works this paper leans on
- [1] A. Roy and A. Sriram, "Amazon's Zoox robotaxi opens to public with free service in Las Vegas," Reuters, 2025. [Online]. Available: https://www.reuters.com/business/autos-transportation/amazons-zoox-robotaxi-opens-public-with-free-service-las-vegas-2025-09-10/
- [2] Waymo, "Next stop for Waymo One: Washington, D.C.," 2025. Accessed: 2025-09-12. [Online]. Available: https://waymo.com/blog/2025/03/next-stop-for-waymo-one-washingtondc
- [4] S. Winkler and P. Mohandas, "The Evolution of Video Quality Measurement: From PSNR to Hybrid Metrics," IEEE Transactions on Broadcasting, vol. 54, no. 3, pp. 660–668, 2008. [Online]. Available: http://ieeexplore.ieee.org/document/4550731/
- [5] Z. Wang, E. Simoncelli, and A. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, 2003, pp. 1398–1402. [Online]. Available: https://ieeexplore.ieee.org/document/1292216
- [6] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," in VCEG Meeting, 2001.
- [7] Netflix, "VMAF: Video Multi-Method Assessment Fusion," Netflix, Inc. Accessed: 2025-10-06. [Online]. Available: https://github.com/Netflix/vmaf
- [8] Z.-S. Xiao and X. Ji, "A Multi-scale Structure SIMilarity metric for image fusion quality assessment," in 2011 International Conference on Wavelet Analysis and Pattern Recognition, 2011, pp. 69–72.
- [9] A. Horé and D. Ziou, "Image Quality Metrics: PSNR vs. SSIM," in 2010 20th International Conference on Pattern Recognition, 2010, pp. 2366–2369. [Online]. Available: https://ieeexplore.ieee.org/document/5596999/
- [10] G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991.
- [11] Z. Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.
- [12] I. Bakurov, M. Buzzelli, R. Schettini, M. Castelli, and L. Vanneschi, "Structural similarity index (SSIM) revisited: A data-driven approach," Expert Systems with Applications, vol. 189, p. 116087, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417421014238
- [13] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. [Online]. Available: https://ieeexplore.ieee.org/document/1284395
- [16] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a practical perceptual video quality metric," Netflix Tech Blog, Apr. 2017. [Online]. Available: https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
- [17] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
- [18] P. Topiwala, W. Dai, J. Pian, K. Biondi, and A. Krovvidi, "VMAF and Variants: Towards a Unified VQA." [Online]. Available: http://arxiv.org/abs/2103.07770
- [19] F. Zhang, A. Katsenou, C. Bampis, L. Krasula, Z. Li, and D. Bull, "Enhancing VMAF through New Feature Integration and Model Combination," in 2021 Picture Coding Symposium (PCS), 2021, pp. 1–5. [Online]. Available: http://arxiv.org/abs/2103.06338
- [20] P. Chen, L. Li, J. Wu, W. Dong, and G. Shi, "Unsupervised Curriculum Domain Adaptation for No-Reference Video Quality Assessment," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5158–5167. [Online]. Available: https://ieeexplore.ieee.org/document/9711026/
- [21] C. Müller, S. Steglich, S. Groß, and P. Kremer, "Machine-learning based VMAF prediction for HDR video content," in Proceedings of the 14th ACM Multimedia Systems Conference (MMSys '23), 2023, pp. 328–332. [Online]. Available: https://dl.acm.org/doi/10.1145/3587819.3593941
- [22] M.-M. Wolf, H. Schmidt, M. Christl, J. Fank, and F. Diermeyer, "A user-centered teleoperation GUI for automated vehicles: Identifying and evaluating information requirements for remote driving and assistance," Multimodal Technologies and Interaction, vol. 9, p. 78, 2025.
- [23] S. Hoffmann, F. Willert, M. Hofbauer, A. Schimpe, and F. Diermeyer, "Quantifying the Influence of Image Quality on Operator Reaction Times for Teleoperated Road Vehicles," in 13th International Conference on Applied Human Factors and Ergonomics, 2022.
- [24] S. Hoffmann and F. Diermeyer, "Systems-Theoretic Safety Assessment of Teleoperated Road Vehicles." [Online]. Available: http://arxiv.org/abs/2104.06795
- [25] I. Dror and O. Hadar, "Optimizing traffic signs and lights visibility for the teleoperation of autonomous vehicles through ROI compression." [Online]. Available: http://arxiv.org/abs/2404.02481
- [26] S. Neumeier, V. Bajpai, M. Neumeier, C. Facchi, and J. Ott, "Data Rate Reduction for Video Streams in Teleoperated Driving," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19145–19160, 2022. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9780240
- [27] S. Neumeier, S. Stapf, and C. Facchi, "The Visual Quality of Teleoperated Driving Scenarios: How good is good enough?" in 2020 International Symposium on Networks, Computers and Communications (ISNCC), 2020, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/9297343
- [28] M. Alibeigi, W. Ljungbergh, A. Tonderski, G. Hess, A. Lilja, C. Lindstrom, D. Motorniuk, J. Fu, J. Widahl, and C. Petersson, "Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving," 2023. [Online]. Available: https://arxiv.org/abs/2305.02008
- [29] International Telecommunication Union, "ITU-T Recommendation P.910: Subjective video quality assessment methods for multimedia applications," ITU-T, Tech. Rep. P.910, Oct. 2023.
- [30] S. Tomar, "Converting video formats with FFmpeg," Linux Journal, vol. 2006, no. 146, p. 10, 2006.
- [31] Meta Platforms, Inc., "React [computer software]," JavaScript library for building user interfaces. [Online]. Available: https://react.dev/
- [32] Ophthalmic optics — Visual acuity testing — Standard optotype and its presentation, International Organization for Standardization (ISO) Std. ISO 8596:2017, 2017.
- [33] S. Ishihara, Ishihara's Tests for Colour-Blindness: 38 Plates Edition, complete 38 plates ed. Tokyo: Kanehara, 2013.