arXiv: 2604.08703 · v1 · submitted 2026-04-09 · 💻 cs.MM · cs.DB · cs.LG

Recognition: unknown

QoS-QoE Translation with Large Language Model

Ahmadreza Eslaminia, Kaizhuo Yan, Klara Nahrstedt, Lingzhi Zhao, Mingyuan Wu, Yingjie Yu

Pith reviewed 2026-05-10 16:57 UTC · model grok-4.3

classification 💻 cs.MM · cs.DB · cs.LG

keywords QoS-QoE translation · large language models · video streaming · multimedia systems · dataset construction · supervised fine-tuning · quality prediction


The pith

A literature-extracted dataset enables large language models to perform accurate bidirectional translation between QoS metrics and QoE scores after fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior QoS-QoE studies produced scattered findings tied to specific setups, which limited systematic reuse and cross-scenario application. The paper addresses this by creating a structured dataset of QoS-QoE relationships drawn from multimedia literature, mainly video streaming, through an automated pipeline of paper curation, relationship extraction, and iterative validation. Each entry includes the extracted mapping, parameter definitions, supporting evidence, and context. When large language models are fine-tuned on this dataset they demonstrate strong results on both continuous-value regression and discrete-label classification, and they handle translation in both directions. The dataset is released publicly to support benchmarking and future LLM-driven quality prediction and optimization in multimedia systems.

Core claim

The QoS-QoE Translation dataset, built automatically from multimedia papers, equips large language models to achieve strong performance on continuous-value and discrete-label prediction for both QoS-to-QoE and QoE-to-QoS translation after supervised fine-tuning.
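To make the two translation directions concrete, the sketch below turns one relationship into a pair of prompt/completion examples of the kind commonly used for supervised fine-tuning. The field names, the prompt wording, and the sample values are illustrative assumptions; the paper's actual training format is not described on this page.

```python
import json


def make_sft_pairs(qos: dict, qoe: dict, context: str) -> list[dict]:
    """Build one QoS->QoE and one QoE->QoS example from a single relationship.

    `qos`, `qoe`, and `context` are hypothetical fields; the released dataset
    may organise its records differently.
    """
    qos_text = json.dumps(qos, sort_keys=True)
    qoe_text = json.dumps(qoe, sort_keys=True)
    return [
        {
            "direction": "qos_to_qoe",
            "prompt": f"Context: {context}\nQoS: {qos_text}\nPredict the QoE.",
            "completion": qoe_text,
        },
        {
            "direction": "qoe_to_qos",
            "prompt": f"Context: {context}\nQoE: {qoe_text}\nPredict the QoS.",
            "completion": qos_text,
        },
    ]


# Made-up values, for illustration only.
pairs = make_sft_pairs(
    qos={"rebuffering_ratio": 0.05, "bitrate_kbps": 1500},
    qoe={"mos": 3.2},
    context="HTTP adaptive video streaming",
)
for pair in pairs:
    print(pair["direction"], "->", pair["completion"])
```

Generating both directions from the same relationship keeps the two tasks aligned on identical evidence, which is one plausible way to set up a bidirectional benchmark.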

What carries the argument

The QoS-QoE Translation dataset, which stores structured relationships along with parameter definitions, evidence, and metadata, used as training data for LLM fine-tuning on bidirectional translation tasks.
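As a rough picture of what such a record could look like, here is a minimal sketch in Python. The four fields mirror the components named above (relationship, parameter definitions, evidence, metadata), but the class name, field names, and example values are assumptions; the real schema is the one shown in the paper's Figure 2 and the released dataset.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class QoSQoERecord:
    """Hypothetical record layout; not the dataset's actual schema."""

    relationship: str                 # the extracted QoS-QoE mapping, as text
    parameters: dict[str, str]        # parameter name -> definition
    evidence: str                     # supporting passage from the source paper
    metadata: dict[str, str] = field(default_factory=dict)  # context, e.g. protocol

    def is_source_grounded(self) -> bool:
        """A record counts as grounded here if it has evidence and definitions."""
        return bool(self.evidence.strip()) and bool(self.parameters)


# Made-up content, for illustration only.
record = QoSQoERecord(
    relationship="Higher rebuffering ratio lowers MOS",
    parameters={
        "rebuffering_ratio": "stall time divided by playback time",
        "mos": "mean opinion score on a 1-5 scale",
    },
    evidence="(quoted passage from the source paper would go here)",
    metadata={"service": "video streaming"},
)
print(json.dumps(asdict(record), indent=2))
```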

If this is right

Enables systematic reuse of prior QoS-QoE findings across different experimental setups.
Supports large-scale analysis and cross-scenario generalization in multimedia quality modeling.
Provides a benchmark for evaluating LLMs on QoS-QoE and QoE-QoS tasks.
Facilitates LLM-based reasoning pipelines for multimedia quality prediction and optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Real-time streaming systems could use the fine-tuned model to adjust parameters dynamically based on predicted user experience.
The same extraction-plus-fine-tuning method could be applied to other domains where relationships are scattered across papers.
Continuous-value outputs might allow finer-grained control loops than discrete labels alone.

Load-bearing premise

The automated pipeline of paper curation, relationship extraction, and iterative evaluation captures valid QoS-QoE relationships from the literature without major extraction errors or selection biases.

What would settle it

Compare the fine-tuned LLM's predicted QoE values or labels against new experimental measurements collected under conditions absent from the original dataset.
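One way to run that comparison is sketched below: keep only the measurements whose conditions never appeared in the dataset, then score the model's predictions against them. The condition labels, field names, and numbers are hypothetical, and mean absolute error is just one reasonable choice of score.

```python
from statistics import mean


def unseen_only(measurements, seen_conditions):
    """Keep measurements whose condition never appeared in the training data."""
    return [m for m in measurements if m["condition"] not in seen_conditions]


def mean_absolute_error(rows):
    """Average absolute gap between predicted and measured QoE."""
    return mean(abs(r["predicted_mos"] - r["measured_mos"]) for r in rows)


# Made-up conditions and scores, for illustration only.
seen_conditions = {"wired/1080p", "wifi/720p"}
new_measurements = [
    {"condition": "wifi/720p", "predicted_mos": 3.9, "measured_mos": 4.0},
    {"condition": "5g/4k", "predicted_mos": 3.0, "measured_mos": 3.5},
    {"condition": "satellite/480p", "predicted_mos": 2.0, "measured_mos": 2.5},
]

held_out = unseen_only(new_measurements, seen_conditions)
print(f"{len(held_out)} unseen conditions, MAE = {mean_absolute_error(held_out):.2f}")
```

Filtering before scoring matters: including conditions the model has already seen would blur the question of whether it generalizes.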

Figures

Figures reproduced from arXiv: 2604.08703 by Ahmadreza Eslaminia, Kaizhuo Yan, Klara Nahrstedt, Lingzhi Zhao, Mingyuan Wu, Yingjie Yu.

**Figure 1.** Overview of the QoS-QoE Translation dataset construction pipeline. The pipeline begins with paper curation, followed …

**Figure 2.** An example JSON record from the QoS-QoE Trans…

**Figure 3.** Dataset analysis of QoS-QoE Translation. The first row shows key metadata distributions, including protocol, network …

Original abstract

QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross-scenario generalization, and large-scale analysis. To address this gap, we first introduce QoS-QoE Translation dataset, a source-grounded dataset of structured QoS-QoE relationships from the multimedia literature, with a focus on video streaming related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS-QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS-QoE translation, both before and after supervised fine-tuning on our dataset, and show strong performance on both continuous-value and discrete-label prediction in bidirectional translation, from QoS-QoE and QoE-QoS. Our dataset provides a foundation for benchmarking LLMs in QoS-QoE translation and for supporting future LLM-based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at https://yyu6969.github.io/qos-qoe-translation-page/, for full reproducibility and open access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a public dataset of QoS-QoE relationships from the literature and tests LLMs on bidirectional translation, but the automated extraction step lacks any reported accuracy checks.


The main takeaway is that this work releases a new structured dataset of QoS-QoE mappings focused on video streaming, built by extracting relationships from existing papers, and then shows that LLMs can be fine-tuned to predict in both directions between QoS values and QoE labels or scores. The dataset and code are public, which is the clearest practical step forward here. Prior studies on these mappings have been scattered and hard to reuse, so collecting them into one source-grounded resource addresses a real friction point for people doing multimedia systems work.

The LLM experiments add a layer by testing zero-shot versus fine-tuned performance on continuous and discrete tasks, which fits the current interest in applying language models to domain data. That part is straightforward and reproducible by design.

The extraction pipeline is the weak point. The description covers paper curation, automated relationship extraction, and iterative evaluation, yet the write-up reports no precision or recall numbers, no sample of human-checked extractions, and no error analysis of how well parameter definitions or evidence match the originals. Without that, it is hard to know how much noise is in the training signal, or whether the reported LLM performance reflects real relationships or artifacts of the pipeline. If extraction errors are common, the fine-tuning results become less informative. The performance claims themselves stay high-level in the abstract, with no baselines, splits, or variance details shown.

This paper is mainly for researchers in multimedia QoE, network systems, or LLM applications to technical prediction tasks who need a starting dataset rather than a finished model. A reader who wants to benchmark or extend LLM use in this area could get value from the released resource even if the current results need tightening.

It deserves peer review because the dataset construction is a concrete, shareable contribution that others can test and improve, though reviewers will likely press on the validation gap and ask for more concrete metrics on both the data quality and the model results.

Referee Report

2 major / 0 minor

Summary. The paper introduces the QoS-QoE Translation dataset, a structured collection of QoS-QoE relationships extracted from the multimedia literature (with emphasis on video streaming) via an automated pipeline of paper curation, relationship extraction, and iterative evaluation. Each entry includes the relationship, parameter definitions, supporting evidence, and metadata. The authors then evaluate LLMs on bidirectional QoS-to-QoE and QoE-to-QoS translation, for both continuous-value regression and discrete-label classification, before and after supervised fine-tuning on the new dataset, and report strong performance. The dataset and code are released publicly.

Significance. If the extraction pipeline faithfully captures literature relationships and the LLM evaluations are rigorously validated, the work would provide a useful open resource for multimedia quality modeling and could accelerate LLM-based reasoning for QoS-QoE mapping. The public release of data and code is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Abstract / Dataset Construction] Abstract and dataset-construction description: the automated pipeline (paper curation + QoS-QoE relationship extraction + iterative evaluation) is load-bearing for all downstream claims, yet no quantitative validation is reported (precision/recall against source papers, human verification sample size, inter-annotator agreement, or error analysis on parameter definitions and evidence spans). Without these, both dataset fidelity and the fine-tuning signal remain unverifiable.
[LLM Evaluation] LLM evaluation section: the headline claim of 'strong performance' on continuous and discrete bidirectional translation lacks any reported metrics, baselines, data splits, error bars, statistical tests, or ablation on fine-tuning details. This prevents assessment of whether the results exceed trivial baselines or are robust to evaluation choices.
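The "trivial baselines" the second comment refers to can be as simple as the two below: always predict the most common training label, or always predict the training mean. This is a minimal sketch with made-up data, not the paper's evaluation code; any model score would need to beat these numbers on the same split to be informative.

```python
from collections import Counter
from statistics import mean


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)


def mean_predictor_mae(train_values, test_values):
    """MAE obtained by always predicting the mean of the training values."""
    constant = mean(train_values)
    return mean(abs(value - constant) for value in test_values)


# Made-up labels and scores, for illustration only.
train_labels = ["good", "good", "bad"]
test_labels = ["good", "bad", "good", "good"]
train_mos = [2.0, 4.0]
test_mos = [3.0, 5.0]

print("majority-class accuracy:", majority_class_accuracy(train_labels, test_labels))
print("mean-predictor MAE:", mean_predictor_mae(train_mos, test_mos))
```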

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity and rigor that we will address in the revision. We respond to each major comment below.


Referee: [Abstract / Dataset Construction] Abstract and dataset-construction description: the automated pipeline (paper curation + QoS-QoE relationship extraction + iterative evaluation) is load-bearing for all downstream claims, yet no quantitative validation is reported (precision/recall against source papers, human verification sample size, inter-annotator agreement, or error analysis on parameter definitions and evidence spans). Without these, both dataset fidelity and the fine-tuning signal remain unverifiable.

Authors: We agree that quantitative validation of the automated pipeline is necessary to support claims about dataset quality. The manuscript describes the pipeline steps and notes that iterative evaluation was performed, but we did not report specific metrics such as precision/recall or human verification statistics. In the revised manuscript we will add a dedicated validation subsection that includes: (i) precision and recall computed against a held-out set of manually annotated source papers, (ii) the size of the human verification sample, (iii) inter-annotator agreement (Cohen’s kappa), and (iv) a brief error analysis focused on parameter definitions and evidence spans.

Revision: yes

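The validation statistics promised in this response are straightforward to compute once a manually annotated sample exists. The sketch below shows precision and recall over sets of relationship identifiers, and Cohen's kappa for two annotators; the identifiers and labels are invented, and how relationships would be matched to a gold standard is an assumption left open here.

```python
def precision_recall(extracted: set, gold: set) -> tuple[float, float]:
    """Precision and recall of extracted items against a manually annotated set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up identifiers and judgements, for illustration only.
extracted = {"r1", "r2", "r3", "r4"}
gold = {"r1", "r2", "r5"}
annotator_a = ["ok", "ok", "bad", "bad"]
annotator_b = ["ok", "bad", "bad", "bad"]

print("precision, recall:", precision_recall(extracted, gold))
print("kappa:", cohens_kappa(annotator_a, annotator_b))
```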
Referee: [LLM Evaluation] LLM evaluation section: the headline claim of 'strong performance' on continuous and discrete bidirectional translation lacks any reported metrics, baselines, data splits, error bars, statistical tests, or ablation on fine-tuning details. This prevents assessment of whether the results exceed trivial baselines or are robust to evaluation choices.

Authors: We acknowledge that the evaluation section would benefit from greater explicitness. While the manuscript reports performance numbers for both pre-trained and fine-tuned LLMs on the bidirectional tasks, we will expand it to include: (i) all concrete metrics (MAE, RMSE for regression; accuracy, F1 for classification), (ii) explicit baselines (e.g., linear regression for continuous values and majority-class or random baselines for discrete labels), (iii) the exact train/validation/test splits used, (iv) error bars or standard deviations across multiple random seeds, (v) statistical significance tests (paired t-tests or Wilcoxon tests), and (vi) a short ablation table on key fine-tuning hyperparameters.

Revision: yes
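Several of the items listed in this response reduce to a few lines of arithmetic. The sketch below computes RMSE, a mean and standard deviation across seeds, and the paired t statistic for per-seed differences. The per-seed scores are invented, and the p-value (which needs the t distribution with n-1 degrees of freedom) is deliberately left out to keep the example dependency-free.

```python
from math import sqrt
from statistics import mean, stdev


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    return sqrt(mean((p - a) ** 2 for p, a in zip(predicted, actual)))


def mean_and_std(scores):
    """Mean and sample standard deviation, e.g. of one metric across seeds."""
    return mean(scores), stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (same seeds or splits under two systems)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Made-up per-seed MAE values (lower is better), for illustration only.
baseline_mae = [0.8, 0.9, 1.0]
finetuned_mae = [0.5, 0.7, 0.6]

print("fine-tuned MAE mean/std:", mean_and_std(finetuned_mae))
print("paired t (baseline - fine-tuned):", paired_t_statistic(baseline_mae, finetuned_mae))
```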

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and LLM evaluation are self-contained


The paper introduces a source-grounded QoS-QoE dataset built via automated curation/extraction/evaluation pipeline and then reports LLM performance (pre- and post-fine-tuning) on continuous/discrete bidirectional translation tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citations invoked as uniqueness theorems appear in the provided text. Performance claims rest on standard held-out evaluation rather than any reduction to inputs by construction. The skeptic concern about missing extraction validation is a correctness/validity issue, not circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on validity of automated literature extraction and LLM generalization from the resulting dataset; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Literature papers contain extractable, structured QoS-QoE relationships that can be reliably parsed by an automated pipeline.
The dataset construction depends on this assumption for accuracy and completeness.

