pith. machine review for the scientific record.

arxiv: 2605.01272 · v1 · submitted 2026-05-02 · 💻 cs.CV · eess.IV


GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment


Pith reviewed 2026-05-09 14:59 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords gaming video quality · benchmark dataset · MOS ratings · multi-codec · video quality assessment · user-generated content · H.264/H.265/AV1

The pith

GameScope supplies the largest public dataset of gaming videos rated for quality across three codecs and two content types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GameScope to fill the gap in large, diverse subjective data for gaming video quality assessment. It contains 4,048 samples drawn from both user-generated and professional content, each encoded with H.264, H.265, or AV1 and scored by an average of 37 raters. In addition to overall mean opinion scores, the dataset supplies coarse-grained quality attributes that reveal which perceptual factors matter most. The authors then benchmark leading video quality assessment methods on the data and find that a vision-language model exceeds all others. This resource is intended to support the creation of assessment models that remain reliable no matter which codec a streaming platform chooses.
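
For reference, the MOS behind these scores is the plain average the paper's analysis of subjective data describes: for a video v with N raw opinion scores r_i,

    \mathrm{MOS}(v) = \frac{1}{N} \sum_{i=1}^{N} r_i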

Core claim

The authors present GameScope as the first dataset to comprehensively address gaming video quality assessment across multiple codecs and content types with quality attributes: 4,048 video samples, each annotated with an average of 37 mean opinion score ratings plus coarse-grained perceptual attributes.

What carries the argument

The GameScope dataset itself, a collection of 4,048 annotated gaming videos spanning UGC and PGC content, three codecs, and both overall MOS scores and coarse-grained quality attributes.

If this is right

  • Video quality models can now be evaluated for consistent performance across H.264, H.265, and AV1 encodings.
  • Coarse-grained attributes make it possible to isolate which visual factors drive perceived quality in games (see the correlation sketch after this list).
  • A vision-language model outperforms traditional no-reference and full-reference benchmarks on this data.
  • Streaming platforms gain a shared testbed for comparing codec choices under realistic gaming content.
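
A minimal sketch of that attribute analysis, assuming per-video attribute columns; the file name and column labels here are hypothetical, not GameScope's published schema:

    # Minimal sketch: rank-correlate each coarse-grained attribute with MOS.
    # "gamescope_annotations.csv" and the column names are assumed for
    # illustration; check the public release for the actual schema.
    import pandas as pd
    from scipy.stats import spearmanr

    ratings = pd.read_csv("gamescope_annotations.csv")
    for attr in ["sharpness", "blockiness", "fluency"]:  # assumed attribute names
        rho, p = spearmanr(ratings[attr], ratings["mos"])
        print(f"{attr}: SROCC = {rho:.3f} (p = {p:.2g})")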

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Platforms that adopt this dataset for training could reduce visible artifacts when switching codecs mid-stream.
  • The attribute annotations may support new research on how specific distortions affect player experience rather than just overall appeal.
  • Future expansions could add temporal or interactive quality metrics that current static ratings do not capture.

Load-bearing premise

The collected subjective MOS ratings and coarse-grained attributes are reliable, consistent across raters, and representative of real-world perceptual quality for diverse gaming content.

What would settle it

A follow-up study that re-rates a random subset of the videos with a fresh group of raters and obtains markedly different average scores, or that shows quality prediction models trained on GameScope perform no better than chance on an independent set of gaming streams.
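
A minimal sketch of the first check, assuming two MOS vectors over the same re-rated subset (original study vs. fresh raters):

    # Minimal sketch: agreement between original and replicated MOS vectors.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def replication_agreement(mos_original, mos_fresh):
        a = np.asarray(mos_original, dtype=float)
        b = np.asarray(mos_fresh, dtype=float)
        plcc, _ = pearsonr(a, b)    # linear agreement
        srocc, _ = spearmanr(a, b)  # rank agreement
        rmse = float(np.sqrt(np.mean((a - b) ** 2)))
        return {"PLCC": plcc, "SROCC": srocc, "RMSE": rmse}

Markedly low correlations or a large RMSE on the re-rated subset would be the disconfirming outcome described above.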

read the original abstract

The development of video game streaming has grown rapidly, with major platforms such as YouTube and Twitch using different codecs. To support quality assessment models that work consistently across any codec, it is necessary to have access to large, diverse subjective gaming quality datasets. Currently, there are only a few available, each having limitations. To address this gap, we present the largest gaming video quality dataset to date, incorporating both user-generated content (UGC) and professional-generated content (PGC) with extensive visual diversity. Our dataset covers the most widely used codecs - H.264, H.265, and AV1 - and consists of 4,048 video samples, each annotated by an average of 37 mean opinion score (MOS) ratings. In addition to overall quality scores, we collect coarse-grained quality attributes, enabling a better understanding of perceptual factors. We study the performance of leading video quality assessment methods on this dataset, including a vision language model that outperforms all the benchmarks. To the best of our knowledge, this is the first dataset that comprehensively addresses gaming video quality assessment across multiple codecs and content types with quality attributes. Our dataset is publicly available at https://rajeshsureddi.github.io/GameScope/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents GameScope, a dataset of 4,048 gaming video samples (UGC and PGC) encoded with H.264, H.265, and AV1, each annotated with an average of 37 MOS ratings plus coarse-grained quality attributes. It benchmarks leading VQA methods, reports that a vision-language model outperforms the others, and claims to be the first comprehensive multi-codec, multi-attribute gaming VQA dataset, with public release at the provided URL.

Significance. If the subjective ratings are shown to be reliable and the content selection demonstrably diverse, the dataset would provide a valuable public benchmark for developing codec-agnostic gaming VQA models, addressing limitations in prior smaller or single-codec collections.

major comments (3)
  1. [Methodology / Subjective Study] The methodology section provides no inter-rater reliability statistics (ICC, Krippendorff’s alpha, or outlier rejection rates) despite an average of 37 ratings per sample; without these, the stability of the MOS values and coarse attributes cannot be verified, directly weakening the claim that the dataset is a trustworthy benchmark.
  2. [Dataset Construction] Content selection criteria, diversity metrics (e.g., genre distribution, motion complexity, resolution statistics), and any quantitative validation of “extensive visual diversity” are not reported; this leaves the “comprehensive” and “largest to date” assertions unsupported by evidence.
  3. [Benchmarking Experiments] The benchmark comparison showing the VLM outperforming other methods lacks statistical significance tests, confidence intervals, or cross-validation details; the superiority claim therefore rests on point estimates alone.
minor comments (2)
  1. [Introduction] The abstract and introduction repeat the “to the best of our knowledge” claim without a concise related-work table comparing dataset sizes, codec coverage, and attribute granularity against prior gaming VQA collections.
  2. [Results] Figure captions and axis labels in the benchmark plots should explicitly state the number of test samples and whether results are averaged over multiple runs.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects for improving the clarity and rigor of our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of the dataset's reliability, diversity, and experimental validation.

read point-by-point responses
  1. Referee: [Methodology / Subjective Study] The methodology section provides no inter-rater reliability statistics (ICC, Krippendorff’s alpha, or outlier rejection rates) despite an average of 37 ratings per sample; without these, the stability of the MOS values and coarse attributes cannot be verified, directly weakening the claim that the dataset is a trustworthy benchmark.

    Authors: We agree that inter-rater reliability metrics are essential for establishing the trustworthiness of the MOS values. Although these statistics were not included in the original submission, the raw rating data are available to the authors. In the revised manuscript, we will add ICC, Krippendorff’s alpha, and details on outlier rejection procedures (e.g., any ratings more than two standard deviations from the mean) to the methodology section. These additions will directly address the concern and support the stability of the annotations. (A sketch of such a screening appears after these responses.) revision: yes

  2. Referee: [Dataset Construction] Content selection criteria, diversity metrics (e.g., genre distribution, motion complexity, resolution statistics), and any quantitative validation of “extensive visual diversity” are not reported; this leaves the “comprehensive” and “largest to date” assertions unsupported by evidence.

    Authors: We acknowledge that quantitative evidence for content diversity and selection criteria was not explicitly provided. We will revise the dataset construction section to include: (1) detailed content selection criteria (e.g., sampling strategy across UGC/PGC, game genres, and visual complexity), (2) diversity metrics such as genre distribution tables, motion complexity statistics (e.g., average optical flow magnitude), resolution and frame-rate histograms, and (3) quantitative validation supporting the claims of extensive visual diversity. These additions will provide the necessary evidence for the dataset's scope and comprehensiveness. (A sketch of one motion-complexity statistic appears after these responses.) revision: yes

  3. Referee: [Benchmarking Experiments] The benchmark comparison showing the VLM outperforming other methods lacks statistical significance tests, confidence intervals, or cross-validation details; the superiority claim therefore rests on point estimates alone.

    Authors: We recognize that relying solely on point estimates limits the strength of the performance claims. In the revised benchmarking experiments section, we will include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with appropriate corrections), 95% confidence intervals for all reported metrics, and clarification on any cross-validation procedures used. This will provide a more rigorous basis for the observed outperformance of the vision-language model. (A sketch of such a test appears after these responses.) revision: yes
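
To make the first response concrete, here is a minimal sketch of the two-standard-deviation screening it mentions; this is illustrative, not the paper's confirmed procedure (ref. [28] in the list below describes a more principled subject-behavior approach):

    # Minimal sketch: per-video MOS after rejecting ratings more than
    # `sigma` standard deviations from that video's mean.
    import numpy as np

    def screened_mos(per_video_ratings, sigma=2.0):
        mos = []
        for r in per_video_ratings:   # one array of raw scores per video
            r = np.asarray(r, dtype=float)
            keep = np.abs(r - r.mean()) <= sigma * r.std()
            mos.append(r[keep].mean())
        return np.array(mos)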
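For the second response, the promised motion-complexity statistic could be computed along these lines with OpenCV's Farneback estimator; the parameter values are generic defaults, not values from the paper:

    # Minimal sketch: mean optical-flow magnitude over a clip, a rough
    # motion-complexity proxy.
    import cv2
    import numpy as np

    def mean_flow_magnitude(video_path):
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        if not ok:
            return float("nan")
        prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        magnitudes = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(
                prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            magnitudes.append(float(mag.mean()))
            prev = gray
        cap.release()
        return float(np.mean(magnitudes)) if magnitudes else float("nan")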
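And for the third response, a minimal sketch of a paired significance test plus a bootstrap confidence interval; the metric and test choices are illustrative:

    # Minimal sketch: paired Wilcoxon test on per-video absolute errors of
    # two models, plus a bootstrap 95% CI for one model's SROCC against MOS.
    import numpy as np
    from scipy.stats import spearmanr, wilcoxon

    def compare_models(err_a, err_b, preds_a, mos, n_boot=1000, seed=0):
        _, p_value = wilcoxon(err_a, err_b)   # paired, non-parametric
        preds_a, mos = np.asarray(preds_a), np.asarray(mos)
        rng = np.random.default_rng(seed)
        n = len(mos)
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            rho, _ = spearmanr(preds_a[idx], mos[idx])
            boot.append(rho)
        low, high = np.percentile(boot, [2.5, 97.5])
        return p_value, (low, high)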

Circularity Check

0 steps flagged

No significant circularity in dataset contribution paper

full rationale

The paper introduces a new multi-codec gaming video quality dataset with 4,048 samples and MOS ratings but contains no mathematical derivations, equations, predictions, or self-referential fitting. Its central claim rests on the existence, diversity, and public release of the collected data rather than any reduction to inputs by construction. No load-bearing steps involve self-definition, fitted parameters renamed as predictions, or self-citation chains that justify uniqueness theorems. The skeptic's concern about missing inter-rater statistics addresses data reliability, not circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about subjective testing validity and dataset representativeness rather than new axioms or invented entities.

axioms (1)
  • domain assumption: Standard subjective video quality assessment protocols (MOS collection with multiple raters) produce reliable and generalizable perceptual scores.
    Invoked implicitly when reporting an average of 37 ratings per sample and using them to benchmark methods.

pith-pipeline@v0.9.0 · 5532 in / 1166 out tokens · 36428 ms · 2026-05-09T14:59:19.462565+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment

INTRODUCTION Video games are among the most popular forms of entertainment worldwide. The global gaming population is projected to see user penetration rise from 33.57% to 37.39%, reaching an estimated 3.04 billion users by 2030 [1]. Since these gamers are the primary source of streaming content, this projection directly reflects the expanding scale o...

  2. [2]

TLVQM [8] represents one of the earliest models designed for assessing consumer video quality in no-reference (NR) settings

RELATED WORK Numerous general-purpose video quality metrics have been proposed in recent years. TLVQM [8] represents one of the earliest models designed for assessing consumer video quality in no-reference (NR) settings. Subsequent approaches, such as RAPIQUE [9], combine Natural Scene Statistics (NSS) and Convolutional Neural Network (CNN) features, p...

  3. [3]

    very fast

DATASET The compiled dataset comprises 424 source content clips collected from 74 games, each with a duration of ten seconds. The video material was acquired from two distinct sources to represent varying quality tiers. The first subset constitutes User-Generated Content (UGC) covering widely played titles, obtained from YouTube under Creative Common...

  4. [4]

Master

SUBJECTIVE STUDY To align with the objective of characterizing gaming video quality in streaming scenarios, we conducted an online subjective study via the Amazon Mechanical Turk (AMT) platform. Experimental stimuli were organized into batches using Human Intelligence Tasks (HITs). Each batch was structured to contain a single source sequence alon...

  5. [5]

    To obtain the MOS on each video, a general procedure is to average all raw opinion scores

ANALYSIS OF SUBJECTIVE DATA Using the above subjective rating procedure, we collected 150,874 ratings with an average of 37 per video. To obtain the MOS on each video, a general procedure is to average all raw opinion scores. Specifically, given a video (v), and the number N of ratings for that video, with each subject raw score denoted as r_i, then MOS(...

  6. [6]

    We conducted a large-scale subjective study on AMT to derive reliable Mean Opinion Scores (MOS)

CONCLUSION We introduced the largest gaming video quality dataset to date, comprising both UGC and PGC across three codecs (H.264, H.265, and AV1) with extensive variations in content and resolution. We conducted a large-scale subjective study on AMT to derive reliable Mean Opinion Scores (MOS). Beyond standard scalar ratings, we collected granular quali...

  7. [7]

    Games-worldwide,

    Statista Market Insights, “Games-worldwide,” 2025

  8. [8]

    No-reference video quality estimation based on machine learning for passive gaming video streaming applications,

Nabajeet Barman, Emmanuel Jammeh, Seyed Ali Ghorashi, and Maria G Martini, “No-reference video quality estimation based on machine learning for passive gaming video streaming applications,” IEEE Access, vol. 7, pp. 74511–74527, 2019

  9. [9]

Perceptual quality assessment of UGC gaming videos,

Xiangxu Yu, Zhengzhong Tu, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C Bovik, “Perceptual quality assessment of UGC gaming videos,” arXiv preprint arXiv:2204.00128, 2022

  10. [10]

    GamingVideoSET: a dataset for gaming video streaming applications,

Nabajeet Barman, Saman Zadtootaghaj, Steven Schmidt, Maria G Martini, and Sebastian Möller, “GamingVideoSET: a dataset for gaming video streaming applications,” in 2018 16th Annual Workshop on Network and Systems Support for Games (NetGames). IEEE, 2018, pp. 1–6

  11. [11]

Subjective and objective analysis of streamed gaming videos,

Xiangxu Yu, Zhenqiang Ying, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C Bovik, “Subjective and objective analysis of streamed gaming videos,” IEEE Transactions on Games, vol. 16, no. 2, pp. 445–458, 2023

  12. [12]

    Study of subjective and objective quality assessment of mobile cloud gaming videos,

Avinab Saha, Yu-Chih Chen, Chase Davis, Bo Qiu, Xiaoming Wang, Rahul Gowda, Ioannis Katsavounidis, and Alan C Bovik, “Study of subjective and objective quality assessment of mobile cloud gaming videos,” IEEE Transactions on Image Processing, vol. 32, pp. 3295–3310, 2023

  13. [13]

GAMIVAL: video quality prediction on mobile cloud gaming content,

Yu-Chih Chen, Avinab Saha, Chase Davis, Bo Qiu, Xiaoming Wang, Rahul Gowda, Ioannis Katsavounidis, and Alan C Bovik, “GAMIVAL: video quality prediction on mobile cloud gaming content,” IEEE Signal Processing Letters, vol. 30, pp. 324–328, 2023

  14. [14]

Two-level approach for no-reference consumer video quality assessment,

Jari Korhonen, “Two-level approach for no-reference consumer video quality assessment,” IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 5923–5938, 2019

  15. [15]

    Efficient user-generated video quality prediction,

Zhengzhong Tu, Chia-Ju Chen, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C Bovik, “Efficient user-generated video quality prediction,” in 2021 Picture Coding Symposium (PCS). IEEE, 2021, pp. 1–5

  16. [16]

    UGC-VQA: benchmarking blind video quality assessment for user generated content,

Zhengzhong Tu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C Bovik, “UGC-VQA: benchmarking blind video quality assessment for user generated content,” IEEE Transactions on Image Processing, vol. 30, pp. 4449–4464, 2021

  17. [17]

Quality assessment of in-the-wild videos,

Dingquan Li, Tingting Jiang, and Ming Jiang, “Quality assessment of in-the-wild videos,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2351–2359

  18. [18]

Fast-VQA: efficient end-to-end video quality assessment with fragment sampling,

Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Fast-VQA: efficient end-to-end video quality assessment with fragment sampling,” in European Conference on Computer Vision. Springer, 2022, pp. 538–554

  19. [19]

Patch-VQ: ‘patching up’ the video quality problem,

Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik, “Patch-VQ: ‘patching up’ the video quality problem,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14019–14029

  20. [20]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20144–20154

  21. [21]

NDNetGaming - Development of a No-Reference Deep CNN for Gaming Video Quality Prediction,

Markus Utke, Saman Zadtootaghaj, Steven Schmidt, Sebastian Bosse, and Sebastian Moeller, “NDNetGaming - Development of a No-Reference Deep CNN for Gaming Video Quality Prediction,” in Multimedia Tools and Applications. Springer, 2020

  22. [22]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al., “LLaVA-OneVision-1.5: fully open framework for democratized multimodal training,” arXiv preprint arXiv:2509.23661, 2025

  23. [23]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  24. [24]

Q-Align: teaching LMMs for visual scoring via discrete text-defined levels

Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al., “Q-Align: teaching LMMs for visual scoring via discrete text-defined levels,” arXiv preprint arXiv:2312.17090, 2023

  25. [25]

    VQA-Thinker: Exploring generalizable and explainable video quality assessment via reinforcement learning,

Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Jun Jia, Kaiwei Zhang, Dandan Zhu, Guangtao Zhai, and Xiongkuo Min, “VQA-Thinker: Exploring generalizable and explainable video quality assessment via reinforcement learning,” arXiv preprint arXiv:2508.06051, 2025

  26. [26]

    Subjective video quality assessment methods for multimedia applications,

ITU-T, “Subjective video quality assessment methods for multimedia applications,” ITU-T Recommendation P.910, Apr. 2008

  27. [27]

    Broadcasting guidelines,

    Twitch, “Broadcasting guidelines,” 2025

  28. [28]

A simple model for subject behavior in subjective experiments,

Zhi Li, Christos G Bampis, Lukáš Krasula, Lucjan Janowski, and Ioannis Katsavounidis, “A simple model for subject behavior in subjective experiments,” arXiv preprint arXiv:2004.02067, 2020