pith. machine review for the scientific record.

arxiv: 2604.11082 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

Adrián Barahona-Ríos, Ashley Wiens, Benedict Wilkins, Cor-Paul Bezemer, Nabajeet Barman, Saman Zadtootaghaj, Yakun Yu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords: visual glitch detection · video games · vision-language models · reference-guided prompting · quality assurance · multi-frame analysis · frame comparison · automated testing
0 comments

The pith

Pairing each game frame with a reference frame from earlier in the same video improves vision-language model glitch detection at both frame and video levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RESP, a multi-frame prompting method that selects a reference frame from earlier in a gameplay video and feeds reference-test pairs to a vision-language model. This reframes glitch detection as a within-video comparison rather than isolated single-frame classification. The approach aggregates the resulting frame predictions into a video-level decision without any model fine-tuning. Experiments on one synthetic and two real datasets across five models show consistent gains in frame accuracy that carry through to more reliable video triage. The work targets the scaling problem of manual quality assurance as game test surfaces grow larger.

Core claim

Reference-guided sequential prompting, by selecting an earlier frame in the same video as a visual baseline and sequentially prompting the VLM with reference-test pairs, produces stronger frame-level glitch detection whose evidence aggregates into improved video-level decisions across multiple VLMs and both synthetic and real-world datasets.

What carries the argument

Reference-guided prompting, which pairs each test frame with a selected earlier frame from the same video to establish a baseline and convert detection into within-video comparison.
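
As a reading aid, here is a minimal sketch of that loop in Python, assuming a hypothetical classify_frame(reference, test) wrapper around whichever VLM API is in use. The last-clean-frame policy shown is inferred from the "LastCleanFrame" label in Figures 13–14, not a verified reimplementation of the paper's selection rule.

```python
# Minimal sketch of reference-guided sequential prompting (RESP-style).
# `classify_frame` is a hypothetical wrapper around any VLM API: it takes
# an optional reference image plus a test image and returns True when the
# model judges the test frame glitchy. The reference policy below (keep
# the most recent frame judged clean) is an assumption, not the paper's
# confirmed rule.

from typing import Any, Callable, List, Optional

def resp_frame_predictions(
    frames: List[Any],
    classify_frame: Callable[[Optional[Any], Any], bool],
) -> List[bool]:
    """Process keyframes in temporal order, pairing each test frame
    with the most recent frame the VLM judged glitch-free."""
    reference: Optional[Any] = None
    predictions: List[bool] = []
    for test_frame in frames:
        # With no earlier clean frame available, this degenerates to
        # single-frame prompting for the first test frame.
        is_glitchy = classify_frame(reference, test_frame)
        predictions.append(is_glitchy)
        if not is_glitchy:
            # Keep the latest clean frame as the visual baseline.
            reference = test_frame
    return predictions
```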

If this is right

  • Frame-level glitch detection accuracy rises when reference guidance is added to the prompting process.
  • Gains at the frame level transfer directly to more reliable video-level triage decisions under realistic conditions.
  • The method operates across five different VLMs with no requirement for fine-tuning.
  • Performance holds on both the synthetic RefGlitch dataset and two real-world gameplay collections.
  • Simple aggregation of the per-frame outputs produces stable video-level results (see the sketch after this list).
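
As a concrete reading of the last bullet, the simplest aggregation rule consistent with the summary is a thresholded vote over the per-frame predictions. The threshold value here is illustrative, not from the paper, which also reports a trained lightweight aggregator (Figure 7).

```python
# A minimal sketch of the simplest aggregation rule implied above:
# flag a video when the fraction of glitchy frame predictions crosses
# a threshold. The 0.2 cutoff is an illustrative guess.

from typing import List

def video_decision(frame_predictions: List[bool], threshold: float = 0.2) -> bool:
    """Aggregate noisy per-frame predictions into a video-level call."""
    if not frame_predictions:
        return False
    glitchy_fraction = sum(frame_predictions) / len(frame_predictions)
    return glitchy_fraction >= threshold
```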

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Game studios could integrate this style of reference prompting into existing automated pipelines to handle larger volumes of test footage.
  • The same within-video comparison idea might extend to detecting anomalies in other sequential visual data such as surveillance or robotics footage.
  • Combining the reference pairs with lightweight temporal smoothing could further reduce false positives from transient visual effects (see the sketch after this list).
  • Developers might use the improved frame scores to rank and prioritize glitch reports for human review.
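
A minimal sketch of the smoothing idea from the third bullet, purely editorial: a centered moving average over per-frame glitch scores before thresholding, so isolated spikes from particles or screen flashes are damped. The window size is a guess.

```python
# Illustrative only: smooth per-frame glitch scores with a short
# centered moving window so transient spikes are damped before any
# video-level thresholding.

from typing import List

def smooth_scores(scores: List[float], window: int = 3) -> List[float]:
    """Centered moving average over per-frame glitch scores."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed
```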

Load-bearing premise

A reference frame chosen from earlier in the video will reliably serve as a stable visual baseline even when normal scene changes, camera motion, or gameplay variations occur.
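
One cheap way an implementer might probe this premise, as an editorial sketch rather than anything in the paper: reject a candidate reference whose global appearance has drifted too far from the test frame. Frames are assumed to be numpy arrays; the bin count and correlation cutoff are illustrative guesses.

```python
# Sanity-check the "stable baseline" premise by comparing normalized
# grayscale histograms of the reference and test frames. A low Pearson
# correlation suggests a scene change that would make the reference a
# poor baseline. All parameter values are assumptions.

import numpy as np

def is_scene_matched(reference: np.ndarray, test: np.ndarray,
                     bins: int = 32, min_corr: float = 0.5) -> bool:
    """Return True if the two frames look globally similar enough."""
    def hist(img: np.ndarray) -> np.ndarray:
        gray = img.mean(axis=-1) if img.ndim == 3 else img
        h, _ = np.histogram(gray, bins=bins, range=(0, 255), density=True)
        return h
    corr = np.corrcoef(hist(reference), hist(test))[0, 1]
    return bool(corr >= min_corr)
```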

What would settle it

Running the same five VLMs on the RefGlitch dataset and the real-world videos, comparing single-frame prompting against reference-guided prompting, and finding no consistent frame-level accuracy gain would falsify the central claim.
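
As a sketch of how that comparison could be scored, the paired per-frame correctness of the two settings can be tested with an exact McNemar test on the discordant frames. This is an editorial illustration, not an analysis from the paper; noref_correct and ref_correct are hypothetical per-frame correctness vectors.

```python
# Illustrative falsification test: exact McNemar test on the frames
# where single-frame (NoRef) and reference-guided (Ref) prompting
# disagree in correctness. Names are hypothetical.

from typing import List
from scipy.stats import binomtest

def mcnemar_ref_vs_noref(noref_correct: List[bool],
                         ref_correct: List[bool]) -> float:
    """Two-sided p-value for the Ref vs. NoRef accuracy difference."""
    # b: frames Ref gets right and NoRef gets wrong; c: the reverse.
    b = sum(1 for n, r in zip(noref_correct, ref_correct) if r and not n)
    c = sum(1 for n, r in zip(noref_correct, ref_correct) if n and not r)
    if b + c == 0:
        return 1.0  # no discordant frames: settings indistinguishable
    return binomtest(b, n=b + c, p=0.5).pvalue
```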

Figures

Figures reproduced from arXiv: 2604.11082 by Adrián Barahona-Ríos, Ashley Wiens, Benedict Wilkins, Cor-Paul Bezemer, Nabajeet Barman, Saman Zadtootaghaj, Yakun Yu.

Figure 1
Figure 1. Reference/test pairs from the synthetic RefGlitch dataset. Top: a missing-object (parts of a character/object are unexpectedly absent) example with GPT-5’s judgments with (Ref) and without (NoRef) a reference. Bottom: paired examples for clipping (a character/object intersects or passes through a solid surface), floating (a character/object is not in contact with the surface it should rest on), corrupted… view at source ↗
Figure 2
Figure 2. Overview of our RESP framework that consists of three stages. (1) Keyframe extraction: we extract a compact set of representative frames. (2) Reference-guided sequential prompting: we process frames in temporal order and prompt a VLM to predict whether the current test frame is glitchy, conditioned on an additional reference frame selected from a reference pool of earlier frames. Each stored reference is … view at source ↗
Figure 3
Figure 3. Per-category behavior of the VLMs visualized as radar plots over six classes (Glitch-free, Missing object, Clipping, Floating, Corrupted texture, Lighting issue), with lines color-coded by settings. Please zoom in for a better view. view at source ↗
Figure 4
Figure 4. Example glitches present in the RefGlitch dataset. We generated five types of glitches: 1. Missing object: created by temporarily hiding parts of the player model during otherwise normal actions. We attached a script to the target body part and bound its visibility to a hotkey for smooth toggling. 2. Clipping: implemented via a toggleable “noclip” mode that disables character collisions while keeping the … view at source ↗
Figure 5
Figure 5. The default prompt given a reference/test pair. view at source ↗
Figure 6
Figure 6. The default prompt given a test frame only. view at source ↗
Figure 7
Figure 7. Impact of training set size on aggregation performance (violin plots across 1000 runs); part of the training-size ablation on how much labeled data the lightweight aggregator needs. view at source ↗
Figure 8
Figure 8. Example of “missing object” from the RefGlitch dataset: top row shows the reference/test frames, middle and bottom rows show VLM outputs without (NoRef) vs. with the reference frame (Ref). Figures 8–14 are examples including reference/test pairs and the corresponding outputs from our best closed-source and open-source VLMs with and without a reference frame. These cases help illustrate how providing a … view at source ↗
Figure 9
Figure 9. Example of “clipping” from the RefGlitch dataset: top row shows the reference/test frames, middle and bottom rows show VLM outputs without (NoRef) vs. with the reference frame (Ref). view at source ↗
Figure 10
Figure 10. Example of “floating” from the RefGlitch dataset: top row shows the reference/test frames, middle and bottom rows show VLM outputs without (NoRef) vs. with the reference frame (Ref). view at source ↗
Figure 11
Figure 11. Example of “corrupted texture” from the RefGlitch dataset: top row shows the reference/test frames, middle and bottom rows show VLM outputs without (NoRef) vs. with the reference frame (Ref). view at source ↗
Figure 12
Figure 12. Example of “lighting issue” from the RefGlitch dataset: top row shows the reference/test frames, middle and bottom rows show VLM outputs without (NoRef) vs. with the reference frame (Ref). view at source ↗
Figure 13
Figure 13. Example reference/test pair from the PhysGame dataset, with Qwen3-VL-8B outputs without a reference (NoRef) and with an automatically selected reference (LastCleanFrame). Correctly classifying this test frame is crucial for determining that the video contains a glitch. view at source ↗
Figure 14
Figure 14. Example reference/test pair from the VideoGameQA-Bench dataset, with Qwen3-VL-8B outputs without a reference (NoRef) and with an automatically selected reference (LastCleanFrame). Correctly classifying this test frame is crucial for determining that the video is glitch-free. view at source ↗
read the original abstract

Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at: https://github.com/PipiZong/RESP_code.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RESP, a reference-guided sequential prompting framework for visual glitch detection in video games using VLMs. For each test frame, a reference frame is selected from earlier in the same video to reframe detection as within-video comparison; frame-level VLM predictions are then aggregated into video-level decisions without any fine-tuning. A new synthetic RefGlitch dataset is introduced for controlled analysis, and experiments across five VLMs and three datasets (one synthetic, two real-world) are reported to show consistent gains in frame-level detection that transfer to improved video-level triage.

Significance. If the empirical claims hold under realistic gameplay variation, RESP offers a practical, training-free way to scale automated QA for game development by improving VLM robustness to scene changes. The release of the RefGlitch dataset, code, and data is a clear strength for reproducibility and further research in video anomaly detection.

major comments (2)
  1. [Abstract and Methods] The reference-frame selection rule is never explicitly stated (fixed temporal offset, similarity search, manual choice, etc.). This detail is load-bearing for the central claim that the reference supplies a stable baseline; without it, one cannot evaluate whether normal gameplay variations (camera motion, lighting shifts, character movement) between reference and test frames produce false positives that the aggregation step cannot filter, precisely the scenario the paper targets in realistic QA conditions.
  2. [Experiments and Results] The manuscript asserts that reference guidance 'consistently strengthens frame-level detection' and that this 'reliably transfers' to video-level triage, yet reports no statistical significance tests, confidence intervals, per-video variance, or ablation on sequences containing scene transitions. The absence of these controls leaves the transfer claim unsupported even if raw accuracy numbers improve. (A sketch of one such control follows below.)
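
A minimal sketch of one control the report asks for, editorial rather than from the paper: a percentile bootstrap confidence interval over per-video accuracy, which also surfaces per-video variance.

```python
# Percentile bootstrap CI for mean per-video accuracy. The resample
# count and alpha are illustrative defaults.

import random
from typing import List, Tuple

def bootstrap_ci(per_video_acc: List[float], n_boot: int = 10_000,
                 alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean per-video accuracy."""
    means = []
    for _ in range(n_boot):
        sample = random.choices(per_video_acc, k=len(per_video_acc))
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```
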
minor comments (2)
  1. [Abstract] The abstract mentions 'balanced coverage across five glitch types' in RefGlitch but the dataset construction details (how balance was enforced, labeling protocol) appear only in supplementary material; a brief summary in the main text would improve readability.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the exact reference-selection policy used for each reported number so readers can interpret the results without consulting the code repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional analyses.

read point-by-point responses
  1. Referee: [Abstract and Methods] The reference-frame selection rule is never explicitly stated (fixed temporal offset, similarity search, manual choice, etc.). This detail is load-bearing for the central claim that the reference supplies a stable baseline; without it, one cannot evaluate whether normal gameplay variations (camera motion, lighting shifts, character movement) between reference and test frames produce false positives that the aggregation step cannot filter, precisely the scenario the paper targets in realistic QA conditions.

    Authors: We agree that the reference-frame selection rule must be stated explicitly to allow proper evaluation of the baseline stability. Although the manuscript indicates selection from earlier in the same video, the specific rule was not sufficiently detailed. In the revised version, we have expanded the Methods section to explicitly describe the reference-frame selection rule employed in our experiments and include a discussion of its implications for handling normal gameplay variations and potential false positives. revision: yes

  2. Referee: [Experiments and Results] The manuscript asserts that reference guidance 'consistently strengthens frame-level detection' and that this 'reliably transfers' to video-level triage, yet reports no statistical significance tests, confidence intervals, per-video variance, or ablation on sequences containing scene transitions. The absence of these controls leaves the transfer claim unsupported even if raw accuracy numbers improve.

    Authors: We concur that statistical significance tests, confidence intervals, per-video variance, and ablations on scene transitions would provide stronger support for the claims. We have revised the Experiments and Results section to include these elements: statistical tests for the observed improvements, confidence intervals for the metrics, reporting of per-video performance variance, and an ablation study on sequences with scene transitions. These additions demonstrate that the frame-level gains do transfer to video-level triage under the tested conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical prompting method with no derivations or load-bearing self-citations

full rationale

The paper describes an empirical prompting framework (reference-guided sequential prompting) evaluated via experiments on synthetic and real-world datasets across multiple VLMs. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text or abstract. The core claim rests on experimental results showing improved detection from reference guidance, which is independently testable and does not reduce to a definitional or fitted tautology. The reference-selection assumption is a methodological choice open to ablation, not a circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on standard VLM prompting and dataset construction without additional postulated mechanisms.

pith-pipeline@v0.9.0 · 5565 in / 1014 out tokens · 71220 ms · 2026-05-10T16:00:04.298114+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    arXiv preprint arXiv:2504.15408 (2025)

    Backus, J.: Players’ perception of bugs and glitches in video games: An exploratory study. arXiv preprint arXiv:2504.15408 (2025)

  2. [2]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  3. [3]

    In: Proceedings of the AAAI conference on artificial intelligence

    Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., et al.: Graph of thoughts: Solving elaborate problems with large language models. In: Proceedings of the AAAI conference on artificial intelligence. vol. 38, pp. 17682–17690 (2024)

  4. [4]

    Physgame: Uncovering physical commonsense violations in gameplay videos

    Cao, M., Tang, H., Zhao, H., Guo, H., Liu, J., Zhang, G., Liu, R., Sun, Q., Reid, I., Liang, X.: Physgame: Uncovering physical commonsense violations in gameplay videos. arXiv preprint arXiv:2412.01800 (2024)

  5. [5]

    In: 2019 IEEE Conference on Games (CoG)

    Chang, K., Aytemiz, B., Smith, A.M.: Reveal-more: Amplifying human effort in quality assurance testing using automated exploration. In: 2019 IEEE Conference on Games (CoG). pp. 1–8. IEEE (2019)

  6. [6]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  7. [7]

    Advances in neural information processing systems 36, 49250–49267 (2023)

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36, 49250–49267 (2023)

  8. [8]

    https://ffmpeg.org/ (2026)

    Developers, F.: FFmpeg. https://ffmpeg.org/ (2026)

  9. [9]

    ACM Transactions on Software Engineering and Methodology 34 (09 2024)

    Guglielmi, E., Bavota, G., Oliveto, R., Scalabrino, S.: Automatic identification of game stuttering via gameplay videos analysis. ACM Transactions on Software Engineering and Methodology 34 (09 2024)

  10. [10]

    Empirical Software Engineering 28(6), 136 (2023)

    Guglielmi, E., Scalabrino, S., Bavota, G., Oliveto, R.: Using gameplay videos for detecting issues in video games. Empirical Software Engineering 28(6), 136 (2023)

  11. [11]

    In: International conference on machine learning

    Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International conference on machine learning. pp. 2790–2799. PMLR (2019)

  12. [12]

    Iclr 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr 1(2), 3 (2022)

  13. [13]

    In: International conference on machine learning

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

  14. [14]

    In: Findings of the Association for Computational Linguistics: NAACL 2024. pp. 1281–1309 (2024)

    Lee, B.W., Cho, H., Yoo, K.M.: Instruction tuning with human curriculum. In: Findings of the Association for Computational Linguistics: NAACL 2024. pp. 1281–1309 (2024)

  15. [15]

    Advances in neural information processing systems 34, 9694–9705 (2021)

    Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)

  16. [16]

    Empirical Software Engineering 24(1), 170–207 (2019)

    Lin, D., Bezemer, C.P., Zou, Y., Hassan, A.E.: An empirical study of game reviews on the steam platform. Empirical Software Engineering 24(1), 170–207 (2019)

  17. [17]

    In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment

    Ling, C., Tollmar, K., Gisslén, L.: Using deep convolutional neural networks to detect rendered glitches in video games. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. vol. 16, pp. 66–73 (2020)

  18. [18]

    Ministral 3

    Liu, A.H., Khandelwal, K., Subramanian, S., Jouault, V., Rastogi, A., Sadé, A., Jeffares, A., Jiang, A., Cahill, A., Gavaudan, A., et al.: Ministral 3. arXiv preprint arXiv:2601.08584 (2026)

  19. [19]

    arXiv preprint arXiv:2508.04895 (2025)

    Lu, W., Senchenko, A., Hindle, A., Bezemer, C.P.: Automated bug frame retrieval from gameplay videos using vision-language models. arXiv preprint arXiv:2508.04895 (2025)

  20. [20]

    Engineering Applications of Artificial Intelligence 166, 113497 (2026)

    Paduraru, C.: A state-aware, hierarchical deep learning framework for automated visual glitch detection in games. Engineering Applications of Artificial Intelligence 166, 113497 (2026)

  21. [21]

    In: Proceedings of the 38th International Conference on Machine Learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Res...

  22. [22]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

  23. [23]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Taesiri, M.R., Bezemer, C.P.: Videogamebunny: Towards vision assistants for video games. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1403–1413. IEEE (2025)

  24. [24]

    Glitchbench: Can large multimodal models detect video game glitches?

    Taesiri, M.R., Feng, T., Bezemer, C.P., Nguyen, A.: Glitchbench: Can large multimodal models detect video game glitches? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22444–22455 (2024)

  25. [25]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025)

    Taesiri, M.R., Ghildyal, A., Zadtootaghaj, S., Barman, N., Bezemer, C.P.: VideogameQA-bench: Evaluating vision-language models for video game quality assurance. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025)

  26. [26]

    In: Proceedings of the 19th International Conference on Mining Software Repositories

    Taesiri, M.R., Macklon, F., Bezemer, C.P.: Clip meets gamephysics: Towards bug identification in gameplay videos using zero-shot transfer learning. In: Proceedings of the 19th International Conference on Mining Software Repositories. pp. 270–281 (2022)

  27. [27]

    IEEE Transactions on Games 16(3), 697–710 (2024)

    Taesiri, M.R., Macklon, F., Habchi, S., Bezemer, C.P.: Searching bug instances in gameplay video repositories. IEEE Transactions on Games 16(3), 697–710 (2024)

  28. [28]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  29. [29]

    Team, G.: Gemma 3 technical report (2025)

  30. [30]

    arXiv preprint arXiv:2311.10926 (2023)

    Truelove, A., Rong, S., de Almeida, E.S., Ahmed, I.: Finding the needle in a haystack: Detecting bug occurrences in gameplay videos. arXiv preprint arXiv:2311.10926 (2023)

  31. [31]

    Advances in neural information processing systems 35, 24824–24837 (2022)

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

  32. [32]

    In: 2022 IEEE Conference on Games (CoG)

    Wilkins, B., Stathis, K.: World of bugs: A platform for automated bug detection in 3d video games. In: 2022 IEEE Conference on Games (CoG). pp. 520–523. IEEE (2022)

  33. [33]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Yu, Y., Qi, S.a., Li, B., Niu, D.: Peprec: Progressive enhancement of prompting for recommendation. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 17941–17953 (2024)

  34. [34]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

  35. [35]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)

  36. [36]

    Instruction tuning for large language models: A survey

    Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wang, G., et al.: Instruction tuning for large language models: A survey. ACM Computing Surveys 58(7), 1–36 (2026)

  37. [37]

    Clipping into Environment - Parts of the character or object are intersecting with solid objects like walls, floors, trees, or furniture

  38. [38]

    Floating Without Support - Characters or objects are visibly suspended in mid-air or hovering above surfaces with no physical contact or support

  39. [39]

    Deformed or Broken Model - Character models are in default poses (e.g., T-pose), unnaturally stretched, or otherwise malformed

  40. [40]

    Overlapping or Intersecting Characters - Multiple characters occupy the same space, overlapping or clipping into each other

  41. [41]

    Rendering / Texture / Visual Artifacts - Visual content fails to render correctly, causing missing textures, transparency issues, or broken models

  42. [42]

    Animation or Pose Errors - Characters are in inappropriate or frozen animations, not matching their context (e.g., giving a thumbs up when holding a gun)

  43. [43]

    Physics Glitches / Object Instability - Objects behave unrealistically, often flipping, tilting, or becoming unstable in ways that break immersion

  44. [44]

    Gameplay / Logic Errors - Problems with in-game logic, rules, or asset assignments that break intended behavior

  45. [45]

    UI / Interaction Anomalies - Issues where user interface elements, HUD prompts, icons, or interaction mechanics behave incorrectly.