pith · machine review for the scientific record

arxiv: 2604.19905 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models

Aldeida Aleti, Chunyang Chen, Dingbang Wang, Nikola Tomic, Sidong Feng, Tingting Yu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:47 UTC · model grok-4.3

classification 💻 cs.SE
keywords bug reproduction · GUI videos · vision-language models · automated testing · screen recordings · software maintenance · action segmentation · state comparison

The pith

ViBR reproduces bugs from GUI screen videos by using CLIP to segment actions and vision-language models to compare states for replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViBR as a method to automatically convert user-submitted GUI screen recordings of bugs into executable reproductions on the developer's machine. It does this without building app-specific models or requiring explicit touch markers by first using CLIP embedding similarity to divide the video into distinct action segments and then applying vision-language models to perform region-aware comparisons between the recorded states and the live app. A sympathetic reader would care because video bug reports are becoming common yet hard to act on, and successful automation would let developers spend less time manually stepping through recordings. The evaluation shows this pipeline reproduces 72 percent of the tested recordings while beating prior heuristic-based and graph-based approaches.

Core claim

ViBR segments input GUI videos into action boundaries via CLIP-based embedding similarity, then uses vision-language models for region-aware GUI state comparison to guide step-by-step replay of the observed bug, reaching a 72 percent reproduction success rate on the collected recordings.

What carries the argument

The ViBR pipeline that detects action boundaries with CLIP embedding similarity and drives replay through vision-language model comparisons of GUI regions.
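
To make the mechanism concrete, here is a minimal sketch of consecutive-frame CLIP similarity segmentation, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; the checkpoint, frame-sampling stride, and similarity threshold are illustrative assumptions, not values taken from the paper.

```python
# Sketch: segment a GUI recording into action scenes by finding dips in
# CLIP-embedding similarity between consecutive sampled frames.
# Assumed (not from the paper): checkpoint, 30-frame stride, 0.92 threshold.
import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_embeddings(video_path, stride=30):
    """Embed every stride-th frame with CLIP's image tower, unit-normalized."""
    cap, embs, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            inputs = processor(images=rgb, return_tensors="pt")
            with torch.no_grad():
                emb = model.get_image_features(**inputs)
            embs.append(emb / emb.norm(dim=-1, keepdim=True))
        idx += 1
    cap.release()
    return torch.cat(embs)

def action_boundaries(embs, threshold=0.92):
    """A dip in cosine similarity between consecutive frames marks a boundary."""
    sims = (embs[:-1] * embs[1:]).sum(dim=-1)  # cosine: rows are unit-norm
    return [i + 1 for i, s in enumerate(sims.tolist()) if s < threshold]
```

A fixed threshold is the simplest possible choice; an adaptive one (for example, relative to a rolling mean of similarities) would be more robust to animation-heavy apps.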

If this is right

  • Developers can obtain working bug reproductions directly from video reports instead of watching and manually recreating each one.
  • Methods that depend on pre-built UI transition graphs or instrumented apps become less necessary for routine bug handling.
  • The volume of video-based bug reports that can be processed increases without proportional growth in manual effort.
  • Heuristic image-processing techniques for replay are replaced by model-driven state comparison in many cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same segmentation-plus-comparison pattern could extend to reproducing issues in desktop or web applications from screen recordings.
  • Higher-resolution region detection in future vision-language models would likely raise the reproduction rate further on complex interfaces.
  • Embedding ViBR-style replay into bug trackers would let reporters submit videos and receive confirmation of reproduction automatically.

Load-bearing premise

Pre-trained vision-language models can accurately detect action boundaries in arbitrary GUI videos and compare states across apps without app-specific fine-tuning or explicit touch indicators.
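
To illustrate what region-aware state comparison might look like in practice, the sketch below asks a hosted vision-language model whether a recorded frame and a live screenshot show the same functional GUI state; the model name, prompt wording, and yes/no protocol are assumptions for illustration (Figure 4 shows the authors' own prompting example), not the paper's actual prompts.

```python
# Sketch: VLM-based GUI state comparison. Model and prompt are assumed;
# the paper's real prompts and output parsing may differ.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def _data_url(path):
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def same_gui_state(recorded_png, live_png):
    """Return True if the VLM judges the two screenshots functionally equal."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed; any image-capable chat model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                 "Compare these two app screenshots region by region "
                 "(title bar, content area, dialogs, keyboard). Do they show "
                 "the same functional GUI state despite cosmetic differences? "
                 "Answer YES or NO, then give one reason."},
                {"type": "image_url", "image_url": {"url": _data_url(recorded_png)}},
                {"type": "image_url", "image_url": {"url": _data_url(live_png)}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```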

What would settle it

If, on a new test collection of 100 GUI bug videos from previously unseen apps, ViBR reproduced fewer than half the cases, that would show the approach does not generalize at the reported level.

Figures

Figures reproduced from arXiv: 2604.19905 by Aldeida Aleti, Chunyang Chen, Dingbang Wang, Nikola Tomic, Sidong Feng, Tingting Yu.

Figure 1: GUI comparison between recording and device. Despite their growing prevalence, reproducing bugs from GUI recordings remains a manual and error-prone process. Developers must watch the raw footage to infer user actions and the involved GUI elements, which is often ambiguous and time-consuming. This process becomes even harder due to cross-device inconsistencies, where the same GUI functionality may appear w…

Figure 2: The overview of ViBR. We conduct comprehensive evaluations of ViBR across the three phases of our approach. First, we evaluate our approach in segmenting the action scenes from 75 GUI recordings that are widely studied in prior work, achieving up to 87%, 85%, and 86% in precision, recall, and F1-score, respectively. Second, we assess our approach in identifying functional consistency between the rec…

Figure 3: An illustration of consecutive frame similarity.

Figure 4: The example of prompting GUI state comparison.

Figure 5: The example of prompting bug replay on device.

Figure 6: Examples of failure cases of state-of-the-art baseline GIFdroid in action boundary segmentation.

Figure 7: Examples of failure cases of baselines and ablation studies in GUI comparison.

Figure 8: Examples of failure cases of our approach.
original abstract

Bug reports play a critical role in software maintenance by helping users convey encountered issues to developers. Recently, GUI screen capture videos have gained popularity as a bug reporting artifact due to their ease of use and ability to retain rich contextual information. However, automatically reproducing bugs from such recordings remains a significant challenge. Existing methods often rely on fragile image-processing heuristics, explicit touch indicators, or pre-constructed UI transition graphs, which require non-trivial instrumentation and app-specific setup. This paper presents ViBR, a lightweight and fully automated approach that reproduces bugs directly from GUI recordings. Specifically, ViBR combines CLIP-based embedding similarity for action boundary segmentation with Vision-Language Models (VLMs) for region-aware GUI state comparison and guided bug replay. Experimental results show that ViBR successfully reproduces 72% of bug recordings, significantly outperforming state-of-the-art baselines and ablation variants.
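
For context on the replay side, a single step on a live Android device could be driven with the uiautomator2 Python wrapper, as in the hedged sketch below; the normalized-coordinate convention and helper names are assumptions, not ViBR's actual action representation.

```python
# Sketch: execute one replay step on an adb-connected Android device.
# The fraction-of-screen coordinate convention is an assumption.
import uiautomator2 as u2

d = u2.connect()  # default adb device or emulator

def replay_tap(norm_x, norm_y):
    """Tap at (norm_x, norm_y), given as fractions of screen width/height."""
    w, h = d.window_size()
    d.click(int(norm_x * w), int(norm_y * h))

def capture_state(path="live_state.png"):
    """Screenshot the live app so a VLM can compare it to the recording."""
    d.screenshot(path)
    return path
```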

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents ViBR, a lightweight automated system for reproducing bugs directly from GUI screen-capture videos. It segments action boundaries via CLIP embedding similarity and uses off-the-shelf Vision-Language Models for region-aware GUI state comparison to drive replay, without requiring touch indicators or app-specific instrumentation. The central empirical claim is a 72% reproduction success rate that significantly outperforms state-of-the-art baselines and ablation variants.

Significance. If the performance claims are substantiated, the work would offer a practical advance in software maintenance by enabling fully automated replay from the increasingly common artifact of user-submitted video bug reports. The avoidance of pre-built UI graphs or explicit instrumentation is a clear strength relative to prior approaches.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the headline 72% reproduction rate is presented without any accompanying dataset size, number of recordings, application diversity statistics, or statistical significance tests. This directly undermines assessment of whether the result generalizes beyond the evaluated cases or is driven by easy instances.
  2. [Approach] Approach section (CLIP-based segmentation): the method assumes pre-trained CLIP embeddings reliably detect action boundaries in raw GUI videos lacking touch cues, yet no precision, recall, or boundary-error metrics are reported for this component. Because segmentation errors propagate directly to state comparison and replay, this omission is load-bearing for the overall success-rate claim (a sketch of such boundary metrics follows this report).
  3. [Evaluation] Evaluation section (VLM state comparison): the region-aware comparison step relies on off-the-shelf VLMs without app-specific fine-tuning or explicit touch indicators, but no ablation isolating segmentation accuracy from comparison accuracy, nor error analysis of false-positive/negative state matches, is provided. This leaves the outperformance over baselines difficult to attribute.
minor comments (1)
  1. [Abstract] The abstract would benefit from a concise statement of dataset characteristics (e.g., number of apps, video lengths, bug types) to allow readers to contextualize the 72% figure immediately.
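
As a concrete reading of the metrics requested in major comment 2, here is a minimal sketch of tolerance-windowed precision, recall, and F1 for boundary detection against annotated ground truth; the five-frame tolerance is an arbitrary illustration, not the paper's evaluation protocol.

```python
# Sketch: boundary-detection precision/recall/F1. A prediction counts as
# correct if it lands within `tol` frames of a still-unmatched true
# boundary. The tolerance value is an illustrative assumption.
def boundary_prf(predicted, truth, tol=5):
    matched, used = 0, set()
    for p in predicted:
        hit = next((t for t in truth if t not in used and abs(p - t) <= tol), None)
        if hit is not None:
            matched += 1
            used.add(hit)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: two of three predictions fall within 5 frames of a true boundary.
print(boundary_prf([30, 95, 200], [32, 120, 198]))  # -> approx (0.667, 0.667, 0.667)
```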

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. We have revised the manuscript to incorporate additional context, metrics, and analyses where the comments identify gaps in the original presentation.

point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the headline 72% reproduction rate is presented without any accompanying dataset size, number of recordings, application diversity statistics, or statistical significance tests. This directly undermines assessment of whether the result generalizes beyond the evaluated cases or is driven by easy instances.

    Authors: We agree that the abstract would benefit from summarizing key evaluation details already present in the Evaluation section. We have revised the abstract to include a concise description of the dataset (number of recordings and application diversity) along with reference to the statistical significance tests reported in the Evaluation section. This provides readers with immediate context for assessing the 72% figure without altering the core claim. revision: yes

  2. Referee: [Approach] Approach section (CLIP-based segmentation): the method assumes pre-trained CLIP embeddings reliably detect action boundaries in raw GUI videos lacking touch cues, yet no precision, recall, or boundary-error metrics are reported for this component. Because segmentation errors propagate directly to state comparison and replay, this omission is load-bearing for the overall success-rate claim.

    Authors: We acknowledge the value of standalone metrics for the segmentation component. In the revised manuscript, we have added precision, recall, and boundary-error metrics for the CLIP-based action boundary detection, computed against manually annotated ground truth on the evaluation videos. These metrics demonstrate the effectiveness of pre-trained CLIP embeddings for GUI videos without touch indicators and include discussion of how residual segmentation errors are handled downstream. revision: yes

  3. Referee: [Evaluation] Evaluation section (VLM state comparison): the region-aware comparison step relies on off-the-shelf VLMs without app-specific fine-tuning or explicit touch indicators, but no ablation isolating segmentation accuracy from comparison accuracy, nor error analysis of false-positive/negative state matches, is provided. This leaves the outperformance over baselines difficult to attribute.

    Authors: We agree that isolating component contributions and providing error analysis improves attribution. The revised Evaluation section now includes additional ablation experiments that separately disable or modify the segmentation and region-aware VLM comparison steps. We have also added a categorized error analysis of false-positive and false-negative state matches, identifying common causes such as visual similarity between GUI states. These changes clarify the sources of ViBR's performance gains relative to the baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external baselines with no self-referential derivations or fitted predictions.

full rationale

The paper describes an empirical system (ViBR) that applies off-the-shelf CLIP embeddings for action segmentation and VLMs for state comparison, then reports a 72% reproduction rate from direct experiments on bug recordings. No equations, parameters, or first-principles derivations are present that reduce the reported success metric to quantities fitted from the evaluation data itself. The evaluation is framed as comparison to state-of-the-art baselines and ablation variants, which are external to the method. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes for the core pipeline. The approach is therefore self-contained against external benchmarks; any concerns about generalization of pre-trained models pertain to correctness or assumption strength, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that current vision-language models possess sufficient GUI understanding for state comparison; no free parameters are introduced or fitted in the described method, and no new entities are postulated.

axioms (1)
  • domain assumption: Pre-trained CLIP and vision-language models can reliably segment actions and compare GUI states from video frames without additional training or instrumentation.
    The pipeline directly invokes these models for boundary detection and state comparison.

pith-pipeline@v0.9.0 · 5458 in / 1243 out tokens · 44213 ms · 2026-05-10T01:47:49.682139+00:00 · methodology

