pith. machine review for the scientific record.

arxiv: 2604.17820 · v1 · submitted 2026-04-20 · 💻 cs.SE

Recognition: unknown

Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords: Scratch · automated assessment · video analysis · large language models · block-based programming · educational technology · program evaluation

The pith

Raven assesses Scratch programs by having large language models analyze videos of their executions against shared task rules instead of writing per-program tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that task-level video rules plus LLM video analysis can deliver accurate automated grading for Scratch despite the wide variety of student implementations. This matters because Scratch programs are event-driven and visually grounded, so traditional code assertions or fixed outputs break easily and force most classrooms to rely on slow manual review. Instructors define once what a correct run should look like when recorded as video; Raven then watches each submission's execution and checks for a behavioral match. Evaluation on 13 real assignments covering more than 140 submissions shows clear gains in accuracy and consistency over earlier tools, and a separate classroom trial with students and instructors finds the approach usable in practice.
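
To make that workflow concrete, here is a minimal sketch, in Python, of how a shared task-level rule set and a per-submission check might be wired together. All names here (`VideoRule`, `record_execution`, `ask_vlm`) are hypothetical illustrations of the idea described above, not Raven's actual API.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the components described above; Raven's real
# interfaces are not specified in this review, so these are illustrative only.

@dataclass
class VideoRule:
    """One instructor-specified, task-level behavioral criterion."""
    rule_id: str
    description: str           # e.g. "the cat sprite turns around and says hello"
    interaction_script: list   # key presses / clicks to perform while recording

# Defined once per assignment, shared by every student submission.
TASK_RULES = [
    VideoRule("R1", "the sprite greets the user after the green flag is clicked",
              interaction_script=["click_green_flag"]),
    VideoRule("R2", "the sprite turns 180 degrees before greeting",
              interaction_script=["click_green_flag"]),
]

def grade_submission(sb3_path: str, record_execution, ask_vlm) -> dict:
    """Record one execution video per rule and ask a vision-language model
    whether the observed behavior satisfies the rule's description.
    record_execution and ask_vlm are injected, hypothetical callables."""
    verdicts = {}
    for rule in TASK_RULES:
        video = record_execution(sb3_path, rule.interaction_script)
        answer = ask_vlm(video, f"Does this run satisfy: {rule.description}? Answer yes or no.")
        verdicts[rule.rule_id] = answer.strip().lower().startswith("yes")
    return verdicts
```

The point of the sketch is the asymmetry the paper relies on: the rules are written once per task, while the grading loop is the same for every submission regardless of how the student implemented the behavior.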

Core claim

Raven replaces program-specific state assertions with instructor-specified, task-level video generation rules shared across all submissions. It integrates large language models with video analysis to evaluate whether a program's observed visual and interactive behaviors satisfy grading criteria without requiring explicit test cases or predefined outputs. This design enables consistent evaluation despite substantial diversity in implementation strategies and interaction sequences, as shown by higher accuracy and robustness on 13 real assignments with over 140 ground-truth labeled submissions plus positive results from a classroom study.

What carries the argument

Instructor-specified task-level video generation rules combined with large language model analysis of execution videos to check behavioral compliance.
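
Most current vision-language model APIs consume images rather than raw video, so one plausible way to implement the video-analysis half is to sample frames from the recorded run and pass them to the model together with the rule text. A minimal sketch: the frame sampling uses the real OpenCV API, while `judge_frames` stands in for whatever VLM client is used and is not Raven's actual interface.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_n: int = 15) -> list:
    """Keep every Nth frame of an execution recording to bound VLM input size."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def check_rule(video_path: str, rule_text: str, judge_frames) -> bool:
    """judge_frames(frames, question) -> str is a hypothetical VLM call."""
    frames = sample_frames(video_path)
    question = f"Do these frames, in order, show that {rule_text}? Answer yes or no."
    return judge_frames(frames, question).strip().lower().startswith("yes")
```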

Load-bearing premise

Large language models can watch videos of program runs and correctly decide whether they match the behavioral criteria without introducing systematic errors or biases across different student coding styles.
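
A common way to hedge against this premise failing on any single query is to ask the model several times and keep the majority verdict. The sketch below assumes a boolean `check_video` callable and is an illustration of that general mitigation, not the paper's exact procedure.

```python
from collections import Counter

def majority_verdict(check_video, video, rule_text, runs: int = 5) -> bool:
    """Query the (stochastic) video checker several times and return the most
    common yes/no verdict, reducing the variance of any single model call."""
    votes = Counter(bool(check_video(video, rule_text)) for _ in range(runs))
    return votes.most_common(1)[0][0]
```

Repetition reduces variance but not systematic bias, which is why the premise above remains load-bearing.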

What would settle it

A collection of new Scratch submissions where Raven's grades differ from the consensus of several human graders on the same video-based criteria.
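
That test is easy to operationalize: take each new submission's independent human labels, form a consensus, and measure how often Raven departs from it. A minimal sketch, assuming binary pass/fail verdicts per submission; `cohen_kappa_score` is the standard scikit-learn chance-corrected agreement measure.

```python
from statistics import mode
from sklearn.metrics import cohen_kappa_score

def disagreement_report(raven_grades, human_grade_sets):
    """raven_grades: list of 0/1 verdicts, one per submission.
    human_grade_sets: list of lists, the independent human verdicts per submission
    (use an odd number of graders so the majority label is unambiguous)."""
    consensus = [mode(votes) for votes in human_grade_sets]   # majority label
    disagreements = [i for i, (r, h) in enumerate(zip(raven_grades, consensus)) if r != h]
    kappa = cohen_kappa_score(raven_grades, consensus)        # chance-corrected agreement
    return {"n_disagreements": len(disagreements),
            "indices": disagreements,
            "kappa": kappa}
```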

Figures

Figures reproduced from arXiv: 2604.17820 by Daming Li, Donglin Li, Hanyuan Shi, Jialu Zhang.

Figure 1. An example Scratch project in which a cat sprite turns around to say hello.

Figure 2. Reducing the number of steps taken from 10 to 5 is a reasonable change in real-world evaluations.

Figure 3. Visual features – such as firework shapes, color transitions, and spatial distribution – can only be …

Figure 4. In a “Math Pea Shooter” game, Whisker fails to provide the correct inputs required to trigger the specific logic for shooting zombies with shells.

Figure 5. In real classroom settings, many Scratch assignments allow flexibility in the choices, costumes, and …

Figure 6. Architecture of the Raven Framework. The system evaluates projects via a dual-track pipeline (Logic and Video) organized into two stages. (1) Task Configuration Stage: the instructor initializes an Unconfigured Assignment (orange-yellow box) to define specific Task Requirements, Logic Grading Rules, and Video Grading Rules (light yellow boxes). (2) Student Submission and Evaluation Stage: a student’s Scratch …

Figure 7. Prompt for the logic_checker.

Figure 8. Prompt for the video_checker.

Figure 9. Agreement between Raven and human instructor scores as shown in scatterplots with 146 submissions. A jittering effect with a magnitude of 1.12% of the data range is applied on overlapping data points for visualization. Each scatterplot (Raven against each instructor) exhibits strong correlations with the fitted line close to y = x. (Panel: Raw Scores Grading Results, Raven vs. Human Instructors.)

Figure 10. Survey results collected after a live classroom study conducted at an after-school education center.
Original abstract

Block-based programming environments such as Scratch are widely used in introductory computing education, yet scalable and reliable automated assessment remains elusive. Scratch programs are highly heterogeneous, event-driven, and visually grounded, which makes traditional assertion-based or test-based grading brittle and difficult to scale. As a result, assessment in real Scratch classrooms still relies heavily on manual inspection and delayed feedback, introducing inconsistency across instructors and limiting scalability. We present Raven, an automated assessment framework for Scratch that replaces program-specific state assertions with instructor-specified, task-level video generation rules shared across all student submissions. Raven integrates large language models with video analysis to evaluate whether a program's observed visual and interactive behaviors satisfy grading criteria, without requiring explicit test cases or predefined outputs. This design enables consistent evaluation despite substantial diversity in implementation strategies and interaction sequences. We evaluate Raven on 13 real Scratch assignments comprising over 140 student submissions with ground-truth labels from human graders. The results show that Raven significantly outperforms prior automated assessment tools in both grading accuracy and robustness across diverse programming styles. A classroom study with 30 students and 10 instructors further demonstrates strong user acceptance and practical applicability. Together, these findings highlight the effectiveness of task-level behavioral abstractions for scalable assessment of open-ended, event-driven programs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Raven, an automated assessment framework for Scratch programs that replaces program-specific assertions with instructor-specified task-level video generation rules. It integrates LLMs with video analysis to evaluate observed behaviors against grading criteria. Evaluation on 13 real assignments with over 140 student submissions (human ground-truth labels) claims significant outperformance over prior tools in accuracy and robustness across diverse styles; a classroom study with 30 students and 10 instructors reports strong user acceptance.

Significance. If the empirical results hold, the shift to task-level behavioral abstractions via video-grounded LLM analysis addresses a core limitation of brittle, code-specific testing in heterogeneous, event-driven block-based environments. This could enable more scalable and consistent automated feedback in introductory computing education. The classroom study provides practical validation beyond lab metrics.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that Raven 'significantly outperforms prior automated assessment tools' is central but unsupported by reported metrics, baseline details, error analysis, or statistical tests (e.g., no precision/recall/F1 breakdowns, no p-values, no confusion matrices). Without these, the strength of the outperformance and robustness claims cannot be verified from the given results.
  2. [Raven Framework] Raven Framework / Methods: the integration of LLMs for video analysis of compliance with behavioral rules lacks concrete details on prompting strategies, decision criteria for 'satisfaction,' handling of edge cases in event-driven executions, or validation of LLM outputs against human judgments. This is load-bearing for reproducibility and for ruling out systematic biases in the weakest assumption (reliable LLM video analysis across heterogeneous implementations).
minor comments (2)
  1. [Abstract] Abstract: specify the exact performance metrics (e.g., accuracy, F1) and the identities of the 'prior automated assessment tools' to strengthen the summary of results.
  2. [Classroom Study] The classroom study description would benefit from more detail on the protocol, survey instruments, and quantitative acceptance measures to allow readers to assess the strength of the 'strong user acceptance' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of task-level video-grounded evaluation for Scratch assessment. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that Raven 'significantly outperforms prior automated assessment tools' is central but unsupported by reported metrics, baseline details, error analysis, or statistical tests (e.g., no precision/recall/F1 breakdowns, no p-values, no confusion matrices). Without these, the strength of the outperformance and robustness claims cannot be verified from the given results.

    Authors: We agree that the evaluation would be more convincing with additional granularity. The current manuscript reports overall accuracy gains and robustness across 13 assignments with human ground-truth labels, but does not include the requested breakdowns. In the revised version we will add per-assignment and aggregate precision/recall/F1 scores, confusion matrices, explicit baseline implementations and their results, a categorized error analysis of disagreements, and statistical significance tests (e.g., McNemar’s test with p-values) to substantiate the outperformance claims. revision: yes

  2. Referee: [Raven Framework] Raven Framework / Methods: the integration of LLMs for video analysis of compliance with behavioral rules lacks concrete details on prompting strategies, decision criteria for 'satisfaction,' handling of edge cases in event-driven executions, or validation of LLM outputs against human judgments. This is load-bearing for reproducibility and for ruling out systematic biases in the weakest assumption (reliable LLM video analysis across heterogeneous implementations).

    Authors: We concur that these details are essential for reproducibility. The revised Methods section will specify the prompting strategies (including chain-of-thought and few-shot examples), the precise decision criteria for rule satisfaction (e.g., LLM confidence thresholds or voting schemes), explicit handling of event-driven edge cases such as timing variability and multi-sprite interactions, and a validation experiment comparing LLM judgments to human annotations on a held-out video sample. We will also discuss observed biases and mitigation steps. revision: yes
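
The granularity promised in the first response is straightforward to compute once per-submission verdicts from Raven, a baseline, and the ground truth are available. A hedged sketch using standard scikit-learn and SciPy calls, illustrative rather than the authors' evaluation code:

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def compare_graders(y_true, y_raven, y_baseline):
    """Per-tool accuracy metrics plus an exact McNemar test on paired errors."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_raven, average="binary")
    cm = confusion_matrix(y_true, y_raven)

    raven_ok = np.asarray(y_raven) == np.asarray(y_true)
    base_ok = np.asarray(y_baseline) == np.asarray(y_true)
    b = int(np.sum(raven_ok & ~base_ok))   # Raven right, baseline wrong
    c = int(np.sum(~raven_ok & base_ok))   # baseline right, Raven wrong
    # Exact McNemar: two-sided binomial test on the discordant pairs.
    p_value = binomtest(min(b, c), b + c, 0.5).pvalue if (b + c) else 1.0

    return {"precision": p, "recall": r, "f1": f1,
            "confusion_matrix": cm, "mcnemar_p": p_value}
```

For the second response, the kind of prompt structure described (chain-of-thought reasoning, a format-only worked example, and a machine-parseable verdict) might look like the following generic illustration; it is not the authors' actual prompt.

```python
# A generic rule-satisfaction prompt template; not taken from the paper.
RULE_CHECK_PROMPT = """You are grading a recorded run of a Scratch project.

Rule: {rule_description}

Example (for format only):
Rule: the sprite says "hello" after the green flag is clicked.
Reasoning: the sprite is idle until second 2, then a speech bubble with "hello" appears.
Verdict: YES

Now analyze the attached frames step by step, then end with exactly one line:
Verdict: YES or Verdict: NO
"""

def parse_verdict(model_output: str) -> bool:
    """Accept the run only if the final verdict line is an explicit YES."""
    lines = [ln for ln in model_output.splitlines()
             if ln.strip().lower().startswith("verdict:")]
    return bool(lines) and lines[-1].split(":", 1)[1].strip().upper().startswith("YES")
```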

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is an empirical systems paper whose central claims rest on direct comparison of Raven's outputs against independent human ground-truth labels for 140 student submissions across 13 assignments, plus a separate classroom study. No mathematical derivations, parameter fittings, or self-referential predictions appear in the provided text; evaluation metrics are computed from external annotations rather than any internal construction that reduces to the paper's own inputs. The design choices (video-grounded task-level rules) are presented as engineering decisions validated by the external benchmark, with no load-bearing self-citations or ansatzes that collapse the result to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems and HCI paper. No mathematical free parameters, formal axioms, or newly invented physical entities are described in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1038 out tokens · 29745 ms · 2026-05-10T04:49:41.175220+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 60 canonical work pages · 6 internal anchors

  1. [1] Dimah Al-Fraihat, Mike Joy, Ra’ed Masa’deh, and Jane Sinclair. 2020. Evaluating E-learning systems success: An empirical study. Computers in Human Behavior 102 (2020), 67–86. doi:10.1016/j.chb.2019.08.004
  2. [2] Kirsti M Ala-Mutka. 2005. A Survey of Automated Assessment Approaches for Programming Assignments. Computer Science Education 15, 2 (2005), 83–102. doi:10.1080/08993400500150747
  3. [3] Sinan Ariyurek, Aysu Betin-Can, and Elif Surer. 2021. Automated Video Game Testing Using Synthetic and Humanlike Agents. IEEE Transactions on Games 13, 1 (2021), 50–67. doi:10.1109/TG.2019.2947597
  4. [4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. 2025. Qwen3-VL Technical Report. arXiv:2511.21631 [cs.CV] https://arxiv.org/abs/2511.21631
  5. [5] Shuai Bai, Keqin Chen, Xuejing Liu, et al. 2025. Qwen2.5-VL Technical Report. arXiv:2502.13923 [cs.CV] https://arxiv.org/abs/2502.13923
  6. [6] Mohammad Bajammal, Andrea Stocco, Davood Mazinanian, and Ali Mesbah. 2022. A Survey on the Use of Computer Vision to Improve Software Engineering Tasks. IEEE Transactions on Software Engineering 48, 5 (2022), 1722–1742. doi:10.1109/TSE.2020.3032986
  7. [7] Ishan Banerjee, Bao Nguyen, Vahid Garousi, and Atif Memon. 2013. Graphical user interface (GUI) testing: Systematic mapping and repository. Information and Software Technology 55, 10 (2013), 1679–1694. doi:10.1016/j.infsof.2013.03.004
  8. [8] Bryce Boe, Charlotte Hill, Michelle Len, Greg Dreschler, Phillip Conrad, and Diana Franklin. 2013. Hairball: lint-inspired static analysis of scratch projects. In Proceeding of the 44th ACM Technical Symposium on Computer Science Education (Denver, Colorado, USA) (SIGCSE ’13). Association for Computing Machinery, New York, NY, USA, 215–220. doi:10.1145/2445...
  9. [9] Janet Carter, Kirsti Ala-Mutka, Ursula Fuller, Martin Dick, John English, William Fone, and Judy Sheard. 2003. How shall we assess this?. In Working Group Reports from ITiCSE on Innovation and Technology in Computer Science Education (Thessaloniki, Greece) (ITiCSE-WGR ’03). Association for Computing Machinery, New York, NY, USA, 107–123. doi:10.1145/960875.960539
  10. [10] Cecilia Ka Yuk Chan and Wenjie Hu. 2023. Students’ voices on generative AI: perceptions, benefits, and challenges in higher education. International Journal of Educational Technology in Higher Education 20, 1 (July 2023), 43. doi:10.1186/s41239-023-00411-8
  11. [11] Li-Hsin Chang and Filip Ginter. 2024. Automatic Short Answer Grading for Finnish with ChatGPT. Proceedings of the AAAI Conference on Artificial Intelligence 38, 21 (Mar. 2024), 23173–23181. doi:10.1609/aaai.v38i21.30363
  12. [12] Tsung-Hsiang Chang, Tom Yeh, and Robert C. Miller. 2010. GUI testing using computer vision. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA) (CHI ’10). Association for Computing Machinery, New York, NY, USA, 1535–1544. doi:10.1145/1753326.1753555
  13. [13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374. https://arxiv.org/abs/2107.03374
  14. [14] Fred D. Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q. 13, 3 (Sept. 1989), 319–340. doi:10.2307/249008
  15. [15] Adina Deiner, Patric Feldmeier, Gordon Fraser, Sebastian Schweikl, and Wengran Wang. 2023. Automated Test Generation for Scratch Programs. Empirical Software Engineering 28, 1 (2023), 79. doi:10.1007/s10664-022-10255-x
  16. [16] Adina Deiner, Christoph Frädrich, Gordon Fraser, Sophia Geserer, and Niklas Zantner. 2020. Search-based Testing for Scratch Programs. CoRR abs/2009.04115 (2020). arXiv:2009.04115 https://arxiv.org/abs/2009.04115
  17. [17] Adina Deiner and Gordon Fraser. 2024. NuzzleBug: Debugging Block-Based Programs in Scratch. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE ’24). 1–2. doi:10.1145/3597503.3623331
  18. [18] Benedikt Fein, Florian Obermüller, and Gordon Fraser. 2022. CATNIP: An Automated Hint Generation Tool for Scratch. In Proceedings of the 27th ACM Conference on Innovation and Technology in Computer Science Education Vol. 1 (Dublin, Ireland) (ITiCSE ’22). Association for Computing Machinery, New York, NY, USA, 124–130. doi:10.1145/3502718.3524820
  19. [19] Andrina Granić and Nikola Marangunić. 2019. Technology acceptance model in educational context: A systematic literature review. British Journal of Educational Technology 50, 5 (2019), 2572–2593. doi:10.1111/bjet.12864
  20. [20] Christian Grévisse. 2024. LLM-based automatic short answer grading in undergraduate medical education. BMC Medical Education 24, 1 (Sept. 2024), 1060. doi:10.1186/s12909-024-06026-5
  21. [21] Jialiang Gu, Keren Zhou, Daming Li, Hanyuan Shi, and Jialu Zhang. 2026. Context-Aware Feedback Compression in Online Judge Programming with LLMs. In Proceedings of the 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE Companion ’26) (Montreal, QC, Canada, 5–9 July 2026). ...
  22. [22] Katharina Götz, Patric Feldmeier, and Gordon Fraser. 2022. Model-based Testing of Scratch Programs. arXiv:2202.06271 [cs.SE] https://arxiv.org/abs/2202.06271
  23. [23] Felienne Hermans and Efthimia Aivaloglou. 2016. Do code smells hamper novice programming? A controlled experiment on Scratch programs. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). 1–10. doi:10.1109/ICPC.2016.7503706
  24. [24] Silas Hsu, Tiffany Wenting Li, Zhilin Zhang, Max Fowler, Craig Zilles, and Karrie Karahalios. 2021. Attitudes Surrounding an Imperfect AI Autograder (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 681, 15 pages. doi:10.1145/3411764.3445424
  25. [25] Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, et al. 2024. CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 6...
  26. [26] Hieke Keuning, Bastiaan Heeren, and Johan Jeuring. 2017. Code Quality Issues in Student Programs. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE ’17). Association for Computing Machinery, New York, NY, USA, 110–115. doi:10.1145/3059009.3059061
  27. [27] Hieke Keuning, Johan Jeuring, and Bastiaan Heeren. 2019. A Systematic Literature Review of Automated Feedback Generation for Programming Exercises. ACM Transactions on Computing Education 19, 1 (2019), 3:1–3:43. doi:10.1145/3231711
  28. [28] Juho Leinonen, Paul Denny, Stephen MacNeil, et al. 2023. Comparing Code Explanations Created by Students and Large Language Models. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2023). Association for Computing Machinery, New York, NY, USA, 124–130. doi:10.1145/3587102.3588785
  29. [29] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. DroidBot: a lightweight UI-Guided test input generator for android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). 23–26. doi:10.1109/ICSE-C.2017.8
  30. [30] Xiaoyun Liang, Jiayi Qi, Yongqiang Gao, Chao Peng, and Ping Yang. 2023. AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA) (ESEC/FSE 2023). Association for Computing Machinery, Ne...
  31. [31] Qianou Ma, Hua Shen, Kenneth Koedinger, and Sherry Tongshuang Wu. 2024. How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging. Springer Nature Switzerland, 265–279. doi:10.1007/978-3-031-64302-6_19
  32. [32] John Maloney, Mitchel Resnick, Natalie Rusk, Brian Silverman, and Evelyn Eastmond. 2010. The Scratch Programming Language and Environment. 10, 4, Article 16 (Nov. 2010), 15 pages. doi:10.1145/1868358.1868363
  33. [33] Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2024. Automated Grading and Feedback Tools for Programming Education: A Systematic Review. 24, 1, Article 10 (Feb. 2024), 43 pages. doi:10.1145/3636515
  34. [34] Jesús Moreno-León and Gregorio Robles. 2015. Dr. Scratch: a Web Tool to Automatically Evaluate Scratch Projects. In Proceedings of the Workshop in Primary and Secondary Computing Education (London, United Kingdom) (WiPSCE ’15). Association for Computing Machinery, New York, NY, USA, 132–133. doi:10.1145/2818314.2818338
  35. [35] Tushar Nagarajan and Kristen Grauman. 2021. Shaping embodied agent behavior with activity-context priors from egocentric video. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 29794–29805. https://proceedings.neurips.cc/paper_files/p...
  36. [36] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 97, 13 pages. doi:10.1145/3597503.3639187
  37. [37] Seong-Guk Nam and Yeong-Seok Seo. 2023. GUI Component Detection-Based Automated Software Crash Diagnosis. Electronics 12, 11 (2023). doi:10.3390/electronics12112382
  38. [38] Bao N. Nguyen, Bryan Robbins, Ishan Banerjee, and Atif Memon. 2014. GUITAR: an innovative tool for automated testing of GUI-driven software. Automated Software Engineering 21, 1 (March 2014), 65–105. doi:10.1007/s10515-013-0128-9
  39. [39] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.org/abs/2303.08774
  40. [40] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007. Feedback-Directed Random Test Generation. In 29th International Conference on Software Engineering (ICSE ’07). 75–84. doi:10.1109/ICSE.2007.37
  41. [41] José Carlos Paiva, José Paulo Leal, and Álvaro Figueira. 2022. Automated Assessment in Computer Science Education: A State-of-the-Art Review. ACM Trans. Comput. Educ. 22, 3, Article 34 (June 2022), 40 pages. doi:10.1145/3513140
  42. [42] Raphael Pham, Helge Holzmann, Kurt Schneider, and Christian Brüggemann. 2014. Tailoring video recording to support efficient GUI testing and debugging. Software Quality Journal 22, 2 (June 2014), 273–292. doi:10.1007/s11219-013-9206-2
  43. [43] Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. Proceedings of the 16th International Conference on Educational Data Mining (EDM 2023) (2023), 370–377. doi:10.5281/zenodo.8115653
  44. [44] Thomas W. Price, Yihuan Dong, and Dragan Lipovac. 2017. iSnap: Towards Intelligent Tutoring in Novice Programming Environments. In SIGCSE. 483–488. doi:10.1145/3017680.3017762
  45. [45] Mitchel Resnick. 2017. Lifelong Kindergarten: Cultivating Creativity through Projects, Passion, Peers, and Play. MIT Press, Cambridge, MA. https://mitpress.mit.edu/9780262037297/lifelong-kindergarten/
  46. [46] Mitchel Resnick, John Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evelyn Eastmond, Karen Brennan, Amon Millner, Eric Rosenbaum, Jay Silver, Brian Silverman, and Yasmin Kafai. 2009. Scratch: programming for all. Commun. ACM 52, 11 (Nov. 2009), 60–67. doi:10.1145/1592761.1592779
  47. [47] Zachary P. Reynolds, Abhinandan B. Jayanth, Ugur Koc, et al. 2017. Identifying and Documenting False Positive Patterns Generated by Static Code Analysis Tools. In 2017 IEEE/ACM 4th International Workshop on Software Engineering Research and Industrial Practice (SER&IP). 55–61. doi:10.1109/SER-IP.2017.20
  48. [48] Kelly Rivers and Kenneth R. Koedinger. 2017. Data-Driven Hint Generation in Vast Solution Spaces: A Self-Improving Python Programming Tutor. International Journal of Artificial Intelligence in Education 27, 1 (2017), 37–64. doi:10.1007/s40593-015-0070-z
  49. [49] Marcos Román-González, Jesús Moreno-León, and Gregorio Robles. 2017. Complementary Tools for Computational Thinking Assessment. https://www.researchgate.net/publication/318469859_Complementary_Tools_for_Computational_Thinking_Assessment
  50. [50] Mark Santolucito, Jialu Zhang, Ennan Zhai, Jürgen Cito, and Ruzica Piskac. 2022. Learning CI Configuration Correctness for Early Build Feedback. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 1006–1017. doi:10.1109/SANER53432.2022.00118
  51. [51] Ronny Scherer, Fazilat Siddiq, and Jo Tondeur. 2019. The technology acceptance model (TAM): A meta-analytic structural equation modeling approach to explaining teachers’ adoption of digital technology in education. Computers & Education 128 (2019), 13–35. doi:10.1016/j.compedu.2018.09.009
  52. [52] Sebastian Schweikl and Gordon Fraser. 2025. RePurr: Automated Repair of Block-Based Learners’ Programs. Proc. ACM Softw. Eng. 2, FSE, Article FSE067 (June 2025), 24 pages. doi:10.1145/3715786
  53. [53] Scratch Foundation. 2026. Scratch Statistics - Scratch Imagine, Program, Share. https://scratch.mit.edu/statistics/. Accessed: 2026-01-29
  54. [54] Yuan Si, Simeng Han, Daming Li, Hanyuan Shi, and Jialu Zhang. 2026. ScratchEval: A Multimodal Evaluation Framework for LLMs in Block-Based Programming. arXiv:2602.00757 [cs.SE] https://arxiv.org/abs/2602.00757
  55. [55] Yuan Si, Daming Li, Hanyuan Shi, and Jialu Zhang. 2025. ViScratch: Using Large Language Models and Gameplay Videos for Automated Feedback in Scratch. arXiv:2509.11065 [cs.SE] https://arxiv.org/abs/2509.11065
  56. [56] Yuan Si, Kyle Qi, Daming Li, Hanyuan Shi, and Jialu Zhang. 2025. Stitch: Step-by-step LLM Guided Tutoring for Scratch. arXiv:2510.26634 [cs.SE] https://arxiv.org/abs/2510.26634
  57. [57] Yuan Si, Ming Wang, Daming Li, Hanyuan Shi, and Jialu Zhang. 2026. EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback. arXiv:2603.29624 [cs.SE] https://arxiv.org/abs/2603.29624
  58. [58] Andreas Stahlbauer, Marvin Kreis, and Gordon Fraser. 2019. Testing scratch programs automatically. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Tallinn, Estonia) (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 165–175. doi:10.11...
  59. [59] Niko Strijbol, Robbe De Proft, Klaas Goethals, Bart Mesuere, Peter Dawyndt, and Christophe Scholliers. 2024. Blink: An educational software debugger for Scratch. SoftwareX 25 (2024), 101617. doi:10.1016/j.softx.2023.101617
  60. [60] Shao-Hua Sun, Hyeonwoo Noh, Sriram Somasundaram, and Joseph Lim. 2018. Neural Program Synthesis from Diverse Demonstration Videos. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4790–4799. https://proceedings.mlr.press/v80/sun18a.html
  61. [61] Xiaodan Tang, Yue Yin, Qiao Lin, Roxana Hadad, and Xiaoming Zhai. 2020. Assessing computational thinking: A systematic review of empirical studies. Computers & Education 148 (2020), 103798. doi:10.1016/j.compedu.2019.103798
  62. [62] Peng Wang, Shuai Bai, Sinan Tan, et al. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://arxiv.org/abs/2409.12191
  63. [63] Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. 2024. VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models. arXiv:2406.16338 [cs.CV] https://arxiv.org/abs/2406.16338
  64. [64] An Yang, Anfeng Li, Baosong Yang, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388
  65. [65] Zhengyuan Yang, Linjie Li, Kevin Lin, et al. 2023. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv:2309.17421 [cs.CV] https://arxiv.org/abs/2309.17421
  66. [66] Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology (Victoria, BC, Canada) (UIST ’09). Association for Computing Machinery, New York, NY, USA, 183–192. doi:10.1145/1622176.1622213
  67. [67] Jialu Zhang, José Pablo Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen.
  68. [68] PyDex: Repairing Bugs in Introductory Python Assignments using LLMs. Proc. ACM Program. Lang. 8, OOPSLA1 (2024), 1100–1124. doi:10.1145/3649850
  69. [69] Jialu Zhang, Jialiang Gu, Wangmeiyu Zhang, José Pablo Cambronero, John Kolesar, Ruzica Piskac, Daming Li, and Hanyuan Shi. 2025. A Systematic Study of Time Limit Exceeded Errors in Online Programming Assignments. arXiv:2510.14339 [cs.SE] https://arxiv.org/abs/2510.14339
  70. [70] Jialu Zhang, De Li, John Charles Kolesar, Hanyuan Shi, and Ruzica Piskac. 2023. Automated Feedback Generation for Competition-Level Code. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (Rochester, MI, USA) (ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 13, 13 pages. doi:10.1145/35513...
  71. [71] Jialu Zhang, Todd Mytkowicz, Mike Kaufman, Ruzica Piskac, and Shuvendu K. Lahiri. 2022. Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper). In ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18–22, 2022, Sukyoung Ryu and Yannis Smaragd...
  72. [72] Yan Zheng, Xiaofei Xie, Ting Su, Lei Ma, et al. 2019. Wuji: Automatic Online Combat Game Testing Using Evolutionary Deep Reinforcement Learning. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 772–784. doi:10.1109/ASE.2019.00077