pith. machine review for the scientific record.

arxiv: 2604.17820 · v1 · submitted 2026-04-20 · 💻 cs.SE

Recognition: unknown

Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords: Scratch · automated assessment · video analysis · large language models · block-based programming · educational technology · program evaluation

The pith

Raven assesses Scratch programs by having large language models analyze videos of their executions against shared task rules instead of writing per-program tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that task-level video rules plus LLM video analysis can deliver accurate automated grading for Scratch despite the wide variety of student implementations. This matters because Scratch programs are event-driven and visually grounded, so traditional code assertions or fixed outputs break easily and force most classrooms to rely on slow manual review. Instructors define once what a correct run should look like when recorded as video; Raven then watches each submission's execution and checks for a behavioral match. Evaluation on 13 real assignments covering more than 140 submissions shows clear gains in accuracy and consistency over earlier tools, and a separate classroom trial with students and instructors finds the approach usable in practice.
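
To make that workflow concrete, here is a minimal sketch, in Python, of how a shared task-level rule set and a per-submission check might be wired together. All names here (`VideoRule`, `record_execution`, `ask_vlm`) are hypothetical illustrations of the idea described above, not Raven's actual API.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the components described above; Raven's real
# interfaces are not specified in this review, so these are illustrative only.

@dataclass
class VideoRule:
    """One instructor-specified, task-level behavioral criterion."""
    rule_id: str
    description: str           # e.g. "the cat sprite turns around and says hello"
    interaction_script: list   # key presses / clicks to perform while recording

# Defined once per assignment, shared by every student submission.
TASK_RULES = [
    VideoRule("R1", "the sprite greets the user after the green flag is clicked",
              interaction_script=["click_green_flag"]),
    VideoRule("R2", "the sprite turns 180 degrees before greeting",
              interaction_script=["click_green_flag"]),
]

def grade_submission(sb3_path: str, record_execution, ask_vlm) -> dict:
    """Record one execution video per rule and ask a vision-language model
    whether the observed behavior satisfies the rule's description.
    record_execution and ask_vlm are injected, hypothetical callables."""
    verdicts = {}
    for rule in TASK_RULES:
        video = record_execution(sb3_path, rule.interaction_script)
        answer = ask_vlm(video, f"Does this run satisfy: {rule.description}? Answer yes or no.")
        verdicts[rule.rule_id] = answer.strip().lower().startswith("yes")
    return verdicts
```

The point of the sketch is the asymmetry the paper relies on: the rules are written once per task, while the grading loop is the same for every submission regardless of how the student implemented the behavior.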

Core claim

Raven replaces program-specific state assertions with instructor-specified, task-level video generation rules shared across all submissions. It integrates large language models with video analysis to evaluate whether a program's observed visual and interactive behaviors satisfy grading criteria without requiring explicit test cases or predefined outputs. This design enables consistent evaluation despite substantial diversity in implementation strategies and interaction sequences, as shown by higher accuracy and robustness on 13 real assignments with over 140 ground-truth labeled submissions plus positive results from a classroom study.

What carries the argument

Instructor-specified task-level video generation rules combined with large language model analysis of execution videos to check behavioral compliance.
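
Most current vision-language model APIs consume images rather than raw video, so one plausible way to implement the video-analysis half is to sample frames from the recorded run and pass them to the model together with the rule text. A minimal sketch: the frame sampling uses the real OpenCV API, while `judge_frames` stands in for whatever VLM client is used and is not Raven's actual interface.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_n: int = 15) -> list:
    """Keep every Nth frame of an execution recording to bound VLM input size."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def check_rule(video_path: str, rule_text: str, judge_frames) -> bool:
    """judge_frames(frames, question) -> str is a hypothetical VLM call."""
    frames = sample_frames(video_path)
    question = f"Do these frames, in order, show that {rule_text}? Answer yes or no."
    return judge_frames(frames, question).strip().lower().startswith("yes")
```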

Load-bearing premise

Large language models can watch videos of program runs and correctly decide whether they match the behavioral criteria without introducing systematic errors or biases across different student coding styles.
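
A common way to hedge against this premise failing on any single query is to ask the model several times and keep the majority verdict. The sketch below assumes a boolean `check_video` callable and is an illustration of that general mitigation, not the paper's exact procedure.

```python
from collections import Counter

def majority_verdict(check_video, video, rule_text, runs: int = 5) -> bool:
    """Query the (stochastic) video checker several times and return the most
    common yes/no verdict, reducing the variance of any single model call."""
    votes = Counter(bool(check_video(video, rule_text)) for _ in range(runs))
    return votes.most_common(1)[0][0]
```

Repetition reduces variance but not systematic bias, which is why the premise above remains load-bearing.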

What would settle it

A collection of new Scratch submissions where Raven's grades differ from the consensus of several human graders on the same video-based criteria.
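
That test is easy to operationalize: take each new submission's independent human labels, form a consensus, and measure how often Raven departs from it. A minimal sketch, assuming binary pass/fail verdicts per submission; `cohen_kappa_score` is the standard scikit-learn chance-corrected agreement measure.

```python
from statistics import mode
from sklearn.metrics import cohen_kappa_score

def disagreement_report(raven_grades, human_grade_sets):
    """raven_grades: list of 0/1 verdicts, one per submission.
    human_grade_sets: list of lists, the independent human verdicts per submission
    (use an odd number of graders so the majority label is unambiguous)."""
    consensus = [mode(votes) for votes in human_grade_sets]   # majority label
    disagreements = [i for i, (r, h) in enumerate(zip(raven_grades, consensus)) if r != h]
    kappa = cohen_kappa_score(raven_grades, consensus)        # chance-corrected agreement
    return {"n_disagreements": len(disagreements),
            "indices": disagreements,
            "kappa": kappa}
```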

Figures

Figures reproduced from arXiv: 2604.17820 by Daming Li, Donglin Li, Hanyuan Shi, Jialu Zhang.

Figure 1. An example Scratch project in which a cat sprite turns around to say hello.

Figure 2. Reducing the number of steps taken from 10 to 5 is a reasonable change in real-world evaluations.

Figure 3. Visual features – such as firework shapes, color transitions, and spatial distribution – can only be …

Figure 4. In a “Math Pea Shooter” game, Whisker fails to provide the correct inputs required to trigger the specific logic for shooting zombies with shells.

Figure 5. In real classroom settings, many Scratch assignments allow flexibility in the choices, costumes, and …

Figure 6. Architecture of the Raven Framework. The system evaluates projects via a dual-track pipeline (Logic and Video) organized into two stages. (1) Task Configuration Stage: the instructor initializes an Unconfigured Assignment (orange-yellow box) to define specific Task Requirements, Logic Grading Rules, and Video Grading Rules (light yellow boxes). (2) Student Submission and Evaluation Stage: a student’s Scratch …

Figure 7. Prompt for the logic_checker.

Figure 8. Prompt for the video_checker.

Figure 9. Agreement between Raven and human instructor scores as shown in scatterplots with 146 submissions. A jittering effect with a magnitude of 1.12% of the data range is applied on overlapping data points for visualization. Each scatterplot (Raven against each instructor) exhibits strong correlations with the fitted line close to y = x. (Panel: Raw Scores Grading Results, Raven vs. Human Instructors.)

Figure 10. Survey results collected after a live classroom study conducted at an after-school education center.
Original abstract

Block-based programming environments such as Scratch are widely used in introductory computing education, yet scalable and reliable automated assessment remains elusive. Scratch programs are highly heterogeneous, event-driven, and visually grounded, which makes traditional assertion-based or test-based grading brittle and difficult to scale. As a result, assessment in real Scratch classrooms still relies heavily on manual inspection and delayed feedback, introducing inconsistency across instructors and limiting scalability. We present Raven, an automated assessment framework for Scratch that replaces program-specific state assertions with instructor-specified, task-level video generation rules shared across all student submissions. Raven integrates large language models with video analysis to evaluate whether a program's observed visual and interactive behaviors satisfy grading criteria, without requiring explicit test cases or predefined outputs. This design enables consistent evaluation despite substantial diversity in implementation strategies and interaction sequences. We evaluate Raven on 13 real Scratch assignments comprising over 140 student submissions with ground-truth labels from human graders. The results show that Raven significantly outperforms prior automated assessment tools in both grading accuracy and robustness across diverse programming styles. A classroom study with 30 students and 10 instructors further demonstrates strong user acceptance and practical applicability. Together, these findings highlight the effectiveness of task-level behavioral abstractions for scalable assessment of open-ended, event-driven programs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Raven, an automated assessment framework for Scratch programs that replaces program-specific assertions with instructor-specified task-level video generation rules. It integrates LLMs with video analysis to evaluate observed behaviors against grading criteria. Evaluation on 13 real assignments with over 140 student submissions (human ground-truth labels) claims significant outperformance over prior tools in accuracy and robustness across diverse styles; a classroom study with 30 students and 10 instructors reports strong user acceptance.

Significance. If the empirical results hold, the shift to task-level behavioral abstractions via video-grounded LLM analysis addresses a core limitation of brittle, code-specific testing in heterogeneous, event-driven block-based environments. This could enable more scalable and consistent automated feedback in introductory computing education. The classroom study provides practical validation beyond lab metrics.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that Raven 'significantly outperforms prior automated assessment tools' is central but unsupported by reported metrics, baseline details, error analysis, or statistical tests (e.g., no precision/recall/F1 breakdowns, no p-values, no confusion matrices). Without these, the strength of the outperformance and robustness claims cannot be verified from the given results.
  2. [Raven Framework] Raven Framework / Methods: the integration of LLMs for video analysis of compliance with behavioral rules lacks concrete details on prompting strategies, decision criteria for 'satisfaction,' handling of edge cases in event-driven executions, or validation of LLM outputs against human judgments. This is load-bearing for reproducibility and for ruling out systematic biases in the weakest assumption (reliable LLM video analysis across heterogeneous implementations).
minor comments (2)
  1. [Abstract] Abstract: specify the exact performance metrics (e.g., accuracy, F1) and the identities of the 'prior automated assessment tools' to strengthen the summary of results.
  2. [Classroom Study] The classroom study description would benefit from more detail on the protocol, survey instruments, and quantitative acceptance measures to allow readers to assess the strength of the 'strong user acceptance' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of task-level video-grounded evaluation for Scratch assessment. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that Raven 'significantly outperforms prior automated assessment tools' is central but unsupported by reported metrics, baseline details, error analysis, or statistical tests (e.g., no precision/recall/F1 breakdowns, no p-values, no confusion matrices). Without these, the strength of the outperformance and robustness claims cannot be verified from the given results.

    Authors: We agree that the evaluation would be more convincing with additional granularity. The current manuscript reports overall accuracy gains and robustness across 13 assignments with human ground-truth labels, but does not include the requested breakdowns. In the revised version we will add per-assignment and aggregate precision/recall/F1 scores, confusion matrices, explicit baseline implementations and their results, a categorized error analysis of disagreements, and statistical significance tests (e.g., McNemar’s test with p-values) to substantiate the outperformance claims. revision: yes

  2. Referee: [Raven Framework] Raven Framework / Methods: the integration of LLMs for video analysis of compliance with behavioral rules lacks concrete details on prompting strategies, decision criteria for 'satisfaction,' handling of edge cases in event-driven executions, or validation of LLM outputs against human judgments. This is load-bearing for reproducibility and for ruling out systematic biases in the weakest assumption (reliable LLM video analysis across heterogeneous implementations).

    Authors: We concur that these details are essential for reproducibility. The revised Methods section will specify the prompting strategies (including chain-of-thought and few-shot examples), the precise decision criteria for rule satisfaction (e.g., LLM confidence thresholds or voting schemes), explicit handling of event-driven edge cases such as timing variability and multi-sprite interactions, and a validation experiment comparing LLM judgments to human annotations on a held-out video sample. We will also discuss observed biases and mitigation steps. revision: yes
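
The granularity promised in the first response is straightforward to compute once per-submission verdicts from Raven, a baseline, and the ground truth are available. A hedged sketch using standard scikit-learn and SciPy calls, illustrative rather than the authors' evaluation code:

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def compare_graders(y_true, y_raven, y_baseline):
    """Per-tool accuracy metrics plus an exact McNemar test on paired errors."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_raven, average="binary")
    cm = confusion_matrix(y_true, y_raven)

    raven_ok = np.asarray(y_raven) == np.asarray(y_true)
    base_ok = np.asarray(y_baseline) == np.asarray(y_true)
    b = int(np.sum(raven_ok & ~base_ok))   # Raven right, baseline wrong
    c = int(np.sum(~raven_ok & base_ok))   # baseline right, Raven wrong
    # Exact McNemar: two-sided binomial test on the discordant pairs.
    p_value = binomtest(min(b, c), b + c, 0.5).pvalue if (b + c) else 1.0

    return {"precision": p, "recall": r, "f1": f1,
            "confusion_matrix": cm, "mcnemar_p": p_value}
```

For the second response, the kind of prompt structure described (chain-of-thought reasoning, a format-only worked example, and a machine-parseable verdict) might look like the following generic illustration; it is not the authors' actual prompt.

```python
# A generic rule-satisfaction prompt template; not taken from the paper.
RULE_CHECK_PROMPT = """You are grading a recorded run of a Scratch project.

Rule: {rule_description}

Example (for format only):
Rule: the sprite says "hello" after the green flag is clicked.
Reasoning: the sprite is idle until second 2, then a speech bubble with "hello" appears.
Verdict: YES

Now analyze the attached frames step by step, then end with exactly one line:
Verdict: YES or Verdict: NO
"""

def parse_verdict(model_output: str) -> bool:
    """Accept the run only if the final verdict line is an explicit YES."""
    lines = [ln for ln in model_output.splitlines()
             if ln.strip().lower().startswith("verdict:")]
    return bool(lines) and lines[-1].split(":", 1)[1].strip().upper().startswith("YES")
```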

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is an empirical systems paper whose central claims rest on direct comparison of Raven's outputs against independent human ground-truth labels for 140 student submissions across 13 assignments, plus a separate classroom study. No mathematical derivations, parameter fittings, or self-referential predictions appear in the provided text; evaluation metrics are computed from external annotations rather than any internal construction that reduces to the paper's own inputs. The design choices (video-grounded task-level rules) are presented as engineering decisions validated by the external benchmark, with no load-bearing self-citations or ansatzes that collapse the result to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems and HCI paper. No mathematical free parameters, formal axioms, or newly invented physical entities are described in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1038 out tokens · 29745 ms · 2026-05-10T04:49:41.175220+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 60 canonical work pages · 6 internal anchors

  1. [1] Dimah Al-Fraihat, Mike Joy, Ra’ed Masa’deh, and Jane Sinclair. 2020. Evaluating E-learning systems success: An empirical study. Computers in Human Behavior 102 (2020), 67–86. doi:10.1016/j.chb.2019.08.004
  2. [2] Kirsti M Ala-Mutka. 2005. A Survey of Automated Assessment Approaches for Programming Assignments. Computer Science Education 15, 2 (2005), 83–102. doi:10.1080/08993400500150747
  3. [3] Sinan Ariyurek, Aysu Betin-Can, and Elif Surer. 2021. Automated Video Game Testing Using Synthetic and Humanlike Agents. IEEE Transactions on Games 13, 1 (2021), 50–67. doi:10.1109/TG.2019.2947597
  4. [4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. 2025. Qwen3-VL Technical Report. arXiv:2511.21631 [cs.CV] https://arxiv.org/abs/2511.21631
  5. [5] Shuai Bai, Keqin Chen, Xuejing Liu, et al. 2025. Qwen2.5-VL Technical Report. arXiv:2502.13923 [cs.CV] https://arxiv.org/abs/2502.13923
  6. [6] Mohammad Bajammal, Andrea Stocco, Davood Mazinanian, and Ali Mesbah. 2022. A Survey on the Use of Computer Vision to Improve Software Engineering Tasks. IEEE Transactions on Software Engineering 48, 5 (2022), 1722–1742. doi:10.1109/TSE.2020.3032986
  7. [7] Ishan Banerjee, Bao Nguyen, Vahid Garousi, and Atif Memon. 2013. Graphical user interface (GUI) testing: Systematic mapping and repository. Information and Software Technology 55, 10 (2013), 1679–1694. doi:10.1016/j.infsof.2013.03.004
  8. [8] Bryce Boe, Charlotte Hill, Michelle Len, Greg Dreschler, Phillip Conrad, and Diana Franklin. 2013. Hairball: lint-inspired static analysis of scratch projects. In Proceeding of the 44th ACM Technical Symposium on Computer Science Education (Denver, Colorado, USA) (SIGCSE ’13). Association for Computing Machinery, New York, NY, USA, 215–220. doi:10.1145/2445...
  9. [9] Janet Carter, Kirsti Ala-Mutka, Ursula Fuller, Martin Dick, John English, William Fone, and Judy Sheard. 2003. How shall we assess this?. In Working Group Reports from ITiCSE on Innovation and Technology in Computer Science Education (Thessaloniki, Greece) (ITiCSE-WGR ’03). Association for Computing Machinery, New York, NY, USA, 107–123. doi:10.1145/960875.960539
  10. [10] Cecilia Ka Yuk Chan and Wenjie Hu. 2023. Students’ voices on generative AI: perceptions, benefits, and challenges in higher education. International Journal of Educational Technology in Higher Education 20, 1 (July 2023), 43. doi:10.1186/s41239-023-00411-8
  11. [11] Li-Hsin Chang and Filip Ginter. 2024. Automatic Short Answer Grading for Finnish with ChatGPT. Proceedings of the AAAI Conference on Artificial Intelligence 38, 21 (Mar. 2024), 23173–23181. doi:10.1609/aaai.v38i21.30363
  12. [12] Tsung-Hsiang Chang, Tom Yeh, and Robert C. Miller. 2010. GUI testing using computer vision. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA) (CHI ’10). Association for Computing Machinery, New York, NY, USA, 1535–1544. doi:10.1145/1753326.1753555
  13. [13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374. https://arxiv.org/abs/2107.03374
  14. [14] Fred D. Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q. 13, 3 (Sept. 1989), 319–340. doi:10.2307/249008
  15. [15] Adina Deiner, Patric Feldmeier, Gordon Fraser, Sebastian Schweikl, and Wengran Wang. 2023. Automated Test Generation for Scratch Programs. Empirical Software Engineering 28, 1 (2023), 79. doi:10.1007/s10664-022-10255-x
  16. [16] Adina Deiner, Christoph Frädrich, Gordon Fraser, Sophia Geserer, and Niklas Zantner. 2020. Search-based Testing for Scratch Programs. CoRR abs/2009.04115 (2020). arXiv:2009.04115 https://arxiv.org/abs/2009.04115
  17. [17] Adina Deiner and Gordon Fraser. 2024. NuzzleBug: Debugging Block-Based Programs in Scratch. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE ’24). 1–2. doi:10.1145/3597503.3623331
  18. [18] Benedikt Fein, Florian Obermüller, and Gordon Fraser. 2022. CATNIP: An Automated Hint Generation Tool for Scratch. In Proceedings of the 27th ACM Conference on Innovation and Technology in Computer Science Education Vol. 1 (Dublin, Ireland) (ITiCSE ’22). Association for Computing Machinery, New York, NY, USA, 124–130. doi:10.1145/3502718.3524820
  19. [19] Andrina Granić and Nikola Marangunić. 2019. Technology acceptance model in educational context: A systematic literature review. British Journal of Educational Technology 50, 5 (2019), 2572–2593. doi:10.1111/bjet.12864
  20. [20] Christian Grévisse. 2024. LLM-based automatic short answer grading in undergraduate medical education. BMC Medical Education 24, 1 (Sept. 2024), 1060. doi:10.1186/s12909-024-06026-5
  21. [21] Jialiang Gu, Keren Zhou, Daming Li, Hanyuan Shi, and Jialu Zhang. 2026. Context-Aware Feedback Compression in Online Judge Programming with LLMs. In Proceedings of the 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE Companion ’26) (Montreal, QC, Canada, 5–9 July 2026). ...
  22. [22] Katharina Götz, Patric Feldmeier, and Gordon Fraser. 2022. Model-based Testing of Scratch Programs. arXiv:2202.06271 [cs.SE] https://arxiv.org/abs/2202.06271
  23. [23] Felienne Hermans and Efthimia Aivaloglou. 2016. Do code smells hamper novice programming? A controlled experiment on Scratch programs. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). 1–10. doi:10.1109/ICPC.2016.7503706
  24. [24] Silas Hsu, Tiffany Wenting Li, Zhilin Zhang, Max Fowler, Craig Zilles, and Karrie Karahalios. 2021. Attitudes Surrounding an Imperfect AI Autograder (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 681, 15 pages. doi:10.1145/3411764.3445424
  25. [25] Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, et al. 2024. CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 6...
  26. [26] Hieke Keuning, Bastiaan Heeren, and Johan Jeuring. 2017. Code Quality Issues in Student Programs. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE ’17). Association for Computing Machinery, New York, NY, USA, 110–115. doi:10.1145/3059009.3059061
  27. [27] Hieke Keuning, Johan Jeuring, and Bastiaan Heeren. 2019. A Systematic Literature Review of Automated Feedback Generation for Programming Exercises. ACM Transactions on Computing Education 19, 1 (2019), 3:1–3:43. doi:10.1145/3231711
  28. [28] Juho Leinonen, Paul Denny, Stephen MacNeil, et al. 2023. Comparing Code Explanations Created by Students and Large Language Models. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2023). Association for Computing Machinery, New York, NY, USA, 124–130. doi:10.1145/3587102.3588785
  29. [29] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. DroidBot: a lightweight UI-Guided test input generator for android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). 23–26. doi:10.1109/ICSE-C.2017.8
  30. [30] Xiaoyun Liang, Jiayi Qi, Yongqiang Gao, Chao Peng, and Ping Yang. 2023. AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA) (ESEC/FSE 2023). Association for Computing Machinery, Ne...
  31. [31] Qianou Ma, Hua Shen, Kenneth Koedinger, and Sherry Tongshuang Wu. 2024. How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging. Springer Nature Switzerland, 265–279. doi:10.1007/978-3-031-64302-6_19
  32. [32] John Maloney, Mitchel Resnick, Natalie Rusk, Brian Silverman, and Evelyn Eastmond. 2010. The Scratch Programming Language and Environment. 10, 4, Article 16 (Nov. 2010), 15 pages. doi:10.1145/1868358.1868363
  33. [33] Marcus Messer, Neil C. C. Brown, Michael Kölling, and Miaojing Shi. 2024. Automated Grading and Feedback Tools for Programming Education: A Systematic Review. 24, 1, Article 10 (Feb. 2024), 43 pages. doi:10.1145/3636515
  34. [34] Jesús Moreno-León and Gregorio Robles. 2015. Dr. Scratch: a Web Tool to Automatically Evaluate Scratch Projects. In Proceedings of the Workshop in Primary and Secondary Computing Education (London, United Kingdom) (WiPSCE ’15). Association for Computing Machinery, New York, NY, USA, 132–133. doi:10.1145/2818314.2818338
  35. [35] Tushar Nagarajan and Kristen Grauman. 2021. Shaping embodied agent behavior with activity-context priors from egocentric video. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 29794–29805. https://proceedings.neurips.cc/paper_files/p...
  36. [36] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 97, 13 pages. doi:10.1145/3597503.3639187
  37. [37] Seong-Guk Nam and Yeong-Seok Seo. 2023. GUI Component Detection-Based Automated Software Crash Diagnosis. Electronics 12, 11 (2023). doi:10.3390/electronics12112382
  38. [38] Bao N. Nguyen, Bryan Robbins, Ishan Banerjee, and Atif Memon. 2014. GUITAR: an innovative tool for automated testing of GUI-driven software. Automated Software Engineering 21, 1 (March 2014), 65–105. doi:10.1007/s10515-013-0128-9
  39. [39] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.org/abs/2303.08774
  40. [40] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007. Feedback-Directed Random Test Generation. In 29th International Conference on Software Engineering (ICSE ’07). 75–84. doi:10.1109/ICSE.2007.37
  41. [41] José Carlos Paiva, José Paulo Leal, and Álvaro Figueira. 2022. Automated Assessment in Computer Science Education: A State-of-the-Art Review. ACM Trans. Comput. Educ. 22, 3, Article 34 (June 2022), 40 pages. doi:10.1145/3513140
  42. [42] Raphael Pham, Helge Holzmann, Kurt Schneider, and Christian Brüggemann. 2014. Tailoring video recording to support efficient GUI testing and debugging. Software Quality Journal 22, 2 (June 2014), 273–292. doi:10.1007/s11219-013-9206-2
  43. [43] Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. Proceedings of the 16th International Conference on Educational Data Mining (EDM 2023) (2023), 370–377. doi:10.5281/zenodo.8115653
  44. [44] Thomas W. Price, Yihuan Dong, and Dragan Lipovac. 2017. iSnap: Towards Intelligent Tutoring in Novice Programming Environments. In SIGCSE. 483–488. doi:10.1145/3017680.3017762
  45. [45] Mitchel Resnick. 2017. Lifelong Kindergarten: Cultivating Creativity through Projects, Passion, Peers, and Play. MIT Press, Cambridge, MA. https://mitpress.mit.edu/9780262037297/lifelong-kindergarten/
  46. [46] Mitchel Resnick, John Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evelyn Eastmond, Karen Brennan, Amon Millner, Eric Rosenbaum, Jay Silver, Brian Silverman, and Yasmin Kafai. 2009. Scratch: programming for all. Commun. ACM 52, 11 (Nov. 2009), 60–67. doi:10.1145/1592761.1592779
  47. [47] Zachary P. Reynolds, Abhinandan B. Jayanth, Ugur Koc, et al. 2017. Identifying and Documenting False Positive Patterns Generated by Static Code Analysis Tools. In 2017 IEEE/ACM 4th International Workshop on Software Engineering Research and Industrial Practice (SER&IP). 55–61. doi:10.1109/SER-IP.2017.20
  48. [48] Kelly Rivers and Kenneth R. Koedinger. 2017. Data-Driven Hint Generation in Vast Solution Spaces: A Self-Improving Python Programming Tutor. International Journal of Artificial Intelligence in Education 27, 1 (2017), 37–64. doi:10.1007/s40593-015-0070-z
  49. [49] Marcos Román-González, Jesús Moreno-León, and Gregorio Robles. 2017. Complementary Tools for Computational Thinking Assessment. https://www.researchgate.net/publication/318469859_Complementary_Tools_for_Computational_Thinking_Assessment
  50. [50] Mark Santolucito, Jialu Zhang, Ennan Zhai, Jürgen Cito, and Ruzica Piskac. 2022. Learning CI Configuration Correctness for Early Build Feedback. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 1006–1017. doi:10.1109/SANER53432.2022.00118
  51. [51] Ronny Scherer, Fazilat Siddiq, and Jo Tondeur. 2019. The technology acceptance model (TAM): A meta-analytic structural equation modeling approach to explaining teachers’ adoption of digital technology in education. Computers & Education 128 (2019), 13–35. doi:10.1016/j.compedu.2018.09.009
  52. [52] Sebastian Schweikl and Gordon Fraser. 2025. RePurr: Automated Repair of Block-Based Learners’ Programs. Proc. ACM Softw. Eng. 2, FSE, Article FSE067 (June 2025), 24 pages. doi:10.1145/3715786
  53. [53] Scratch Foundation. 2026. Scratch Statistics - Scratch Imagine, Program, Share. https://scratch.mit.edu/statistics/. Accessed: 2026-01-29
  54. [54] Yuan Si, Simeng Han, Daming Li, Hanyuan Shi, and Jialu Zhang. 2026. ScratchEval: A Multimodal Evaluation Framework for LLMs in Block-Based Programming. arXiv:2602.00757 [cs.SE] https://arxiv.org/abs/2602.00757
  55. [55] Yuan Si, Daming Li, Hanyuan Shi, and Jialu Zhang. 2025. ViScratch: Using Large Language Models and Gameplay Videos for Automated Feedback in Scratch. arXiv:2509.11065 [cs.SE] https://arxiv.org/abs/2509.11065
  56. [56] Yuan Si, Kyle Qi, Daming Li, Hanyuan Shi, and Jialu Zhang. 2025. Stitch: Step-by-step LLM Guided Tutoring for Scratch. arXiv:2510.26634 [cs.SE] https://arxiv.org/abs/2510.26634
  57. [57] Yuan Si, Ming Wang, Daming Li, Hanyuan Shi, and Jialu Zhang. 2026. EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback. arXiv:2603.29624 [cs.SE] https://arxiv.org/abs/2603.29624
  58. [58] Andreas Stahlbauer, Marvin Kreis, and Gordon Fraser. 2019. Testing scratch programs automatically. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Tallinn, Estonia) (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 165–175. doi:10.11...
  59. [59] Niko Strijbol, Robbe De Proft, Klaas Goethals, Bart Mesuere, Peter Dawyndt, and Christophe Scholliers. 2024. Blink: An educational software debugger for Scratch. SoftwareX 25 (2024), 101617. doi:10.1016/j.softx.2023.101617
  60. [60] Shao-Hua Sun, Hyeonwoo Noh, Sriram Somasundaram, and Joseph Lim. 2018. Neural Program Synthesis from Diverse Demonstration Videos. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4790–4799. https://proceedings.mlr.press/v80/sun18a.html
  61. [61] Xiaodan Tang, Yue Yin, Qiao Lin, Roxana Hadad, and Xiaoming Zhai. 2020. Assessing computational thinking: A systematic review of empirical studies. Computers & Education 148 (2020), 103798. doi:10.1016/j.compedu.2019.103798
  62. [62] Peng Wang, Shuai Bai, Sinan Tan, et al. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://arxiv.org/abs/2409.12191
  63. [63] Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. 2024. VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models. arXiv:2406.16338 [cs.CV] https://arxiv.org/abs/2406.16338
  64. [64] An Yang, Anfeng Li, Baosong Yang, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388
  65. [65] Zhengyuan Yang, Linjie Li, Kevin Lin, et al. 2023. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv:2309.17421 [cs.CV] https://arxiv.org/abs/2309.17421
  66. [66] Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology (Victoria, BC, Canada) (UIST ’09). Association for Computing Machinery, New York, NY, USA, 183–192. doi:10.1145/1622176.1622213
  67. [67] Jialu Zhang, José Pablo Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen.
  68. [68] PyDex: Repairing Bugs in Introductory Python Assignments using LLMs. Proc. ACM Program. Lang. 8, OOPSLA1 (2024), 1100–1124. doi:10.1145/3649850
  69. [69] Jialu Zhang, Jialiang Gu, Wangmeiyu Zhang, José Pablo Cambronero, John Kolesar, Ruzica Piskac, Daming Li, and Hanyuan Shi. 2025. A Systematic Study of Time Limit Exceeded Errors in Online Programming Assignments. arXiv:2510.14339 [cs.SE] https://arxiv.org/abs/2510.14339
  70. [70] Jialu Zhang, De Li, John Charles Kolesar, Hanyuan Shi, and Ruzica Piskac. 2023. Automated Feedback Generation for Competition-Level Code. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (Rochester, MI, USA) (ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 13, 13 pages. doi:10.1145/35513...
  71. [71] Jialu Zhang, Todd Mytkowicz, Mike Kaufman, Ruzica Piskac, and Shuvendu K. Lahiri. 2022. Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper). In ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18–22, 2022, Sukyoung Ryu and Yannis Smaragd...
  72. [72] Yan Zheng, Xiaofei Xie, Ting Su, Lei Ma, et al. 2019. Wuji: Automatic Online Combat Game Testing Using Evolutionary Deep Reinforcement Learning. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 772–784. doi:10.1109/ASE.2019.00077