NL2Scratch: An Executable Benchmark and Evaluation for Block-Based Programming

Alexandre Ballenghien; April Yi Wang; Heejin Do; Yang Wu

arxiv: 2606.22061 · v1 · pith:3KYYQ7F2new · submitted 2026-06-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-26 11:54 UTC · model grok-4.3

keywords: NL2Scratch · Semantic Alignment Consistency · block-based programming · NL2Code · Scratch · LLM evaluation · executable benchmark · semantic alignment

The pith

NL2Scratch benchmark shows LLMs often fail semantic alignment on Scratch programs despite high lexical scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates NL2Scratch, a benchmark of over 300,000 natural-language descriptions paired with real Scratch programs extracted from public projects. It introduces Semantic Alignment Consistency, a slot-level metric that checks agreement on specific program elements rather than surface token overlap. Experiments across multiple LLMs find that high token F1 scores frequently coincide with imperfect SAC scores, with mistakes clustering on actions, conditions, and numeric values especially in longer scripts. This matters for early programming education because Scratch programs are event-driven and concurrent, so conventional NL2Code metrics miss key failure modes. The authors also release a smaller validated diagnostic set of 800 examples for targeted testing.

Core claim

The paper establishes that lexical metrics such as token-level F1 do not reliably indicate semantic correctness in natural-language-to-Scratch generation; models can score above 0.93 F1 yet still produce programs that mismatch on operational slots, and this discrepancy is measured by the new Semantic Alignment Consistency metric on an executable benchmark of 311,648 parser-valid pairs.
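A small worked example may make this gap concrete. The sketch below uses a bag-of-tokens F1 over whitespace tokens, which is one common definition; the paper's exact tokenization is not given in this summary, so treat `token_f1` and the toy script as illustrative assumptions rather than the authors' evaluation code.

```python
from collections import Counter


def token_f1(pred: str, ref: str) -> float:
    """Bag-of-tokens F1 over whitespace tokens (assumed definition)."""
    p, r = pred.split(), ref.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference script in scratchblocks-style pseudocode.
reference = """when green flag clicked
forever
    if <touching (edge v)?> then
        turn right (15) degrees
    end
    move (10) steps
end"""

# The prediction differs in exactly one numeric argument: 10 becomes 100.
prediction = reference.replace("(10)", "(100)")

score = token_f1(prediction, reference)
print(f"token F1 = {score:.3f}")
```

Here 18 of 19 tokens match, so the F1 is about 0.947, yet the sprite moves ten times farther per step than the description asks. A slot-level check on the numeric argument would flag this; token overlap does not.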

What carries the argument

NL2Scratch executable benchmark of 311,648 NL-program pairs together with the Semantic Alignment Consistency (SAC) slot-level metric that scores agreement on individual program elements such as actions, conditions, and arguments.

If this is right

Evaluation of NL-to-block-code models must move beyond token overlap to slot-level semantic checks.
Operational slots such as actions, conditions, and numeric arguments require targeted model improvements.
Longer Scratch scripts expose larger semantic gaps that short-example benchmarks would miss.
The 23,594 semantically validated examples provide a cleaner test set for future work.
Executable benchmarks are needed for other visual programming languages where conventional metrics fall short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar semantic slot metrics could be adapted for other event-driven or concurrent languages used in education.
Focusing training data on operational elements might close the observed lexical-semantic gap.
The benchmark could support automatic repair systems that fix only the mismatched slots rather than regenerating entire programs.

Load-bearing premise

The 311,648 parser-valid pairs are semantically aligned with their natural-language descriptions and the SAC metric correctly captures that alignment.

What would settle it

Human raters independently scoring a sample of model outputs on the same slot categories used by SAC and finding that high-F1 models achieve near-perfect slot agreement on the validated subset.

Figures

Figures reproduced from arXiv: 2606.22061 by Alexandre Ballenghien, April Yi Wang, Heejin Do, Yang Wu.

**Figure 1.** NL2Scratch data construction pipeline. Raw Scratch projects (.sb files) are converted into normalized scratchblocks. The outputs are then validated through two complementary filters: (1) parser-based validation, which ensures structural validity and renderability as executable Scratch Interface, and (2) SAC, which verifies semantic agreement between NL descriptions and programs across behavioral slots. Onl…

**Figure 3.** Distributions of Flesch Reading Ease and FK

**Figure 4.** Average SAC_nl score by semantic slot. behavior details such as event triggers, conditions, actions, or numeric arguments. These discrepancies often remain invisible to overlap-based metrics or grammar-grounded execution success rate. The same pattern is reflected in SAC_ref. Average semantic alignment scores are generally high, indicating that models recover much of the intended program content. However…

**Figure 5.** Effect of reranking strategies on generation quality. We compare first-candidate decoding, likelihood-based reranking (Lik.), parser-aware reranking (Parse+Lik.), SAC-based reranking (SAC), and parser-constrained SAC reranking (Parse+SAC) for Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct. While likelihood- and parser-based reranking yield modest improvements, SAC-guided reranking consistently produces the …

**Figure 6.** Per-bucket performance by program length.
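The five reranking strategies named in the Figure 5 caption can be sketched as a single selection function. This is a minimal illustration, not the authors' implementation: the `Candidate` fields, the fallback to all candidates when none parse, and the likelihood tie-break under SAC are assumptions, since the caption does not say how the parser constraint and the SAC score are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str      # generated Scratch pseudocode
    loglik: float  # sequence log-likelihood from the generator
    parses: bool   # True if a Scratch parser accepts the output
    sac: float     # SAC score against the NL description, in [0, 1]


def rerank(cands: list[Candidate], strategy: str) -> Candidate:
    """Pick one candidate; strategy is first, lik, parse+lik, sac, or parse+sac."""
    if strategy == "first":
        return cands[0]
    pool = cands
    if strategy.startswith("parse+"):
        # Assumed behavior: keep parser-valid candidates, else fall back to all.
        pool = [c for c in cands if c.parses] or cands
    if strategy.endswith("lik"):
        return max(pool, key=lambda c: c.loglik)
    if strategy.endswith("sac"):
        # Assumed tie-break: higher likelihood wins among equal SAC scores.
        return max(pool, key=lambda c: (c.sac, c.loglik))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical candidate pool for one description.
cands = [
    Candidate("a", loglik=-1.0, parses=False, sac=0.6),
    Candidate("b", loglik=-2.0, parses=True, sac=0.9),
    Candidate("c", loglik=-1.5, parses=True, sac=0.7),
    Candidate("d", loglik=-3.0, parses=False, sac=1.0),
]
for name in ("first", "lik", "parse+lik", "sac", "parse+sac"):
    print(name, "->", rerank(cands, name).text)
```

On this toy pool the strategies pick different candidates, which is the point of the comparison: the most likely output need not parse, and the output with the best SAC score need not parse either.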

Original abstract

Block-based programming environments such as Scratch are widely used in early programming education, yet natural-language-to-code (NL2Code) research has focused primarily on text-based languages. Scratch programs are event-driven, visually compositional, and distributed across concurrent scripts, making conventional NL2Code assumptions and evaluation insufficient. We introduce NL2Scratch, an executable benchmark for natural-language-to-Scratch generation comprising 311,648 parser-valid NL-program pairs, whose program side is extracted from real Scratch projects and paired with semantically aligned NL descriptions. For reliable evaluation beyond surface overlap, we propose Semantic Alignment Consistency (SAC), an interpretable slot-level metric for measuring semantic agreement between descriptions and programs. With SAC, we construct a semantically validated pool of 23,594 examples, and a slot-balanced 800-example diagnostic benchmark. Experiments across instruction-tuned and fine-tuned LLMs reveal a notable gap between lexical similarity and semantic alignment: models achieving token-level F1 above 0.93 often fail to attain perfect SAC, particularly on longer examples. Errors concentrate on operational slots like actions, conditions, and numeric arguments, exposing failure modes largely invisible under conventional metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NL2Scratch is a useful new benchmark for NL-to-Scratch generation, but the semantic validation steps for the 23k pool and the SAC metric itself are not described, so the reported gap between token F1 and semantic alignment cannot be interpreted yet.

The paper's main contribution is a large executable benchmark of 311k NL-Scratch pairs pulled from real projects, plus a slot-level metric (SAC) meant to catch semantic mismatches that token overlap misses. They also release a validated 23k subset and an 800-example diagnostic set balanced across slots. That fills a real gap: Scratch is common in schools but NL2Code work has stayed on text languages. The experiments show models with high token F1 still drop on operational slots like actions and conditions, which is the kind of concrete finding that could guide future work.

The soft spot is exactly the one the stress-test note flags. The abstract says they built a "semantically validated pool" but gives zero information on how: no annotator count, instructions, IAA numbers, or exclusion rules. Without that, we cannot tell whether the SAC scores reflect real model failures or noise in the labels. The same goes for how the slots were defined and whether SAC was tuned after seeing the errors. If the full paper has a clear validation protocol with reproducible steps, this becomes a solid resource; right now the central claim rests on an unexamined step.

This is for researchers building NL2Code systems for block-based or educational languages. It is worth sending to peer review because the domain is underserved and the benchmark size is substantial, but the authors will need to supply the missing validation details before it can be taken as evidence of model limitations.

Referee Report

2 major / 1 minor

Summary. The paper introduces NL2Scratch, an executable benchmark for NL-to-Scratch generation with 311,648 parser-valid NL-program pairs extracted from real projects and paired with semantically aligned descriptions. It proposes the slot-level Semantic Alignment Consistency (SAC) metric, constructs a validated pool of 23,594 examples plus a slot-balanced 800-example diagnostic set, and reports experiments showing that LLMs with token F1 >0.93 often fail to reach perfect SAC, with errors concentrated on actions, conditions, and numeric arguments.

Significance. If the validation protocol and SAC metric prove reliable, the work supplies a needed executable benchmark for block-based languages that are central to early CS education but absent from most NL2Code research. The emphasis on executable semantics and slot-level errors offers a concrete way to expose limitations of lexical metrics.

major comments (2)

[Abstract] Abstract: the claim that the 23,594-example pool is 'semantically validated' and that the observed F1-SAC gap reflects genuine model failure modes is load-bearing, yet the abstract supplies no information on the validation protocol (annotator count, instructions, IAA, adjudication rules, or exclusion criteria). Without these details the interpretability of SAC and the headline result cannot be assessed.
[Abstract] Abstract: the extraction and semantic-alignment procedure for the full 311,648 parser-valid pairs is described only at the level of 'extracted from real Scratch projects and paired with semantically aligned NL descriptions,' with no further specification of how alignment was established or verified; this underpins every downstream claim about model performance.

minor comments (1)

[Abstract] The abstract states that a 'slot-balanced 800-example diagnostic benchmark' was derived but does not define the slot inventory, balancing procedure, or selection criteria.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that greater specificity is needed there to support the interpretability of our claims and will revise the abstract accordingly while ensuring the body of the paper already contains the supporting details.

Referee: [Abstract] Abstract: the claim that the 23,594-example pool is 'semantically validated' and that the observed F1-SAC gap reflects genuine model failure modes is load-bearing, yet the abstract supplies no information on the validation protocol (annotator count, instructions, IAA, adjudication rules, or exclusion criteria). Without these details the interpretability of SAC and the headline result cannot be assessed.

Authors: We agree that the abstract should briefly indicate the validation protocol to allow readers to assess the claims without immediately consulting the body. The full protocol (including annotator procedures, instructions, agreement metrics, adjudication, and exclusion rules) is described in Section 3.2. We will revise the abstract to include a concise clause summarizing the validation approach and will ensure the revised abstract remains within length limits. (Revision: yes)

Referee: [Abstract] Abstract: the extraction and semantic-alignment procedure for the full 311,648 parser-valid pairs is described only at the level of 'extracted from real Scratch projects and paired with semantically aligned NL descriptions,' with no further specification of how alignment was established or verified; this underpins every downstream claim about model performance.

Authors: We acknowledge that the abstract's phrasing is high-level and does not specify the alignment verification steps. The extraction pipeline, parser validation, and semantic-alignment method (including how descriptions were generated and checked for fidelity to the programs) are detailed in Section 3.1. We will revise the abstract to add a short clause indicating that alignment was established via the procedure described in the methods, thereby making the foundation of the benchmark clearer at the abstract level. (Revision: yes)

Circularity Check

0 steps flagged

No circularity; benchmark construction with independent empirical claims

The paper constructs a dataset of NL-program pairs and proposes the SAC metric for evaluation. No derivations, equations, fitted parameters, or predictions are described that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked. The central result (gap between F1 and SAC) is an empirical observation on the constructed benchmark rather than a self-referential derivation. This matches the default expectation of no significant circularity for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark and evaluation paper; contains no mathematical derivations, fitted parameters, or postulated entities.

Reference graph

Works this paper leans on

28 extracted references · 2 linked inside Pith

[1] Litterbox+: An extensible framework for llm-enhanced scratch static code analysis. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE. Tool Demonstration Track. James Finnie-Ansley, Paul Denny, Brett A Becker, Andrew Luxton-Reilly, and James Prather. 2022. The robots are coming: Exploring the impli...

[2] Studying the effect of ai code generators on supporting novice learners in introductory programming. In Proceedings of the 2023 CHI conference on human factors in computing systems, pages 1–23. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii...

[3] Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297. Mitchel Resnick, John Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evel...

[4] Viscratch: Using large language models and gameplay videos for automated feedback in scratch. arXiv preprint arXiv:2509.11065. Jake Trower and Jeff Gray. 2015. Blockly language creation and applications: Visual programming for media computation and bluetooth robotics control. In Proceedings of the 46th ACM Technical Symposium on Computer Science Educatio...


Output only Scratch pseudocode, with one block per line and no markdown
[6]

Do not output opcode keys

Use exact Scratch-style block text from the exam- ples. Do not output opcode keys
[7]

move (10) steps

Stack/command blocks are plain lines, e.g. move (10) steps
[8]

(x position),((score) + (1))

Reporter inputs must be wrapped in parentheses, e.g. (x position),((score) + (1))
[9]

Boolean conditions must be wrapped in angle brack- ets, e.g.<mouse down?>,<touching (edge v)?>
[10]

[message1],[costume1],[score v]

Text/name inputs use square brackets, e.g. [message1],[costume1],[score v]
[11]

(space v),(random position v),[all v]

Menu/dropdown inputs include the v marker, e.g. (space v),(random position v),[all v]
[12]

Control blocks use exact forms such asrepeat (10) , forever, if <condition> then , if <condition> then else,wait until <condition>
[13]

Indent nested blocks with exactly 4 spaces and close every C-block withend
[14]

For if else, use: if <condition> then , true branch,else, false branch,end
[15]

Preserve names, messages, numbers, signs, decimal values, and action order from the natural language
[16]

If no event is described, do not invent one. [Retrieved Examples]Here are some examples: 20 * Example (Natural Language, Pseudocode) [Test Item]Now, please generate the Scratch pseu- docode for the following description: Natural Language:{natrual_language_query} Pseudocode:[To be generated] Natural-Language Rewriting Prompt used for Dataset Construction [...
[17]

Keep the meaning exactly the same
[18]

Do not add or remove any steps
[19]

Keep all numbers exactly the same
[20]

Keep triggers and conditions explicit
[21]

Keep variable, sprite, costume, backdrop, and message names unchanged
[22]

Remove awkward menu markers and quoting when possible
[23]

q key” instead of “q v

For key names, say things like “q key” instead of “q v”
[24]

Do not output pseudocode
[25]

Output only the rewritten natural-language in- struction. [User Prompt Template] Scratchblocks pseudocode: {pseudocode} Base instruction: {base_nl} Rewrite it as one natural child-like instruction: 11 B Additional Dataset Statistics Table 3 reports SAC-slot coverage across the full parser-valid corpus, the SAC-filtered high- confidence pool, and the final...
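Several of the generation-prompt format rules listed above (no markdown, 4-space indentation for nested blocks, every C-block closed with `end`) are mechanical enough to check automatically. The sketch below is a hypothetical lint pass, not the parser-based validation the paper uses; `check_format` and the `C_BLOCK` pattern are illustrative names, and only the control forms quoted in the rules are recognized.

```python
import re

# Lines that open a C-block, limited to the forms quoted in the prompt rules.
C_BLOCK = re.compile(r"^(repeat \(.*\)|forever|if <.*> then)$")


def check_format(script: str) -> list[str]:
    """Return a list of format problems; an empty list means none were found."""
    errors = []
    depth = 0  # number of currently open C-blocks
    for n, raw in enumerate(script.splitlines(), start=1):
        if not raw.strip():
            continue
        stripped = raw.lstrip(" ")
        indent = len(raw) - len(stripped)
        line = stripped.rstrip()
        if line.startswith("```"):
            errors.append(f"line {n}: markdown fence")
            continue
        closes_or_splits = line in ("end", "else")
        if closes_or_splits and depth == 0:
            errors.append(f"line {n}: '{line}' without an open C-block")
            continue
        # `end` and `else` align with the block they belong to.
        expected = depth - 1 if closes_or_splits else depth
        if indent != 4 * expected:
            errors.append(f"line {n}: expected indent {4 * expected}, got {indent}")
        if line == "end":
            depth -= 1
        elif C_BLOCK.match(line):
            depth += 1
    if depth:
        errors.append(f"{depth} C-block(s) not closed with 'end'")
    return errors


good = "\n".join([
    "when green flag clicked",
    "forever",
    "    if <touching (edge v)?> then",
    "        turn right (15) degrees",
    "    else",
    "        move (10) steps",
    "    end",
    "end",
])
bad = "\n".join(["repeat (10)", "  move (10) steps"])

print(check_format(good))
print(check_format(bad))
```

A check like this only covers surface form. It says nothing about whether block text is valid Scratch or whether the script matches the description, which is what the parser filter and SAC address.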
[26] extract a structured slot representation from the pseudocode
[27] extract a corresponding slot representation from the natural language
[28] compare the extracted slots with slot-specific similarity functions. The output includes per-slot scores, an overall alignment score, a binary perfect-alignment flag, a binary high-confidence-alignment flag, and a list of mismatched slots. We define perfect alignment as SAC = 1.0 and high-confidence alignment as SAC ≥ 0.85. Slot schema. SAC uses the ordered ...
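The comparison step and its outputs can be sketched as follows, starting from slots that have already been extracted. This is a simplified illustration under stated assumptions: the slot names (`event`, `condition`, `action`, `number`) are hypothetical because the slot schema is truncated here, Jaccard overlap stands in for the slot-specific similarity functions, and the overall score is an unweighted mean. Only the output fields and the 1.0 and 0.85 thresholds come from the text above.

```python
def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty slots count as agreeing."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sac(prog_slots: dict[str, set], nl_slots: dict[str, set]) -> dict:
    """Compare two slot representations and report SAC-style outputs."""
    slots = sorted(set(prog_slots) | set(nl_slots))
    per_slot = {
        s: jaccard(prog_slots.get(s, set()), nl_slots.get(s, set()))
        for s in slots
    }
    # Assumption: unweighted mean over slots; the paper may weight differently.
    score = sum(per_slot.values()) / len(per_slot) if per_slot else 1.0
    mismatched = [s for s, v in per_slot.items() if v < 1.0]
    return {
        "per_slot": per_slot,
        "score": score,
        "perfect": not mismatched,        # SAC = 1.0
        "high_confidence": score >= 0.85,  # SAC >= 0.85
        "mismatched": mismatched,
    }


# Hypothetical slots: the program moves 100 steps, the description says 10.
program = {
    "event": {"green flag"},
    "condition": {"touching edge"},
    "action": {"move", "turn right"},
    "number": {"100", "15"},
}
description = {
    "event": {"green flag"},
    "condition": {"touching edge"},
    "action": {"move", "turn right"},
    "number": {"10", "15"},
}

result = sac(program, description)
print(result["score"], result["mismatched"])
```

With three slots in full agreement and the numeric slot at 1/3, the overall score is about 0.83, so this pair would fall below the high-confidence threshold and the report would name `number` as the mismatched slot.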
