PyMETA: A Benchmark Dataset for Hierarchical Student Code Error Classification with Python-Interpreter-Based Labels

Chuyue Li; Jingyi Wang; Kazuma Hashimoto; Lingyu Gao; Yu Wu; Ziqi Tang

arxiv: 2606.30610 · v1 · pith:R6IEIHSTnew · submitted 2026-06-29 · 💻 cs.SE

PyMETA: A Benchmark Dataset for Hierarchical Student Code Error Classification with Python-Interpreter-Based Labels

Chuyue Li , Ziqi Tang , Jingyi Wang , Yu Wu , Kazuma Hashimoto , Lingyu Gao This is my paper

Pith reviewed 2026-06-30 04:47 UTC · model grok-4.3

classification 💻 cs.SE

keywords code error classificationstudent submissionsPython datasethierarchical taxonomybenchmark datasetLLM evaluationmulti-error analysis

0 comments

The pith

PyMETA supplies 48,646 Python student submissions with interpreter-derived single-error labels and a 97-sample multi-error subset under a three-level taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PyMETA as a large dataset of student Python code with labels produced by running each submission in the Python interpreter to record the exception raised. Errors are organized from a top-level error versus no-error split down through intermediate categories to 14 fine-grained types that follow Python's built-in exception classes. A small expert-annotated subset allows testing cases where multiple errors coexist in one program. The authors then run classification experiments on finetuned models and prompted LLMs to measure accuracy at each level of the hierarchy and to record differences in error-type bias. A reader would care because educational debugging tools need consistent, large-scale data to train and compare systems that must handle both isolated and overlapping errors.

Core claim

The paper establishes PyMETA as a benchmark containing 48,646 single-error labeled student submissions plus 97 multi-error expert samples, organized by a three-level hierarchy grounded in Python's official exception structure, and shows through direct model comparisons that finetuned smaller models outperform prompted LLMs while most LLMs over-predict Logic Error.

What carries the argument

Python-interpreter-based labeling that executes each student submission and records the first exception raised to assign a single-error label.

If this is right

Finetuned models achieve higher macro F1 scores than prompted LLMs on the same hierarchical tasks.
LLMs exhibit systematic over-classification of submissions as Logic Error, with measurable differences across model families.
The same taxonomy supports both single-error and multi-error evaluation under a contains criterion.
Performance gaps appear consistently across the 14 fine-grained error types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could support development of feedback systems that point students to the most specific error category first.
Observed LLM biases suggest targeted fine-tuning on underrepresented error types might reduce over-prediction of Logic Error.
The hierarchical structure makes it possible to train models that first detect presence of error before naming its type.

Load-bearing premise

Running the code in the interpreter produces labels that match the actual primary error present in the student's submission even when other undetected errors exist.

What would settle it

Expert manual review of a random subset of the 48,646 samples that finds frequent mismatches between the interpreter label and the dominant error type visible in the code.

Figures

Figures reproduced from arXiv: 2606.30610 by Chuyue Li, Jingyi Wang, Kazuma Hashimoto, Lingyu Gao, Yu Wu, Ziqi Tang.

**Figure 2.** Figure 2: Row-Normalized Confusion Matrices for GPT [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Multi-Error Coverage Rate by predicted error [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 6.** Figure 6: Question ID Distribution [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: User Distribution A.5 Dataset Preprocess and Annotation The dataset used in this study is derived from the Moodle-based educational platform developed by our team, which provides anonymized user interaction logs (e.g., attempt IDs, attempt step IDs) for research purposes. All identifiers have been irreversibly anonymized prior to analysis, ensuring that no personally identifiable information (PII) is pres… view at source ↗

**Figure 9.** Figure 9: CodeBERT Task C Multi-class Classification [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 8.** Figure 8: CodeBERT Task A Binary Classification Con [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 11.** Figure 11: CodeLlama-7b TaskA Binary Classification [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 10.** Figure 10: Entropy-topk-F1 Trend Multi-Error Diagnos [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 14.** Figure 14: DeepSeek-V3 Row-Normalized Confusion Matrix of Single-Error Prompting Classification. while Gemini 2.5 Pro models obtain higher accuracy at a higher computational cost. From the results, we observe the following trends. Gemini 2.5 Pro obtains comparable performance (85.0%) but requires significantly longer inference time (24 hours). GPT-4o reaches 71.5% and DeepSeek-V3 73.6%, though GPT-4o is much faste… view at source ↗

**Figure 15.** Figure 15: GPT-4o Row-Normalized Confusion Matrix of Single-Error Prompting Classification. which Syntax Error instances are predicted as Logic Error (SE→LE). We find that such cases often involve code with syntactic errors whose surrounding structure still suggests what the student was trying to do. In these cases, models tend to pick up on the intended logic rather than flag the syntactic problem. This is consiste… view at source ↗

**Figure 16.** Figure 16: Single-Error Prompting Classification Accuracy by True Error Type. Per-class accuracy for all evaluated [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗

read the original abstract

With the advancement of Large Language Models (LLMs), code error detection has extended beyond traditional IDE diagnostics to context-sensitive debugging in educational scenarios. However, existing approaches lack large-scale datasets, multi-error analysis, and unified error taxonomies. To address this, we introduce PyMETA, a large-scale Python code error classification dataset of 48,646 student submissions, with single-error labels for all samples and a diagnostic subset of 97 expert-annotated multi-error samples. The dataset uses a three-level hierarchical taxonomy, from a binary error/no-error split down to 14 fine-grained error types grounded in Python's official exception hierarchy. We evaluate multi-level classification tasks on two finetuned models and four LLMs with prompting, comparing their classification performance and runtime cost. For multi-error prompting, the best model, Gemini 2.5 Pro, achieves 81.8% macro F1 under the "contains" criterion. We observe that: 1) prompted LLMs still underperform finetuned smaller models; 2) models exhibit significant disparities across error types; 3) most LLMs over-classify code as Logic Error, with GPT-3.5 showing the highest Logic Error Overprediction Rate and Gemini 2.5 Pro the lowest. Our work establishes a data foundation and provides insights for LLM-based code error research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PyMETA supplies a 48k-sample Python student error dataset with exception-grounded hierarchy and a small multi-error subset, but the main labels rest on unvalidated interpreter output.

read the letter

The paper's real value is the scale and structure of PyMETA: 48,646 student submissions labeled at three levels using Python's own exception hierarchy, plus 97 expert-annotated multi-error cases. That combination fills a stated gap in existing resources for code error work.

It does a few things cleanly. The taxonomy is tied directly to interpreter output rather than invented categories, the multi-error subset lets them test a "contains" criterion, and the evaluations compare fine-tuned models against prompted LLMs on both accuracy and cost. The observations about Logic Error over-prediction and performance gaps across error types are straightforward to check.

The soft spot is the label quality on the large set. The 48k single-error labels come from running the interpreter and taking the first exception, but the paper only has expert validation on the 97-sample slice. If many submissions contain multiple errors, the dominant label could be wrong at scale, and nothing in the abstract or stress-test note shows an agreement study or error-mode analysis for the bulk of the data. That assumption carries most of the dataset's claimed utility.

This is a resource paper aimed at people building or benchmarking educational code tools and LLM debuggers. Readers who need labeled Python student data will find it usable as a starting point, though they will probably want to spot-check the labels themselves.

It deserves a serious referee. The contribution is the data itself, not a new method, and the label-validation gap is fixable with additional experiments. Send it out.

Referee Report

2 major / 2 minor

Summary. The paper introduces PyMETA, a dataset of 48,646 Python student submissions with single-error labels generated via Python interpreter execution and a three-level hierarchical taxonomy (binary error/no-error down to 14 fine-grained types based on Python's official exception hierarchy). It includes a 97-sample expert-annotated multi-error diagnostic subset and reports evaluations of two finetuned models plus four LLMs on multi-level classification tasks, including observations on performance disparities and Logic Error overprediction.

Significance. If the interpreter-derived labels prove reliable at scale, the dataset would supply a substantial, reproducible resource for code error classification benchmarks in educational settings, with its hierarchical structure and partial multi-error coverage addressing gaps in prior work. The interpreter-based approach and reported model comparisons (e.g., Gemini 2.5 Pro at 81.8% macro F1 under 'contains' criterion) provide concrete starting points for future LLM-based debugging research.

major comments (2)

[Dataset Construction] Dataset Construction section: The central claim that the 48,646 samples supply reliably labeled single-error data rests on Python interpreter output identifying the dominant error. No large-scale expert validation, agreement metrics, or systematic analysis of multi-error mislabeling is reported for this set (only the 97-sample subset receives expert annotation), leaving the label quality assumption untested at the scale that defines the dataset's utility as a benchmark.
[Experiments] Experiments section (multi-error evaluation): The 97-sample multi-error subset is used to support claims about 'contains' criterion performance, but its small size limits statistical power and generalizability of conclusions about model handling of multiple errors relative to the 48k single-error scale.

minor comments (2)

[Abstract] Abstract: The description of label generation could explicitly state the single-error assumption and its potential failure modes to better frame the contribution.
The paper would benefit from a table summarizing the 14 fine-grained error types with their Python exception mappings for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, acknowledging the limitations raised while clarifying the design choices in our labeling and evaluation approach.

read point-by-point responses

Referee: [Dataset Construction] Dataset Construction section: The central claim that the 48,646 samples supply reliably labeled single-error data rests on Python interpreter output identifying the dominant error. No large-scale expert validation, agreement metrics, or systematic analysis of multi-error mislabeling is reported for this set (only the 97-sample subset receives expert annotation), leaving the label quality assumption untested at the scale that defines the dataset's utility as a benchmark.

Authors: The 48,646 labels are produced by executing each submission in the Python interpreter and recording the first exception raised, which supplies a deterministic, reproducible signal for the error that actually occurs. This method was chosen precisely to scale labeling without relying on subjective human judgment for the full set. We agree that the absence of large-scale expert validation leaves open the possibility that some samples contain multiple errors where only the dominant one is recorded. The 97-sample expert-annotated subset was created specifically to examine multi-error cases separately. In the revision we will add an explicit limitations paragraph discussing the interpreter-based labeling assumptions and the trade-off between scale and expert verification. revision: partial
Referee: [Experiments] Experiments section (multi-error evaluation): The 97-sample multi-error subset is used to support claims about 'contains' criterion performance, but its small size limits statistical power and generalizability of conclusions about model handling of multiple errors relative to the 48k single-error scale.

Authors: The 97-sample subset is intentionally small because expert multi-error annotation is resource-intensive; it functions as a diagnostic probe rather than a statistically powered test set. We already qualify the multi-error results with this scale in mind and do not claim they generalize to the full 48k distribution. In the revision we will strengthen the caveats around statistical power and explicitly frame the multi-error results as preliminary, while noting expansion of this subset as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: dataset construction and evaluations are self-contained

full rationale

The paper introduces a new dataset (PyMETA) whose single-error labels are produced by direct execution against the external Python interpreter and official exception hierarchy; the 97-sample expert subset is used only for diagnostics, not to derive or fit the main labels. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain. Model evaluations are standard benchmarking against external LLMs and finetuned baselines. All load-bearing steps (label generation, taxonomy definition, performance reporting) rest on independent external mechanisms rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Contribution is empirical dataset creation and benchmarking rather than theoretical derivation; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5788 in / 1118 out tokens · 53732 ms · 2026-06-30T04:47:20.023853+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references

[1]

Locate the specific programming problem on the online judge using the Question ID
[2]

Paste the student’s answer into the judge, ob- serve the first error, and increment the corre- sponding error-type count in the annotation spreadsheet
[3]

Fixonlythat single error (referring to the ex- pected answer as a reference), and verify that re-execution no longer produces the same er- ror at the same location
[4]

Repeat steps 2–3 until no explicit runtime or compile-time errors remain
[5]

A.6 Use of AI Assistants AI assistants were used solely for translation and grammar checking during the preparation of this manuscript

If the problem includes test cases, run them: failure to pass all test cases indicates a Logic Error; passing all test cases marks the sample as complete. A.6 Use of AI Assistants AI assistants were used solely for translation and grammar checking during the preparation of this manuscript. No AI-generated content was included in the scientific contributio...

2023
[6]

A.8 Model Performance Evaluation Metrics and Prompt Templates We evaluate model performance usingprecision, recall, andF1-scoreunder two modes:Contains andTop-1

are both Python-based works built on the AOJ ITP1 44 introductory problems. A.8 Model Performance Evaluation Metrics and Prompt Templates We evaluate model performance usingprecision, recall, andF1-scoreunder two modes:Contains andTop-1. Precision (Contains).For each class c that ap- pears in the gold labels: TPc =|{i|c∈y gold i andc∈y pred i }|,(4) FPc =...
[7]

First, see the student's code to check if the code has explicit errors (Syntax Error, Name Error, Type Error, Indentation Error, UnboundLocal Error, Key Error, Index Error, EOF Error, Value Error, Tab Error)
[8]

If there is no explicit errors, see the question description, the student's code, and compare with the expected answer to check the logic of code
[9]

0 No error

If the logic is wrong, there is an logic error; Else, there is no error. 16 17Your goal is to classify the error into a predefined taxonomy of Python error types. Please output only the label number and label name, separated by a space, e.g., "0 No error" or "3 NameError". 18 19## Definition of Error Type Categories (Label number and names) 200: No error ...
[10]

Examine the student's code for explicit errors (Syntax Error, Name Error, Type Error , Indentation Error, Unbound Local Error, Key Error, Index Error, EOF Error, Value Error, Tab Error)
[11]

If no explicit error is found, compare the student's code with the expected answer to check the logic
[12]

Write a function that returns the square of a number

If the logic is incorrect, classify as Logic Error. Otherwise, classify as No Error. 16Use this reasoning process internally to justify the final prediction. 17 18--- 19#Definition of Error Type Categories (Label numbers and names) 200: No Error: The code runs successfully with no explicit or Logic Error. 211: Logic Error: The code has no explicit error b...

[1] [1]

Locate the specific programming problem on the online judge using the Question ID

[2] [2]

Paste the student’s answer into the judge, ob- serve the first error, and increment the corre- sponding error-type count in the annotation spreadsheet

[3] [3]

Fixonlythat single error (referring to the ex- pected answer as a reference), and verify that re-execution no longer produces the same er- ror at the same location

[4] [4]

Repeat steps 2–3 until no explicit runtime or compile-time errors remain

[5] [5]

A.6 Use of AI Assistants AI assistants were used solely for translation and grammar checking during the preparation of this manuscript

If the problem includes test cases, run them: failure to pass all test cases indicates a Logic Error; passing all test cases marks the sample as complete. A.6 Use of AI Assistants AI assistants were used solely for translation and grammar checking during the preparation of this manuscript. No AI-generated content was included in the scientific contributio...

2023

[6] [6]

A.8 Model Performance Evaluation Metrics and Prompt Templates We evaluate model performance usingprecision, recall, andF1-scoreunder two modes:Contains andTop-1

are both Python-based works built on the AOJ ITP1 44 introductory problems. A.8 Model Performance Evaluation Metrics and Prompt Templates We evaluate model performance usingprecision, recall, andF1-scoreunder two modes:Contains andTop-1. Precision (Contains).For each class c that ap- pears in the gold labels: TPc =|{i|c∈y gold i andc∈y pred i }|,(4) FPc =...

[7] [7]

First, see the student's code to check if the code has explicit errors (Syntax Error, Name Error, Type Error, Indentation Error, UnboundLocal Error, Key Error, Index Error, EOF Error, Value Error, Tab Error)

[8] [8]

If there is no explicit errors, see the question description, the student's code, and compare with the expected answer to check the logic of code

[9] [9]

0 No error

If the logic is wrong, there is an logic error; Else, there is no error. 16 17Your goal is to classify the error into a predefined taxonomy of Python error types. Please output only the label number and label name, separated by a space, e.g., "0 No error" or "3 NameError". 18 19## Definition of Error Type Categories (Label number and names) 200: No error ...

[10] [10]

Examine the student's code for explicit errors (Syntax Error, Name Error, Type Error , Indentation Error, Unbound Local Error, Key Error, Index Error, EOF Error, Value Error, Tab Error)

[11] [11]

If no explicit error is found, compare the student's code with the expected answer to check the logic

[12] [12]

Write a function that returns the square of a number

If the logic is incorrect, classify as Logic Error. Otherwise, classify as No Error. 16Use this reasoning process internally to justify the final prediction. 17 18--- 19#Definition of Error Type Categories (Label numbers and names) 200: No Error: The code runs successfully with no explicit or Logic Error. 211: Logic Error: The code has no explicit error b...