
arxiv: 2605.19043 · v1 · pith:3TR7R62Hnew · submitted 2026-05-18 · 💻 cs.CY · cs.AI · cs.HC

Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs

Pith reviewed 2026-05-20 07:31 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.HC
keywords automated grading · handwritten mathematics · vision language models · rubric evaluation · transcription errors · educational assessment · STEM courses

The pith

Vision LLMs grade handwritten math with high accuracy when transcription and rubric evaluation happen in one step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision-capable large language models can grade real student handwritten mathematics solutions using instructor rubrics. It runs the models on photographic submissions from two university STEM courses and measures agreement with human grades at the level of individual rubric items. Accuracy is high overall. Most mistakes (87 percent for the best-performing model) come from the model failing to read or transcribe what is written rather than from misapplying the rubric. This distinction matters because it points to a clear path for improvement: better image interpretation.

Core claim

When a single LLM call performs both transcription of the handwritten work and rubric-based scoring, overall grading accuracy against human ground truth is high, and the large majority of errors in the best model trace to transcription failures rather than to incorrect rubric application. The study also identifies recurring failure patterns such as poor image quality, hallucinated mathematical content, and inconsistent treatment of mathematically equivalent expressions.
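The bookkeeping behind this claim can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' analysis code: the record layout (`ai`, `human`, `cause`) and the function name `summarize` are hypothetical, and the sample data is invented purely to exercise the function. Following the paper's definitions, the transcription-error (TE) and rubric-application-error (RAE) shares are taken over rubric items where the AI and human grades disagree.

```python
from collections import Counter


def summarize(items):
    """Summarize rubric-item agreement between AI and human grades.

    Each item is a dict with (hypothetical) keys:
      'ai', 'human': bool, whether the rubric item was awarded
      'cause': 'TE', 'RAE', or None when the two grades agree
    """
    if not items:
        raise ValueError("need at least one rubric item")
    # A disagreement is any rubric item where AI and human decisions differ.
    errors = [it for it in items if it["ai"] != it["human"]]
    causes = Counter(it["cause"] for it in errors)
    n_err = len(errors)
    return {
        "accuracy": (len(items) - n_err) / len(items),
        "te_share": causes["TE"] / n_err if n_err else 0.0,
        "rae_share": causes["RAE"] / n_err if n_err else 0.0,
    }


# Invented example: 10 rubric items, 3 disagreements (2 TE, 1 RAE).
sample = (
    [{"ai": True, "human": True, "cause": None}] * 7
    + [{"ai": False, "human": True, "cause": "TE"}] * 2
    + [{"ai": True, "human": False, "cause": "RAE"}]
)
print(summarize(sample))
```

Note that the two shares are fractions of the disagreements, not of all rubric items, so a high TE share is compatible with a high overall accuracy.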

What carries the argument

A single LLM call that jointly transcribes the handwritten solution and applies the instructor-defined rubric to produce a grade.
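To make the single-call design concrete, here is a minimal sketch of how one request could ask for both a transcription and per-item rubric decisions. Everything in it is an assumption for illustration: the prompt wording, the JSON reply schema, and the helper names `build_prompt` and `parse_reply` are hypothetical, and the paper's actual prompt templates are not reproduced here. Attaching the photograph and sending the request depend on the model provider's API and are deliberately left out.

```python
import json


def build_prompt(question_text, rubric_items):
    """Build one prompt that requests transcription and rubric decisions together."""
    lines = [
        "You are grading a photographed handwritten solution.",
        "First transcribe the student's work, then judge each rubric item.",
        "Question:",
        question_text,
        "Rubric items:",
    ]
    lines += [f"{i}. {text}" for i, text in enumerate(rubric_items, start=1)]
    # Hypothetical reply schema; a real deployment would define its own.
    lines.append(
        'Reply with JSON: {"transcription": str, "items": [{"id": int, "met": bool}]}'
    )
    return "\n".join(lines)


def parse_reply(reply_text, n_items):
    """Parse the (assumed) JSON reply and check every rubric item is covered."""
    data = json.loads(reply_text)
    met = {entry["id"]: bool(entry["met"]) for entry in data["items"]}
    missing = [i for i in range(1, n_items + 1) if i not in met]
    if missing:
        raise ValueError(f"reply omitted rubric items: {missing}")
    return data["transcription"], [met[i] for i in range(1, n_items + 1)]
```

Keeping the transcription in the reply, rather than only the rubric decisions, is what makes a later error analysis possible: a reviewer can compare the transcription against the image before judging the rubric decisions.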

If this is right

  • Because transcription accounts for most errors, targeted improvements in image handling or prompt engineering should raise end-to-end accuracy.
  • Recurring failure types such as hallucinations and mishandling of equivalent expressions give concrete targets for prompt refinement.
  • The rubric-based pipeline previously used for typed responses now extends to photographic handwritten submissions by folding transcription into the same LLM call.
  • The error breakdown supplies practical guidance for deciding when to deploy the system and when to keep human review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If future models or preprocessing steps raise transcription reliability, rubric-application accuracy would become the dominant remaining limit.
  • The same pipeline could be tested on handwritten work in non-STEM subjects where visual or symbolic content is central.
  • A hybrid system that flags low-confidence transcriptions for human inspection would likely reduce unfair grades while preserving scale.
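The hybrid idea in the last bullet can be sketched as a simple routing rule. This is an editorial illustration, not something the paper implements: the `confidence` field, the `submission_id` key, the 0.8 threshold, and the function name `triage` are all hypothetical, and how a trustworthy transcription confidence would be obtained is left open.

```python
def triage(results, threshold=0.8):
    """Split graded submissions into auto-accepted and human-review queues.

    Each result is a dict with hypothetical keys 'submission_id' and
    'confidence' (a transcription confidence in [0, 1]).
    """
    auto, review = [], []
    for r in results:
        # Anything below the threshold goes to a human grader.
        target = auto if r["confidence"] >= threshold else review
        target.append(r["submission_id"])
    return auto, review
```

Lowering the threshold trades reviewer time for a higher chance that a mistranscribed solution is graded automatically, so the value would need to be tuned against observed error rates.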

Load-bearing premise

Human-assigned grades serve as reliable ground truth and errors can be cleanly and unambiguously separated into transcription problems versus rubric-application problems.

What would settle it

Re-run the grading pipeline after replacing the LLM's transcription step with an independent high-accuracy OCR system and measure whether the remaining rubric-application errors drop below the current level.

Figures

Figures reproduced from arXiv: 2605.19043 by Craig Zilles, Jacob Levine, Mariana Silva, Matthew West, Miguel Aenlle.

Figure 1: Example question HTML and reference answers in the LLM prompt, with randomized parameters and corresponding correct values highlighted.
Excerpt from the surrounding text: … were asked to compute the projection of one vector onto another. In the second question, students analyzed the planar motion of a rigid body. In both courses, students were able to preview their submission to ensure readability and to crop the image to include only the wor…
Figure 2: Example of student submission and rubric for Course 2.
Excerpt from the surrounding text:
  • Transcription Error (TE): The proportion of rubric items with AI–human disagreement associated with transcription errors, including missing text that would satisfy a rubric item, hallucinated text, or mistranscribed expressions that render a correct solution incorrect.
  • Rubric Application Error (RAE): The proportion of rubric items with AI–human dis…
Figure 3: Examples of submissions with LLM transcription issues due to blurry and rotated images.
Excerpt from the surrounding text: GPT models showed lower accuracy with more rubric-related errors (28% vs. 13% for Gemini 3 Flash) and near parity between false positives and negatives, partly because rubric errors tended toward false negatives. Gemini 3 Flash’s accuracy appears sufficient for low-stakes, formative contexts (e.g., practice problems),…
Original abstract

Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier due to the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader for handwritten mathematical work using instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call, evaluating on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors -- 87% in the best model -- attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and limitations of LLM-based grading for handwritten mathematics, providing guidance for system design, prompt refinement, and deployment in educational settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates vision-capable LLMs for grading handwritten mathematics by integrating transcription and rubric-based assessment into a single LLM call. It tests the approach on real student submissions from two university STEM courses, compares AI rubric-item decisions against human ground truth, reports high overall accuracy, and states that 87% of errors in the best model stem from transcription failures rather than rubric misapplication. Common error modes (image quality, hallucinations, equivalent expressions) are categorized with implications for prompt design and deployment.

Significance. If the central empirical claims hold, the work provides timely evidence that current vision LLMs can support scalable grading of complex handwritten STEM work once transcription succeeds, thereby focusing future improvements on vision components rather than reasoning or rubric logic. Use of authentic student data and instructor rubrics increases ecological validity over synthetic benchmarks.

major comments (2)
  1. [Abstract and Evaluation section] the claim that 87% of errors are attributable to transcription failures rather than rubric misapplication is load-bearing for the paper's main conclusion, yet no protocol, decision criteria, or inter-rater reliability measure for this post-hoc partitioning is described. Because transcription and grading occur inside one LLM call, the classification risks subjectivity without explicit documentation.
  2. [Methods/Evaluation section] sample sizes (number of submissions, rubric items, and students per course), exact model versions, prompt templates, and any statistical tests or confidence intervals for the accuracy figures are not reported. These details are required to evaluate whether the observed accuracy and error distribution are statistically reliable.
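As an illustration of the kind of interval the second comment asks for, the sketch below computes a Wilson score interval for a rubric-item accuracy. It is one reasonable choice among several and is not taken from the paper; the function name `wilson_interval` is hypothetical, and the counts in the usage line are invented because the manuscript's sample sizes are not reported here. The interval also treats rubric items as independent, which is only an approximation when several items come from the same submission.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of rubric items where AI and human grades agree
    n:         total number of rubric items
    z:         normal quantile (1.96 gives an approximate 95% interval)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Invented counts for illustration only: 90 agreements out of 100 items.
low, high = wilson_interval(90, 100)
print(f"accuracy 0.90, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

The Wilson form is used here instead of the simpler normal approximation because it behaves better when accuracy is close to 1, which is the regime the paper reports.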
minor comments (2)
  1. [Abstract] the specific LLMs tested and the two courses could be named to give readers immediate context.
  2. [Error analysis] a table or set of concrete examples illustrating each categorized error mode (e.g., hallucinated content) would improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's feedback, which identifies key areas for improving the transparency and completeness of our work. We will make the suggested revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] the claim that 87% of errors are attributable to transcription failures rather than rubric misapplication is load-bearing for the paper's main conclusion, yet no protocol, decision criteria, or inter-rater reliability measure for this post-hoc partitioning is described. Because transcription and grading occur inside one LLM call, the classification risks subjectivity without explicit documentation.

    Authors: We thank the referee for highlighting this important aspect of our error analysis. We will revise the manuscript to include a clear description of the protocol used for partitioning errors. Specifically, we manually compared the LLM's output transcription to the original handwritten image to identify transcription failures. If the transcription was accurate but the rubric application was incorrect, it was classified as a rubric misapplication. We will provide the decision criteria, example cases from our dataset, and note that this was done by the research team with discussion to resolve ambiguities. We acknowledge the potential for subjectivity and will discuss this limitation in the revised version. revision: yes

  2. Referee: [Methods/Evaluation section] sample sizes (number of submissions, rubric items, and students per course), exact model versions, prompt templates, and any statistical tests or confidence intervals for the accuracy figures are not reported. These details are required to evaluate whether the observed accuracy and error distribution are statistically reliable.

    Authors: We will update the Methods and Evaluation sections to report all requested details. This includes the number of submissions, rubric items, and students per course; the precise versions of the vision-capable LLMs employed; the complete prompt templates used in the experiments; and statistical analyses such as confidence intervals or significance tests for the reported accuracy metrics to demonstrate the reliability of our findings. revision: yes
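The first response describes error partitioning by team discussion, while the referee asked for an inter-rater reliability measure. One common way to report that, sketched here under assumptions, is Cohen's kappa over the labels two annotators assign independently to the same disagreements. The two-annotator setup, the label strings, and the function name `cohens_kappa` are hypothetical; the rebuttal does not commit to this statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    labels_a, labels_b: equal-length sequences of category labels,
    for example 'TE' (transcription error) or 'RAE' (rubric application error).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        count_a[c] * count_b[c] for c in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because one label (TE) is much more frequent than the other.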

Circularity Check

0 steps flagged

No circularity: empirical evaluation grounded in external human ground truth

Full rationale

The paper reports an empirical evaluation of an LLM grader on real student handwritten submissions from two university courses. Accuracy and error attributions (including the 87% transcription figure) are obtained by direct comparison to human-assigned ground truth at the rubric-item level, with post-hoc categorization of observed errors. No equations, parameter fitting, self-definitional derivations, or load-bearing self-citations appear in the abstract or described pipeline that would reduce any result to prior author inputs by construction. The work extends a prior pipeline but the central claims rest on new data and external benchmarks rather than any reduction to fitted or self-referential quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical derivations, free parameters, or new theoretical entities; claims rest entirely on observational data from student work and human comparisons.

pith-pipeline@v0.9.0 · 5717 in / 999 out tokens · 31891 ms · 2026-05-20T07:31:14.889557+00:00 · methodology

