Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

Hartwig Grabowski; Michael Canz

arxiv: 2606.08855 · v1 · pith:3VZSA565new · submitted 2026-06-07 · 💻 cs.AI · cs.CV · cs.CY

Pith reviewed 2026-06-27 18:04 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.CY

keywords: hybrid e-assessment · semi-automated grading · handwriting recognition · vision language models · summative assessment · higher education · two-pass validation · paper-based exams

The pith

Vision LLMs with two-pass validation and solution key checks reduce misclassifications in handwritten exam grading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies limits in fully digital exams, such as restricted question formats and legal hurdles for large classes, and proposes keeping paper-based open problems instead. Students enter intermediate results by hand into structured table fields that are later scanned. Vision-capable large language models process these fields, applying two passes of validation and matching outputs to a solution key to cut recognition errors. If successful, this keeps assessments problem-oriented and fair while handling bigger student numbers more efficiently. A sympathetic reader would see value in avoiding the trade-off between authentic testing and practical scale.

Core claim

The central claim is that recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.

What carries the argument

Two-pass validation principle with solution key comparison, used to improve accuracy of vision LLMs on structured handwritten answer fields.
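The abstract names the two-pass principle but does not spell out its decision rule, so the following is only a minimal sketch under stated assumptions: each table field is read twice independently, a reading counts only when both passes agree, and the agreed reading is then compared with the solution key. The names `FieldResult`, `normalise`, and `validate_field`, and the three status labels, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldResult:
    field_id: str
    status: str             # "correct", "mismatch", or "needs_review"
    reading: Optional[str]  # agreed reading, or None when the passes disagree


def normalise(text: str) -> str:
    """Drop all whitespace and lowercase, so trivially different readings compare equal."""
    return "".join(text.split()).lower()


def validate_field(field_id: str, first_pass: str, second_pass: str, key: str) -> FieldResult:
    """Assumed rule: trust a reading only if two independent passes agree on it."""
    a, b = normalise(first_pass), normalise(second_pass)
    if a != b:
        # The two recognition passes disagree: leave the field to a human grader.
        return FieldResult(field_id, "needs_review", None)
    if a == normalise(key):
        return FieldResult(field_id, "correct", a)
    # Both passes agree but differ from the key: a wrong answer or a shared misreading.
    return FieldResult(field_id, "mismatch", a)
```

Whether a `"mismatch"` is graded automatically as wrong or also sent to a human is a policy choice that this sketch leaves open.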

If this is right

Paper-based problem-oriented examination tasks can be retained while grading becomes semi-automated.
Misclassifications in recognition of handwritten answers decrease.
Validity and fairness of summative assessments increase.
Scalability of assessments improves for large student cohorts without full digitization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be adapted to process other structured handwritten inputs outside university exams.
It suggests a pathway for universities to maintain traditional exam formats while gaining automation benefits.
Similar validation steps might apply to other document processing tasks that mix handwriting and structured layouts.

Load-bearing premise

Vision LLMs can achieve reliable handwritten character recognition under realistic exam conditions when the two-pass validation and solution key comparison are applied.

What would settle it

A collection of real student exam papers processed by the system where the rate of character misclassifications remains high even after two-pass validation and solution key comparison.
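One way such a test could be scored is a character error rate over manually transcribed fields. The sketch below assumes that ground-truth transcriptions exist for every field and uses the standard Levenshtein edit distance; the function names and the choice of metric are illustrative assumptions, not something the paper specifies.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """pairs holds (ground_truth, recognised) strings, one pair per answer field."""
    items = list(pairs)
    total_chars = sum(len(truth) for truth, _ in items)
    if total_chars == 0:
        raise ValueError("no ground-truth characters to evaluate")
    errors = sum(edit_distance(truth, rec) for truth, rec in items)
    return errors / total_chars
```

Computing this rate once on raw single-pass output and once on the output that survives validation would show whether the added checks change the error rate at all.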

Figures

Figures reproduced from arXiv: 2606.08855 by Hartwig Grabowski, Michael Canz.

**Figure 2.** Example of a structured examination page: the free-fall derivation

**Figure 3.** Recognition is challenging when letters must be assigned to the correct …

**Figure 4.** The grading workflow can be decomposed into four stages. Using two …

**Figure 5.** Structured examination page created in the usual way with a …

**Figure 6.** Additional example of a structured examination page using …

Original abstract

This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.
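To make the structured answer format concrete, here is a hedged sketch of how captured table fields might be scored against a solution key. The field identifiers, the per-field points, the decimal-comma handling, and the numeric tolerance are all assumptions introduced for illustration; the abstract does not describe the key's format.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class KeyEntry:
    expected: str  # expected content of the table field
    points: float  # points awarded when the field matches (hypothetical scheme)


def parse_number(text: str) -> Optional[float]:
    """Return a float for readings such as '9,81' or '9.81', otherwise None."""
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        return None


def matches(reading: str, expected: str, rel_tol: float = 1e-3) -> bool:
    """Compare numerically when both sides parse as numbers, else as lowercase text."""
    r, e = parse_number(reading), parse_number(expected)
    if r is not None and e is not None:
        return math.isclose(r, e, rel_tol=rel_tol)
    return reading.strip().lower() == expected.strip().lower()


def score_page(readings: Dict[str, str], key: Dict[str, KeyEntry]) -> float:
    """Sum the points of every key field whose captured reading matches."""
    return sum(
        entry.points
        for field_id, entry in key.items()
        if field_id in readings and matches(readings[field_id], entry.expected)
    )
```

For example, with a key of `{"g": KeyEntry("9.81", 1.0), "t": KeyEntry("2", 2.0)}` and readings `{"g": "9,81", "t": "3"}`, only field `g` matches and the page scores 1.0.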

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This conceptual proposal for using vision LLMs to grade structured handwritten exam answers flags real assessment problems, but it supplies no tests or data on whether the recognition step works.

The main point is that the paper outlines a hybrid workflow—paper exams with structured answer fields scanned and processed by vision LLMs plus two-pass checks against a solution key—but treats the reduction in misclassifications as a given rather than something demonstrated.

It does a solid job laying out the practical drawbacks of fully digital or closed-question formats in large cohorts, including didactic narrowing and legal or organizational hurdles. The suggestion to retain open-ended paper tasks while adding machine-readable structure is a direct response to those constraints and draws on current LLM capabilities without claiming new algorithms.

The soft spot is the total lack of evidence. No error rates, no trials on real student scripts, no handling of typical handwriting variation or scan quality, and no comparison to human grading. The claim that the method improves validity and fairness therefore rests on an assumption about reliable character recognition that the manuscript does not test.

This is for people working on assessment practices or edtech tools who want a concrete workflow idea rather than a finished system. It is not yet at the stage where it would change practice or warrant citation.

A serious editor could reasonably send it for review as a proposal piece, provided the authors are expected to add at least pilot data on recognition accuracy before acceptance.

Referee Report

1 major / 0 minor

Summary. The paper proposes a hybrid e-assessment approach for higher education summative examinations that retains paper-based, problem-oriented tasks while enabling semi-automated grading. Students handwrite structured intermediate results in table fields; these are captured and processed by vision-capable large language models using a two-pass validation principle and comparison against a solution key. The central claim is that this combination can overcome the recognition bottleneck, reduce misclassifications, and thereby improve validity, fairness, and scalability compared to fully digital or partially digital alternatives.

Significance. If the proposed recognition method proves reliable under realistic exam conditions, the hybrid workflow could meaningfully address didactic narrowing and scalability issues in large-cohort assessments while preserving authentic problem-solving formats. The paper correctly identifies practical constraints (organizational, technical, legal) that current e-assessment systems face. However, because the manuscript contains no empirical results, the significance remains prospective rather than demonstrated.

major comments (1)

[Abstract] Abstract and proposed workflow: The assertion that vision LLMs plus two-pass validation and solution-key comparison 'can reduce misclassifications' is presented as the solution to the central technical bottleneck, yet the manuscript supplies no quantitative evidence—no error rates, no test sets of real student handwriting, no comparison against human graders, and no evaluation under realistic scanning or answer-density conditions. This untested assumption is load-bearing for the entire proposal.
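The missing comparison against human graders could, for instance, be reported as chance-corrected agreement on per-field verdicts. The sketch below uses Cohen's kappa, a standard agreement statistic; the choice of this metric and the label values are assumptions, since the manuscript proposes no evaluation protocol.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], machine: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters who labelled the same items."""
    if len(human) != len(machine) or len(human) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(human)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Expected agreement if the raters labelled independently with their own label rates.
    h_counts, m_counts = Counter(human), Counter(machine)
    expected = sum(h_counts[label] * m_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 would indicate that the semi-automated verdicts track the human ones closely; a value near 0 would mean agreement no better than chance.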

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We agree that the manuscript presents a conceptual proposal rather than an empirical study, and that the abstract's phrasing requires clarification to avoid implying demonstrated performance. We will revise accordingly.

Referee: [Abstract] Abstract and proposed workflow: The assertion that vision LLMs plus two-pass validation and solution-key comparison 'can reduce misclassifications' is presented as the solution to the central technical bottleneck, yet the manuscript supplies no quantitative evidence—no error rates, no test sets of real student handwriting, no comparison against human graders, and no evaluation under realistic scanning or answer-density conditions. This untested assumption is load-bearing for the entire proposal.

Authors: We acknowledge that the manuscript is a design proposal without empirical evaluation of the recognition accuracy. The statement in the abstract describes the intended mechanism of the two-pass validation and solution-key comparison rather than a validated outcome. We will revise the abstract and introduction to explicitly frame the reduction in misclassifications as a hypothesis to be tested in subsequent empirical work, and to clarify that the current contribution lies in the overall hybrid workflow design and its potential to address didactic and scalability constraints.

Revision: yes

Circularity Check

0 steps flagged

No circularity; conceptual proposal without derivations or self-referential claims

The manuscript is a high-level conceptual proposal for hybrid e-assessment. It identifies limitations of existing approaches and suggests a workflow using vision LLMs with two-pass validation and solution-key comparison. No equations, fitted parameters, derivations, or load-bearing self-citations appear in the provided text. The central claim is stated as a hypothesis rather than derived from internal inputs, making the paper self-contained as a non-mathematical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests primarily on an untested domain assumption about LLM capabilities for handwriting recognition rather than on free parameters or new entities.

axioms (1)

Domain assumption: Vision-capable LLMs can achieve reliable handwritten character recognition in realistic exam conditions when combined with two-pass validation against a solution key.
This is invoked as the solution to the central technical bottleneck without supporting evidence or prior validation cited in the abstract.

pith-pipeline@v0.9.1-grok · 5672 in / 1240 out tokens · 25861 ms · 2026-06-27T18:04:18.189044+00:00 · methodology
