pith · machine review for the scientific record

arxiv: 2604.22774 · v1 · submitted 2026-04-01 · 💻 cs.CY · cs.AI · cs.CV · cs.LG

Recognition: no theorem link

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:39 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CV · cs.LG
keywords handwritten math OCR · vision-language models · over-correction · semantic evaluation · multi-line solutions · educational AI · rubric grading · FERMAT dataset

The pith

VLMs often rewrite students' handwritten math solutions instead of transcribing them, hiding their errors; PINK is a new metric that penalizes this over-correction and aligns better with human judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluation of handwritten math OCR relies on lexical metrics like BLEU that ignore semantic accuracy across multiple lines of student work. The paper shows that vision-language models routinely fix apparent errors in the solutions they transcribe rather than reproducing the student's original steps. This over-correction conceals exactly the mistakes that educational systems need to detect and address. The authors introduce PINK, which uses an LLM to apply rubric-based grading while subtracting credit for any corrections the model introduced. On the FERMAT dataset, this produces major ranking changes among 15 VLMs, and human experts prefer PINK's scores 55.0 percent of the time versus 39.5 percent for BLEU.
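
To make the lexical blind spot concrete, here is a minimal sketch (not taken from the paper) using NLTK's sentence-level BLEU on whitespace-tokenized LaTeX: a mathematically equivalent rewrite scores poorly, while a transcription with a single critical sign error can still score highly. All example strings are invented for illustration.

```python
# Illustration only: BLEU rewards surface overlap, not mathematical meaning.
# Requires nltk (pip install nltk); the example strings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def latex_bleu(reference: str, hypothesis: str) -> float:
    """Sentence BLEU over naive whitespace tokens of two LaTeX lines."""
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

student    = r"x^{2} - 4 x + 4 = 0 \\ ( x - 2 )^{2} = 0 \\ x = 2"
equivalent = r"x**2 - 4*x + 4 = 0 \\ (x - 2)**2 = 0 \\ x = 2"       # same math, different surface form
sign_flip  = r"x^{2} - 4 x + 4 = 0 \\ ( x + 2 )^{2} = 0 \\ x = 2"   # one critical token changed

print(latex_bleu(student, equivalent))  # typically low: notation differs, meaning identical
print(latex_bleu(student, sign_flip))   # typically high: one wrong token, meaning broken
```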

Core claim

Vision-language models exhibit a systematic failure mode of over-correction when transcribing multi-line handwritten mathematics: instead of faithfully reproducing a student's work, they rewrite expressions to remove errors, thereby masking the very reasoning flaws an assessment must identify. PINK (Penalized INK-based score) counters this by using an LLM to perform rubric-based semantic grading that explicitly deducts for over-correction, producing scores that align more closely with human experts than lexical baselines and reversing model rankings on the FERMAT dataset.

What carries the argument

PINK (Penalized INK-based score), a semantic metric that applies LLM-driven rubric grading to multi-line transcriptions and subtracts points for any semantic or notational changes the model introduces relative to the original student work.
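
The full scoring pipeline is described in the paper (and sketched in Figure 2); below is a minimal, hypothetical rendering of the penalty structure only, assuming the LLM grader returns per-rubric-item scores plus a flag for items where the transcription "fixed" the student's work. The item names echo the auto-grading rubric excerpted in the figures, but the data class, penalty weight, and clipping are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str
    score: int            # 0-20 per item, following the auto-grading rubric shown in Figure 16
    over_corrected: bool   # flagged when the transcription "fixed" the student's work on this item

def pink_score(items: list[RubricItem], penalty_per_item: float = 10.0) -> float:
    """Hypothetical sketch of a penalized rubric score: sum the grader's per-item
    scores, then subtract a fixed penalty for every item flagged as over-corrected.
    The real PINK penalty form and weights are defined in the paper, not here."""
    raw = sum(item.score for item in items)
    penalty = penalty_per_item * sum(item.over_corrected for item in items)
    return max(0.0, raw - penalty)

# Toy example: a transcription that silently fixed the student's final two steps.
items = [
    RubricItem("Core Formula/Principle Identification", 20, False),
    RubricItem("Condition/Boundary Application",         18, False),
    RubricItem("Calculation Accuracy (Early-Mid)",       17, False),
    RubricItem("Calculation Accuracy (Late)",            20, True),   # error was "fixed"
    RubricItem("Final Answer",                           20, True),   # error was "fixed"
]
print(pink_score(items))  # raw 95 drops to 75 after the over-correction penalty
```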

If this is right

  • Models such as GPT-4o receive sharply lower scores under PINK than under BLEU because of their tendency to rewrite student steps.
  • Gemini 2.5 Flash ranks highest for faithful transcription once over-correction is penalized.
  • Educational AI systems that rely on VLM OCR will miss student misconceptions unless the evaluation metric penalizes rewriting.
  • Lexical metrics alone are insufficient for multi-line math because they cannot distinguish helpful cleanup from error concealment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training VLMs with an auxiliary objective that rewards exact reproduction of visible strokes rather than semantic cleanup could reduce over-correction at the source (a toy sketch of such an objective follows this list).
  • The same penalization approach could be adapted to other faithful-transcription tasks such as handwritten code or diagram labeling.
  • Deploying PINK in live educational platforms would surface student errors that current VLM pipelines suppress.
  • Extending the rubric inside PINK to include step-by-step logical validity could further tighten alignment with actual learning outcomes.
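
Purely as an illustration of the first bullet above, and not anything proposed in the paper, one way to bias training toward faithful reproduction is to upweight the transcription loss exactly where a cleaned-up rendering would diverge from what the student actually wrote. The PyTorch-style sketch below assumes pre-aligned token sequences; all names and the weighting scheme are invented.

```python
import torch
import torch.nn.functional as F

def faithfulness_weighted_loss(logits: torch.Tensor,
                               faithful_ids: torch.Tensor,
                               corrected_ids: torch.Tensor,
                               upweight: float = 3.0) -> torch.Tensor:
    """Hypothetical auxiliary objective: next-token cross-entropy against the
    verbatim student transcript, with extra weight on positions where a
    semantically 'corrected' transcript disagrees with what the student wrote.
    Those are exactly the positions an over-correcting model tends to rewrite.

    logits:        (seq_len, vocab_size) decoder outputs
    faithful_ids:  (seq_len,) token ids of the verbatim student transcript
    corrected_ids: (seq_len,) token ids of a cleaned-up version (assumed
                   pre-aligned to the same length for this sketch)
    """
    per_token = F.cross_entropy(logits, faithful_ids, reduction="none")
    weights = torch.where(faithful_ids != corrected_ids,
                          torch.full_like(per_token, upweight),
                          torch.ones_like(per_token))
    return (weights * per_token).mean()
```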

Load-bearing premise

The LLM that performs the rubric grading inside PINK itself avoids over-correction or other biases when judging semantic fidelity.

What would settle it

A controlled human study in which experts rate the same set of transcriptions and find that PINK scores do not correlate more strongly with their judgments of faithfulness than BLEU scores do.
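
A minimal sketch of that comparison, assuming one expert faithfulness rating and one score per metric for each transcription; all numbers are placeholders, and a real study would also need a formal test of the difference between the two correlations (and per-expert agreement).

```python
# Sketch of the settling experiment: do PINK scores track expert faithfulness
# ratings more closely than BLEU scores do? All numbers below are placeholders.
import numpy as np
from scipy.stats import kendalltau

expert = np.array([9, 2, 7, 4, 8, 3, 6, 5])                 # expert faithfulness ratings (0-10)
pink   = np.array([88, 25, 70, 41, 79, 30, 64, 55])          # placeholder PINK scores
bleu   = np.array([0.72, 0.65, 0.70, 0.69, 0.74, 0.66, 0.71, 0.68])  # placeholder BLEU scores

tau_pink, p_pink = kendalltau(expert, pink)
tau_bleu, p_bleu = kendalltau(expert, bleu)
print(f"PINK vs experts: tau={tau_pink:.2f} (p={p_pink:.3f})")
print(f"BLEU vs experts: tau={tau_bleu:.2f} (p={p_bleu:.3f})")
# The paper's claim would be undermined if tau_bleu matched or exceeded tau_pink
# on a properly powered sample, with the difference tested formally.
```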

Figures

Figures reproduced from arXiv: 2604.22774 by Jin Seong, Jong-hun Shin, Minho Kim, Soojong Lim, Wencke Liermann.

Figure 1
Figure 1: BLEU’s Blind Spots in Multi-line Handwritten Math OCR. (a) BLEU penalizes semantically equivalent expressions, failing to capture their mathematical meaning. (b) BLEU assigns a high score even when the OCR misses a key part of the solution, treating a critical omission as insignificant. These cases highlight BLEU’s lexical limitation and motivate PINK, a semantic and recognition-aware metric that additi… view at source ↗
Figure 2
Figure 2: Overview of the proposed PINK scoring system. Step 1: a VLM OCR output over-corrects a student error. Step 2: the auto-grading system assigns a seemingly high score. Step 3: comparison with the oracle score reveals an over-correction event. Step 4: penalties are applied to compute the final PINK score. …level PINK scores (r=0.99, τ=0.90). Further analyses confirming stability across repeated runs and promp… view at source ↗
Figure 3
Figure 3: Distribution of over-correction instances across rubric items for 15 models, sorted by Penalized Score (Top is Best). Over-corrections are heavily concentrated in the final two stages, particularly in higher-performing models, consistent with autoregressive context accumulation overriding visual evidence. Where and When Does Over-Correction Occur? Our analysis reveals that over-correction is not a rando… view at source ↗
Figure 5
Figure 5: Ranking Reversal of VLMs. The leaderboard based on conventional BLEU scores (left) is dramatically overturned when re-evaluated with our PINK metric (right), which penalizes over-correction. Models like GPT-4o drop significantly, while Gemini 2.5 Flash ascends nine places to the top rank, showing that existing metrics miss the crucial dimension of faithfulness. …correction. The results in… view at source ↗
Figure 6
Figure 6: Over-Correction Rate: Comparison between GPT-4o (red) and Gemini 2.5 Flash (green). …logically consistent transcriptions that nevertheless severely distort the student’s original work. For completeness, … view at source ↗
Figure 7
Figure 7: Human evaluation results. (a) Experts showed a clear overall preference for PINK over BLEU. (b) PINK was consistently preferred across all score ranges, demonstrating its robustness. …university-level mathematical knowledge and an average of over two years of experience in grading mathematics assignments. The evaluation was performed on 200 samples extracted from the FERMAT dataset, transcribed by Ovis2-8… view at source ↗
Figure 8
Figure 8: Ranking stability under different penalty… view at source ↗
Figure 9
Figure 9: Model ranking trajectories under varying… view at source ↗
Figure 11
Figure 11 illustrates the interface used for our human expert study. The evaluation protocol was designed to simulate a realistic grading scenario while ensuring blind comparison: 1. Simulated Grading Scenario: Annotators are presented with a math problem and a “Student Solution.” Although this text is the VLM’s OCR transcription, it is presented as a student’s answer. Annotators are instructed to act as math teac… view at source ↗
Figure 12
Figure 12: Case Study of Over-Correction. (Top) Simple-Level: InternVL3-8B locally fixes the student’s error (sin → tan), inflating the score from 34 to 99. (Bottom) Critical-Level: GPT-4o ignores the student’s erroneous reasoning steps and provides a correct summary, inflating the score from 14 to 100. …auto-graded scores: they fail to distinguish between faithful transcription and cases where the model quietly “fix… view at source ↗
Figure 13
Figure 13: Visualizing Attention Maps and Penalty Application. We visualize the attention mechanism at the moment of generating the target token, conditioned on the preceding integral context. (Top) InternVL3-8B generates the mathematically correct ‘tan’ (Over-Correction). The attention map shows no activation on the handwritten ‘sin’, verifying that the contextual pressure caused the model to treat visual evidence… view at source ↗
Figure 14
Figure 14: Over-correction frequency across all 15 tested VLM models. view at source ↗
Figure 15
Figure 15: Ranking changes after applying the over-correction penalty. The PINK score leads to a significant reordering of models, highlighting the impact of penalizing unfaithful corrections. …std ≤ 0.006, confirming that our metric is robust to prompt wording. F Prompt-Based Mitigation Attempt: To investigate whether over-correction can be reduced at inference time, we augment the OCR prompt with explicit faithful… view at source ↗
Figure 16
Figure 16: Auto-Grading Prompt. view at source ↗
Figure 17
Figure 17: Auto-Labeling Prompt. view at source ↗
Figure 18
Figure 18: OCR Prompt. OCR Prompt (Mitigated Ver.). System Prompt / User Prompt: You are a math assistant specializing in extracting mathematical question and answer content from handwritten images of math problems by middle or high school students. Your task is to analyze the given Image, which contains the handwritten math Question-Answer pair, and convert it into LaTeX format. Follow the specific instructions provid… view at source ↗
Figure 19
Figure 19: Mitigated OCR Prompt. view at source ↗
read the original abstract

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that VLMs exhibit over-correction when transcribing multi-line handwritten math solutions (hiding student errors), introduces PINK as an LLM-based rubric grading metric that explicitly penalizes this behavior, reports ranking reversals versus BLEU on the FERMAT dataset (e.g., GPT-4o heavily penalized while Gemini 2.5 Flash ranks highest), and shows human experts prefer PINK (55.0%) over BLEU (39.5%).

Significance. If the central claims hold after addressing the grader validation, the work would be significant for educational AI: it identifies a previously under-studied failure mode in multi-line math OCR and supplies a semantic metric that better matches human judgment than lexical baselines, with potential to improve automated assessment of student reasoning.

major comments (2)
  1. [PINK metric and evaluation protocol] The PINK metric definition relies on an LLM for rubric-based grading to detect semantic correctness and penalize over-correction, but the manuscript provides no validation that this grader avoids the same over-correction bias demonstrated for VLMs (which share core architectures with the grader LLM); this assumption is load-bearing for the reported ranking reversals and human-preference results.
  2. [Human studies] The human expert preference study (55.0% for PINK vs. 39.5% for BLEU) is presented without details on participant count, annotation protocol, inter-rater reliability, or statistical significance testing; these omissions prevent assessment of whether the alignment claim is robust.
minor comments (1)
  1. [Abstract and dataset description] The composition, size, and diversity of the FERMAT dataset (number of multi-line samples, error types, student demographics) are not described, which would strengthen the generalizability claims for the 15-VLM evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive review. Their comments have helped us identify areas where the manuscript can be strengthened. We address each major comment below.

read point-by-point responses
  1. Referee: [PINK metric and evaluation protocol] The PINK metric definition relies on an LLM for rubric-based grading to detect semantic correctness and penalize over-correction, but the manuscript provides no validation that this grader avoids the same over-correction bias demonstrated for VLMs (which share core architectures with the grader LLM); this assumption is load-bearing for the reported ranking reversals and human-preference results.

    Authors: We thank the referee for pointing out this important consideration. The PINK metric is designed with explicit instructions in the LLM prompt to detect and penalize over-correction by comparing the transcription against the original student work for semantic fidelity. However, we acknowledge that without explicit validation of the grader, the results may be questioned. In the revised version, we will add a new subsection validating the LLM grader by having it evaluate a set of synthetic over-correction examples and reporting agreement with human experts; a sketch of one plausible validation design follows these responses. This will support the robustness of the reported rankings and preferences. revision: yes

  2. Referee: [Human studies] The human expert preference study (55.0% for PINK vs. 39.5% for BLEU) is presented without details on participant count, annotation protocol, inter-rater reliability, or statistical significance testing; these omissions prevent assessment of whether the alignment claim is robust.

    Authors: We apologize for not including sufficient details on the human study in the original submission. We will revise the manuscript to provide a full description of the participant recruitment (number of experts), the exact annotation protocol used for the preference study, calculations for inter-rater reliability, and the statistical tests applied to the preference percentages. These additions will allow for a better assessment of the claim's robustness. revision: yes
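
One plausible shape for the grader validation promised above, sketched under assumptions that are not in the manuscript: construct synthetic pairs in which the "transcription" deliberately fixes a known student error, have the LLM grader and human experts independently label each pair as over-corrected or faithful, and report chance-corrected agreement. The labels below are placeholders for illustration.

```python
# Hypothetical validation of the LLM grader: agreement with human experts on
# synthetic over-correction labels. All labels below are placeholders.
from sklearn.metrics import cohen_kappa_score

# 1 = "over-correction detected", 0 = "faithful transcription"
human_labels  = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
grader_labels = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, grader_labels)
accuracy = sum(h == g for h, g in zip(human_labels, grader_labels)) / len(human_labels)
print(f"raw agreement = {accuracy:.2f}, Cohen's kappa = {kappa:.2f}")
# High kappa on examples engineered to contain over-correction would support the
# assumption that the grader itself does not share the VLMs' bias.
```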

Circularity Check

0 steps flagged

No circularity: PINK is an independently defined LLM-rubric metric evaluated against human judgments

full rationale

The paper defines PINK explicitly as a new semantic metric that applies LLM-based rubric grading to penalize detected over-correction in VLM transcriptions of multi-line math. No equations or derivations reduce the metric to its own outputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing claims rest on self-citations whose content is unverified. The reported ranking reversals and human preference (55% vs 39.5%) are presented as empirical outcomes of applying the metric, not as tautological consequences of its definition. The potential bias of the LLM grader itself is a separate correctness concern, not a circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim relies on the assumption that over-correction is a measurable failure mode and that LLM-based rubric grading is valid for semantic assessment.

axioms (1)
  • domain assumption: an LLM can perform reliable rubric-based grading of math solutions
    PINK relies on the LLM for semantic evaluation.

pith-pipeline@v0.9.0 · 5569 in / 1095 out tokens · 31417 ms · 2026-05-13T22:39:36.964973+00:00 · methodology

discussion (0)


    Classify based on what you found: - If only format differences found → "LaTeX format difference only" - If only recognition errors found → "Recognition error only" - If both found → "Both LaTeX and recognition errors" - If neither found → "No errors" Format Difference Categories: - Math Mode Difference: $ x $ vs \( x \) vs x - Spacing Difference: \quad vs...