pith · machine review for the scientific record

arxiv: 2604.22774 · v1 · submitted 2026-04-01 · 💻 cs.CY · cs.AI · cs.CV · cs.LG

Recognition: no theorem link

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:39 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CV · cs.LG
keywords handwritten math OCR · vision-language models · over-correction · semantic evaluation · multi-line solutions · educational AI · rubric grading · FERMAT dataset

The pith

VLMs often rewrite students' handwritten math solutions instead of transcribing them, hiding their errors; PINK is a new metric that penalizes this over-correction and aligns better with human judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluation of handwritten math OCR relies on lexical metrics like BLEU that ignore semantic accuracy across multiple lines of student work. The paper shows that vision-language models routinely fix apparent errors in the solutions they transcribe rather than reproducing the student's original steps. This over-correction conceals exactly the mistakes that educational systems need to detect and address. The authors introduce PINK, which uses an LLM to apply rubric-based grading while subtracting credit for any corrections the model introduced. On the FERMAT dataset, this produces major ranking changes among 15 VLMs, and human experts prefer PINK's scores 55.0 percent of the time versus 39.5 percent for BLEU.
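
To make the lexical blind spot concrete, here is a minimal sketch (not taken from the paper) using NLTK's sentence-level BLEU on whitespace-tokenized LaTeX: a mathematically equivalent rewrite scores poorly, while a transcription with a single critical sign error can still score highly. All example strings are invented for illustration.

```python
# Illustration only: BLEU rewards surface overlap, not mathematical meaning.
# Requires nltk (pip install nltk); the example strings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def latex_bleu(reference: str, hypothesis: str) -> float:
    """Sentence BLEU over naive whitespace tokens of two LaTeX lines."""
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

student    = r"x^{2} - 4 x + 4 = 0 \\ ( x - 2 )^{2} = 0 \\ x = 2"
equivalent = r"x**2 - 4*x + 4 = 0 \\ (x - 2)**2 = 0 \\ x = 2"       # same math, different surface form
sign_flip  = r"x^{2} - 4 x + 4 = 0 \\ ( x + 2 )^{2} = 0 \\ x = 2"   # one critical token changed

print(latex_bleu(student, equivalent))  # typically low: notation differs, meaning identical
print(latex_bleu(student, sign_flip))   # typically high: one wrong token, meaning broken
```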

Core claim

Vision-language models exhibit a systematic failure mode of over-correction when transcribing multi-line handwritten mathematics: instead of faithfully reproducing a student's work, they rewrite expressions to remove errors, thereby masking the very reasoning flaws an assessment must identify. PINK (Penalized INK-based score) counters this by using an LLM to perform rubric-based semantic grading that explicitly deducts for over-correction, producing scores that align more closely with human experts than lexical baselines and reversing model rankings on the FERMAT dataset.

What carries the argument

PINK (Penalized INK-based score), a semantic metric that applies LLM-driven rubric grading to multi-line transcriptions and subtracts points for any semantic or notational changes the model introduces relative to the original student work.
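
The full scoring pipeline is described in the paper (and sketched in Figure 2); below is a minimal, hypothetical rendering of the penalty structure only, assuming the LLM grader returns per-rubric-item scores plus a flag for items where the transcription "fixed" the student's work. The item names echo the auto-grading rubric excerpted in the figures, but the data class, penalty weight, and clipping are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str
    score: int            # 0-20 per item, following the auto-grading rubric shown in Figure 16
    over_corrected: bool   # flagged when the transcription "fixed" the student's work on this item

def pink_score(items: list[RubricItem], penalty_per_item: float = 10.0) -> float:
    """Hypothetical sketch of a penalized rubric score: sum the grader's per-item
    scores, then subtract a fixed penalty for every item flagged as over-corrected.
    The real PINK penalty form and weights are defined in the paper, not here."""
    raw = sum(item.score for item in items)
    penalty = penalty_per_item * sum(item.over_corrected for item in items)
    return max(0.0, raw - penalty)

# Toy example: a transcription that silently fixed the student's final two steps.
items = [
    RubricItem("Core Formula/Principle Identification", 20, False),
    RubricItem("Condition/Boundary Application",         18, False),
    RubricItem("Calculation Accuracy (Early-Mid)",       17, False),
    RubricItem("Calculation Accuracy (Late)",            20, True),   # error was "fixed"
    RubricItem("Final Answer",                           20, True),   # error was "fixed"
]
print(pink_score(items))  # raw 95 drops to 75 after the over-correction penalty
```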

If this is right

  • Models such as GPT-4o receive sharply lower scores under PINK than under BLEU because of their tendency to rewrite student steps.
  • Gemini 2.5 Flash ranks highest for faithful transcription once over-correction is penalized.
  • Educational AI systems that rely on VLM OCR will miss student misconceptions unless the evaluation metric penalizes rewriting.
  • Lexical metrics alone are insufficient for multi-line math because they cannot distinguish helpful cleanup from error concealment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training VLMs with an auxiliary objective that rewards exact reproduction of visible strokes rather than semantic cleanup could reduce over-correction at the source (a toy sketch of such an objective follows this list).
  • The same penalization approach could be adapted to other faithful-transcription tasks such as handwritten code or diagram labeling.
  • Deploying PINK in live educational platforms would surface student errors that current VLM pipelines suppress.
  • Extending the rubric inside PINK to include step-by-step logical validity could further tighten alignment with actual learning outcomes.
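
Purely as an illustration of the first bullet above, and not anything proposed in the paper, one way to bias training toward faithful reproduction is to upweight the transcription loss exactly where a cleaned-up rendering would diverge from what the student actually wrote. The PyTorch-style sketch below assumes pre-aligned token sequences; all names and the weighting scheme are invented.

```python
import torch
import torch.nn.functional as F

def faithfulness_weighted_loss(logits: torch.Tensor,
                               faithful_ids: torch.Tensor,
                               corrected_ids: torch.Tensor,
                               upweight: float = 3.0) -> torch.Tensor:
    """Hypothetical auxiliary objective: next-token cross-entropy against the
    verbatim student transcript, with extra weight on positions where a
    semantically 'corrected' transcript disagrees with what the student wrote.
    Those are exactly the positions an over-correcting model tends to rewrite.

    logits:        (seq_len, vocab_size) decoder outputs
    faithful_ids:  (seq_len,) token ids of the verbatim student transcript
    corrected_ids: (seq_len,) token ids of a cleaned-up version (assumed
                   pre-aligned to the same length for this sketch)
    """
    per_token = F.cross_entropy(logits, faithful_ids, reduction="none")
    weights = torch.where(faithful_ids != corrected_ids,
                          torch.full_like(per_token, upweight),
                          torch.ones_like(per_token))
    return (weights * per_token).mean()
```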

Load-bearing premise

The LLM that performs the rubric grading inside PINK itself avoids over-correction or other biases when judging semantic fidelity.

What would settle it

A controlled human study in which experts rate the same set of transcriptions and find that PINK scores do not correlate more strongly with their judgments of faithfulness than BLEU scores do.
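
A minimal sketch of that comparison, assuming one expert faithfulness rating and one score per metric for each transcription; all numbers are placeholders, and a real study would also need a formal test of the difference between the two correlations (and per-expert agreement).

```python
# Sketch of the settling experiment: do PINK scores track expert faithfulness
# ratings more closely than BLEU scores do? All numbers below are placeholders.
import numpy as np
from scipy.stats import kendalltau

expert = np.array([9, 2, 7, 4, 8, 3, 6, 5])                 # expert faithfulness ratings (0-10)
pink   = np.array([88, 25, 70, 41, 79, 30, 64, 55])          # placeholder PINK scores
bleu   = np.array([0.72, 0.65, 0.70, 0.69, 0.74, 0.66, 0.71, 0.68])  # placeholder BLEU scores

tau_pink, p_pink = kendalltau(expert, pink)
tau_bleu, p_bleu = kendalltau(expert, bleu)
print(f"PINK vs experts: tau={tau_pink:.2f} (p={p_pink:.3f})")
print(f"BLEU vs experts: tau={tau_bleu:.2f} (p={p_bleu:.3f})")
# The paper's claim would be undermined if tau_bleu matched or exceeded tau_pink
# on a properly powered sample, with the difference tested formally.
```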

Figures

Figures reproduced from arXiv: 2604.22774 by Jin Seong, Jong-hun Shin, Minho Kim, Soojong Lim, Wencke Liermann.

Figure 1
Figure 1: BLEU’s Blind Spots in Multi-line Handwritten Math OCR. (a) BLEU penalizes semantically equivalent expressions, failing to capture their mathematical meaning. (b) BLEU assigns a high score even when the OCR misses a key part of the solution, treating a critical omission as insignificant. These cases highlight BLEU’s lexical limitation and motivate PINK, a semantic and recognition-aware metric that additi… view at source ↗
Figure 2
Figure 2: Overview of the proposed PINK scoring system. Step 1: a VLM OCR output over-corrects a student error. Step 2: the auto-grading system assigns a seemingly high score. Step 3: comparison with the oracle score reveals an over-correction event. Step 4: penalties are applied to compute the final PINK score. …level PINK scores (r=0.99, τ=0.90). Further analyses confirming stability across repeated runs and promp… view at source ↗
Figure 3
Figure 3: Distribution of over-correction instances across rubric items for 15 models, sorted by Penalized Score (Top is Best). Over-corrections are heavily concentrated in the final two stages, particularly in higher-performing models, consistent with autoregressive context accumulation overriding visual evidence. Where and When Does Over-Correction Occur? Our analysis reveals that over-correction is not a rando… view at source ↗
Figure 5
Figure 5: Ranking Reversal of VLMs. The leaderboard based on conventional BLEU scores (left) is dramatically overturned when re-evaluated with our PINK metric (right), which penalizes over-correction. Models like GPT-4o drop significantly, while Gemini 2.5 Flash ascends nine places to the top rank, showing that existing metrics miss the crucial dimension of faithfulness. …correction. The results in… view at source ↗
Figure 6
Figure 6: Over-Correction Rate: Comparison between GPT-4o (red) and Gemini 2.5 Flash (green). …logically consistent transcriptions that nevertheless severely distort the student’s original work. For completeness, … view at source ↗
Figure 7
Figure 7: Human evaluation results. (a) Experts showed a clear overall preference for PINK over BLEU. (b) PINK was consistently preferred across all score ranges, demonstrating its robustness. …university-level mathematical knowledge and an average of over two years of experience in grading mathematics assignments. The evaluation was performed on 200 samples extracted from the FERMAT dataset, transcribed by Ovis2-8… view at source ↗
Figure 8
Figure 8: Ranking stability under different penalty… view at source ↗
Figure 9
Figure 9: Model ranking trajectories under varying… view at source ↗
Figure 11
Figure 11 illustrates the interface used for our human expert study. The evaluation protocol was designed to simulate a realistic grading scenario while ensuring blind comparison: 1. Simulated Grading Scenario: Annotators are presented with a math problem and a “Student Solution.” Although this text is the VLM’s OCR transcription, it is presented as a student’s answer. Annotators are instructed to act as math teac… view at source ↗
Figure 12
Figure 12: Case Study of Over-Correction. (Top) Simple-Level: InternVL3-8B locally fixes the student’s error (sin → tan), inflating the score from 34 to 99. (Bottom) Critical-Level: GPT-4o ignores the student’s erroneous reasoning steps and provides a correct summary, inflating the score from 14 to 100. …auto-graded scores: they fail to distinguish between faithful transcription and cases where the model quietly “fix… view at source ↗
Figure 13
Figure 13: Visualizing Attention Maps and Penalty Application. We visualize the attention mechanism at the moment of generating the target token, conditioned on the preceding integral context. (Top) InternVL3-8B generates the mathematically correct ‘tan’ (Over-Correction). The attention map shows no activation on the handwritten ‘sin’, verifying that the contextual pressure caused the model to treat visual evidence… view at source ↗
Figure 14
Figure 14: Over-correction frequency across all 15 tested VLM models. view at source ↗
Figure 15
Figure 15: Ranking changes after applying the over-correction penalty. The PINK score leads to a significant reordering of models, highlighting the impact of penalizing unfaithful corrections. …std ≤ 0.006, confirming that our metric is robust to prompt wording. F Prompt-Based Mitigation Attempt: To investigate whether over-correction can be reduced at inference time, we augment the OCR prompt with explicit faithful… view at source ↗
Figure 16
Figure 16: Auto-Grading Prompt. view at source ↗
Figure 17
Figure 17: Auto-Labeling Prompt. view at source ↗
Figure 18
Figure 18: OCR Prompt. OCR Prompt (Mitigated Ver.). System Prompt / User Prompt: You are a math assistant specializing in extracting mathematical question and answer content from handwritten images of math problems by middle or high school students. Your task is to analyze the given Image, which contains the handwritten math Question-Answer pair, and convert it into LaTeX format. Follow the specific instructions provid… view at source ↗
Figure 19
Figure 19: Mitigated OCR Prompt. view at source ↗
read the original abstract

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that VLMs exhibit over-correction when transcribing multi-line handwritten math solutions (hiding student errors), introduces PINK as an LLM-based rubric grading metric that explicitly penalizes this behavior, reports ranking reversals versus BLEU on the FERMAT dataset (e.g., GPT-4o heavily penalized while Gemini 2.5 Flash ranks highest), and shows human experts prefer PINK (55.0%) over BLEU (39.5%).

Significance. If the central claims hold after addressing the grader validation, the work would be significant for educational AI: it identifies a previously under-studied failure mode in multi-line math OCR and supplies a semantic metric that better matches human judgment than lexical baselines, with potential to improve automated assessment of student reasoning.

major comments (2)
  1. [PINK metric and evaluation protocol] The PINK metric definition relies on an LLM for rubric-based grading to detect semantic correctness and penalize over-correction, but the manuscript provides no validation that this grader avoids the same over-correction bias demonstrated for VLMs (which share core architectures with the grader LLM); this assumption is load-bearing for the reported ranking reversals and human-preference results.
  2. [Human studies] The human expert preference study (55.0% for PINK vs. 39.5% for BLEU) is presented without details on participant count, annotation protocol, inter-rater reliability, or statistical significance testing; these omissions prevent assessment of whether the alignment claim is robust.
minor comments (1)
  1. [Abstract and dataset description] The composition, size, and diversity of the FERMAT dataset (number of multi-line samples, error types, student demographics) are not described, which would strengthen the generalizability claims for the 15-VLM evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive review. Their comments have helped us identify areas where the manuscript can be strengthened. We address each major comment below.

read point-by-point responses
  1. Referee: [PINK metric and evaluation protocol] The PINK metric definition relies on an LLM for rubric-based grading to detect semantic correctness and penalize over-correction, but the manuscript provides no validation that this grader avoids the same over-correction bias demonstrated for VLMs (which share core architectures with the grader LLM); this assumption is load-bearing for the reported ranking reversals and human-preference results.

    Authors: We thank the referee for pointing out this important consideration. The PINK metric is designed with explicit instructions in the LLM prompt to detect and penalize over-correction by comparing the transcription against the original student work for semantic fidelity. However, we acknowledge that without explicit validation of the grader, the results may be questioned. In the revised version, we will add a new subsection validating the LLM grader by having it evaluate a set of synthetic over-correction examples and reporting agreement with human experts; a sketch of one plausible validation design follows these responses. This will support the robustness of the reported rankings and preferences. revision: yes

  2. Referee: [Human studies] The human expert preference study (55.0% for PINK vs. 39.5% for BLEU) is presented without details on participant count, annotation protocol, inter-rater reliability, or statistical significance testing; these omissions prevent assessment of whether the alignment claim is robust.

    Authors: We apologize for not including sufficient details on the human study in the original submission. We will revise the manuscript to provide a full description of the participant recruitment (number of experts), the exact annotation protocol used for the preference study, calculations for inter-rater reliability, and the statistical tests applied to the preference percentages. These additions will allow for a better assessment of the claim's robustness. revision: yes
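
One plausible shape for the grader validation promised above, sketched under assumptions that are not in the manuscript: construct synthetic pairs in which the "transcription" deliberately fixes a known student error, have the LLM grader and human experts independently label each pair as over-corrected or faithful, and report chance-corrected agreement. The labels below are placeholders for illustration.

```python
# Hypothetical validation of the LLM grader: agreement with human experts on
# synthetic over-correction labels. All labels below are placeholders.
from sklearn.metrics import cohen_kappa_score

# 1 = "over-correction detected", 0 = "faithful transcription"
human_labels  = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
grader_labels = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, grader_labels)
accuracy = sum(h == g for h, g in zip(human_labels, grader_labels)) / len(human_labels)
print(f"raw agreement = {accuracy:.2f}, Cohen's kappa = {kappa:.2f}")
# High kappa on examples engineered to contain over-correction would support the
# assumption that the grader itself does not share the VLMs' bias.
```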

Circularity Check

0 steps flagged

No circularity: PINK is an independently defined LLM-rubric metric evaluated against human judgments

full rationale

The paper defines PINK explicitly as a new semantic metric that applies LLM-based rubric grading to penalize detected over-correction in VLM transcriptions of multi-line math. No equations or derivations reduce the metric to its own outputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing claims rest on self-citations whose content is unverified. The reported ranking reversals and human preference (55% vs 39.5%) are presented as empirical outcomes of applying the metric, not as tautological consequences of its definition. The potential bias of the LLM grader itself is a separate correctness concern, not a circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim relies on the assumption that over-correction is a measurable failure mode and that LLM-based rubric grading is valid for semantic assessment.

axioms (1)
  • domain assumption: an LLM can perform reliable rubric-based grading of math solutions
    PINK relies on the LLM for semantic evaluation.

pith-pipeline@v0.9.0 · 5569 in / 1095 out tokens · 31417 ms · 2026-05-13T22:39:36.964973+00:00 · methodology

discussion (0)


    Classify based on what you found: - If only format differences found → "LaTeX format difference only" - If only recognition errors found → "Recognition error only" - If both found → "Both LaTeX and recognition errors" - If neither found → "No errors" Format Difference Categories: - Math Mode Difference: $ x $ vs \( x \) vs x - Spacing Difference: \quad vs...