LLM-as-a-judge validity in physics assessment depends more on the task than the model

2026 · physics.ed-ph · arXiv 2603.14732

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking is valid is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and anchored conditions. We distinguish absolute accuracy from rank-order agreement, since a marking system can match the distribution of human marks while failing to order responses by quality. Across task types, performance is sharply task-dependent. For blind university exam questions ($n=771$) and secondary and university structured questions ($n=1151$), models show robust rank-order agreement with human markers (Spearman $\rho > 0.6$), with official solutions reducing error and strengthening agreement. False solutions degrade absolute accuracy, showing that models defer to provided references, but leave rank-ordering intact. Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking and adding a mark scheme does not improve rank-order agreement. Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but rank-order agreement remains near-zero. For code-based plot elements ($n=1400$), models achieve high rank-order agreement ($\rho > 0.84$) with near-linear calibration. Across all task types, validity tracks the structure of the assessment task - the extent to which marks can be mapped to explicit, observable grading features - and the reliability of the human benchmark, rather than raw model capability.
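The abstract's distinction between absolute accuracy and rank-order agreement can be illustrated with a minimal sketch. The marks below are invented for illustration and are not data from the paper; the helper names (`average_ranks`, `spearman`, `mean_absolute_error`) are hypothetical, and mean absolute error is used here only as one plausible stand-in for "absolute accuracy", which the abstract does not define.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    return mean(abs(a - b) for a, b in zip(x, y))


# Invented marks out of 10 for seven responses (illustrative only).
human = [2, 4, 5, 6, 7, 8, 10]

# Marker A awards the same set of marks to the wrong responses:
# identical mean and spread, but the ordering is scrambled.
marker_a = [6, 10, 2, 7, 4, 8, 5]

# Marker B is uniformly 2 marks harsher: biased, but the ordering is intact.
marker_b = [m - 2 for m in human]

for name, marks in [("A", marker_a), ("B", marker_b)]:
    print(
        name,
        "mean:", round(mean(marks), 2),
        "sd:", round(pstdev(marks), 2),
        "MAE:", round(mean_absolute_error(human, marks), 2),
        "rho:", round(spearman(human, marks), 2),
    )
```

In this toy case, marker A reproduces the human mark distribution exactly yet shows almost no rank-order agreement, while marker B is systematically harsh yet orders the responses perfectly. This mirrors the pattern the abstract describes for anchored essay marking (mean matched, rank-order agreement near zero) and for false-solution conditions (accuracy degraded, rank-ordering intact).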

representative citing papers

Safeguarding LLM Agents from Misalignment through Provenance Analysis

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

ProvenanceGuard applies a provenance-based framework to detect three types of misalignment in LLM agent tool calls, cutting error rates on misaligned traces from 42.9% to 1.8% on one benchmark while lowering unnecessary interventions.

citing papers explorer

Showing 1 of 1 citing paper.

Safeguarding LLM Agents from Misalignment through Provenance Analysis

cs.CL · 2026-05-01 · unverdicted · none · ref 47 · internal anchor
