Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
S em E val-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.
citing papers explorer
-
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
-
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.