{"paper":{"title":"Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"AI models for scoring short answers agree well with experts on fully correct and incorrect responses but show major degradation on mid-range ones, with less degradation after more task-specific adaptation.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Abigail Victoria Gurin Schleifer, Asaf Salman, Beata Beigman Klebanov, Giora Alexandron, Moriah Ariely","submitted_at":"2026-05-08T12:12:01Z","abstract_excerpt":"Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, G"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. 
This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The ground-truth scores assigned by a single biology education expert accurately capture the nuanced interpretation required for mid-range responses and serve as a stable reference for measuring model agreement.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AI short-answer scorers show mid-range quality degradation that lessens with more task-specific adaptation, while human agreement stays stable across the quality spectrum.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"AI models for scoring short answers agree well with experts on fully correct and incorrect responses but show major degradation on mid-range ones, with less degradation after more task-specific adaptation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f7dcdaccfa0846d465370acc20c4ed56bb8b09c56df2aa8d39b3bea677bc0917"},"source":{"id":"2605.07647","kind":"arxiv","version":2},"verdict":{"id":"70d07868-503d-4bc1-bec9-c21174b24040","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-11T01:49:18.189353Z","strongest_claim":"All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. 
This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.","one_line_summary":"AI short-answer scorers show mid-range quality degradation that lessens with more task-specific adaptation, while human agreement stays stable across the quality spectrum.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The ground-truth scores assigned by a single biology education expert accurately capture the nuanced interpretation required for mid-range responses and serve as a stable reference for measuring model agreement.","pith_extraction_headline":"AI models for scoring short answers agree well with experts on fully correct and incorrect responses but show major degradation on mid-range ones, with less degradation after more task-specific adaptation."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.07647/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T10:22:02.892223Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-20T05:37:24.567154Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T16:01:18.811606Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T11:39:25.813866Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"63f34cb53498a775763a6c54ef8df48691aa70beb276570c7cff6a8362ac2187"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims
":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}