
arxiv: 2607.00274 · v1 · pith:UKDNNVQAnew · submitted 2026-06-30 · 💻 cs.CL · cs.AI

SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework

Pith reviewed 2026-07-02 18:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords SEFORA corpus · UniMatch evaluation · LLM feedback · instructor annotations · student essay feedback · writing support · automatic feedback generation · reference-based evaluation

The pith

LLMs match instructor feedback on student essays at no more than 0.4 F1 across 74 configurations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SEFORA, a public corpus of 564 college essay drafts paired with 8,240 real instructor annotations across multiple genres, and introduces UniMatch, an evaluation method that breaks feedback into units, measures semantic match to instructor criteria, and uses optimal alignment to compute precision, recall, and F1. It tests multiple LLMs in 74 configurations and finds none exceeds 0.4 F1, with models failing to surface the comments instructors treat as most important and showing worse results as output length grows. A sympathetic reader cares because writing feedback is a major driver of student improvement yet remains labor-intensive to produce at scale, so evidence that current models fall short of instructor alignment indicates automated systems cannot yet substitute for human input without further advances.

Core claim

SEFORA supplies the first large public collection of authentic instructor inline feedback on multi-draft college writing, together with prompts, rubrics, and scores. UniMatch provides a reference-based metric that segments generated feedback, scores unit-level semantic correspondence against instructor-derived criteria, and applies optimal matching to produce interpretable precision, recall, and F1 scores. Across the LLMs tested, no configuration exceeds 0.4 F1; models show systematic difficulty identifying the feedback instructors prioritize, and performance degrades as they generate longer responses.

What carries the argument

UniMatch, a reference-based evaluation framework that segments feedback into units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield precision, recall, and F1.
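
A minimal Python sketch of the matching step, assuming the similarity scores between instructor units and model units are already available as a matrix. The function names, the brute-force search, and the soft scoring rule (summed similarity of matched pairs divided by the number of units on each side) are illustrative assumptions, not the paper's implementation; the source states only that units are aligned by optimal matching to yield precision, recall, and F1.

```python
from itertools import permutations


def optimal_matching(sim):
    """Return the one-to-one pairs (ref_idx, gen_idx) with maximal total similarity.

    sim[i][j] is the similarity between instructor unit i and model unit j.
    Brute force over permutations: fine for a handful of units per paragraph.
    """
    n_ref = len(sim)
    n_gen = len(sim[0]) if sim else 0
    if n_ref == 0 or n_gen == 0:
        return []
    best_total, best_pairs = float("-inf"), []
    if n_ref <= n_gen:
        # Assign every instructor unit to a distinct model unit.
        for cols in permutations(range(n_gen), n_ref):
            pairs = list(enumerate(cols))
            total = sum(sim[i][j] for i, j in pairs)
            if total > best_total:
                best_total, best_pairs = total, pairs
    else:
        # Assign every model unit to a distinct instructor unit.
        for rows in permutations(range(n_ref), n_gen):
            pairs = [(i, j) for j, i in enumerate(rows)]
            total = sum(sim[i][j] for i, j in pairs)
            if total > best_total:
                best_total, best_pairs = total, pairs
    return best_pairs


def precision_recall_f1(sim):
    """Soft precision/recall/F1 from the matched similarity mass (assumed rule)."""
    n_ref = len(sim)
    n_gen = len(sim[0]) if sim else 0
    matched = sum(sim[i][j] for i, j in optimal_matching(sim))
    precision = matched / n_gen if n_gen else 0.0
    recall = matched / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical scores: 2 instructor units (rows) vs. 3 model units (columns).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
]
print(optimal_matching(sim))
print(precision_recall_f1(sim))
```

The unmatched third model unit lowers precision but not recall, which is the asymmetry the metric is meant to expose. For larger paragraphs the brute-force search could be swapped for a Hungarian-algorithm solver such as SciPy's `linear_sum_assignment`.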

If this is right

  • Current LLMs cannot yet produce feedback that aligns with instructor priorities at usable levels.
  • Automated writing support at classroom scale will require new methods beyond existing generation approaches.
  • Models need explicit training signals for identifying high-priority feedback points rather than maximizing volume.
  • Performance limits appear tied to output length, implying constraints on how much feedback any single model response can reliably deliver.
  • The SEFORA corpus now supplies a public benchmark for measuring future progress on instructor-aligned feedback generation.
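
One arithmetic reason that volume can hurt, sketched in Python under simplifying assumptions: if the number of matched units stays fixed while a model emits more units, precision falls and recall does not rise, so F1 drops. The counts below are hypothetical and hard matching (a unit either matches or not) is assumed; this illustrates the metric's behavior and does not reproduce the paper's results.

```python
def prf(matched, n_generated, n_reference):
    """Precision, recall, and F1 from hard match counts."""
    precision = matched / n_generated if n_generated else 0.0
    recall = matched / n_reference if n_reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical paragraph: 4 instructor units, of which the model hits 2,
# no matter how many units it generates.
for n_generated in (2, 4, 8):
    precision, recall, f1 = prf(2, n_generated, 4)
    print(f"{n_generated} units: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

With the match count held at 2, F1 simplifies to 4 / (n_generated + 4), so every extra unmatched unit pulls the score down.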

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 0.4 F1 ceiling could serve as a practical threshold for deciding when human review remains necessary in deployed writing tools.
  • UniMatch's matching procedure might be extended to other open-ended instructional tasks such as code review or lab report comments.
  • If the gold-standard annotations prove stable across institutions, the corpus could support supervised fine-tuning experiments that target priority detection specifically.
  • The observed length degradation suggests testing whether shorter, targeted prompts or multi-turn interaction improve alignment without increasing total output.

Load-bearing premise

The collected instructor annotations in SEFORA form a reliable gold standard for the feedback that should be given in actual classrooms.

What would settle it

An LLM configuration that achieves F1 above 0.4 on held-out SEFORA data under the same segmentation and matching procedure, or evidence that different instructors assign substantially different priority annotations to the same drafts.

Figures

Figures reproduced from arXiv: 2607.00274 by Carolina Gustafson, Diane Litman, Gayle Rogers, Norah Almousa, Raquel Coelho, Shayan Peyghambari Oskoui, Xiang Lorraine Li, Zhaoyi Joey Hou.

Figure 1: Overview of the UniMatch evaluation pipeline. For each paragraph, LLM-generated feedback is segmented into units and compared with instructor feedback units. The resulting semantic similarity scores are used to compute an optimal matching between instructor and model feedback units, producing the final evaluation metrics.
In writing instruction, producing such feedback is labor-intensive, and its cost at …
Figure 2: An annotated draft from SEFORA, illustrating the dataset's span-anchored, fine-grained instructor feedback: sticky-note comments paired with color-coded highlights. Pink highlights mark especially effective passages, yellow highlights indicate issues needing attention, and green highlights mark ideas worth developing in a subsequent draft. Full annotation conventions are documented in §A.2.
materials are p…
Figure 3: Agreement between annotator-averaged scores (rows, rounded to the model's scale) and gemini-3.1-flash-lite-preview scores (columns); darker cells hold more pairs. Intensity along the diagonal indicates strong agreement; the low-score corner is darkest as most feedback-unit pairs are unrelated and score low.
Automatic similarity scorer validation. Existing similarity methods fall short on this task: lexic…
Figure 4: Matching feedback A and B: boxes mark segments; solid lines = high similarity, dashed = low.
test the pipeline end to end, we selected 20 paragraphs, each with three raw LLM feedback texts (§5) that UniMatch scores at well-separated levels (F1 ≈ 0, 0.5, 1), and had a writing expert independently rank the three by quality against the reference feedback. Across all 60 texts, UniMatch F1 correlates stro…
Figure 5: Precision, recall, and F1 across the 74 configurations of §5, sorted by number of generated feedback units (x-axis is ordinal and not to scale). The strip marks the prompt template: the constrained template (dark) and the unconstrained (light). Precision and F1 fall as volume rises while recall does not compensate.
show a subset due to space constraints, but the pattern matches the full results. On the si…
Figure 6: Recruitment email for students.
Figure 7: Recruitment email for instructor, part 1.
Figure 8: Recruitment email for instructor, part 2.
Figure 9: Stages of the deterministic PDF parsing pipeline (§A.3). Each annotated submission is converted to a unified JSON representation preserving paragraph segmentation and annotation anchoring.
C Feedback Unit Similarity Pipeline. C.1 Automatic similarity metrics. In our experiment and setting, lexical-overlap measures such as BLEU (Papineni et al., 2002) and chrF (Popović, 2015) yield Pearson correlations be…
Figure 10: Feedback segmentation annotation guideline (page 1/1).
Figure 11: Feedback similarity annotation guideline (page 1/3).
Figure 12: Feedback similarity annotation guideline (page 2/3).
Figure 13: Feedback similarity annotation guideline (page 3/3).
Figure 14: Effect of guidance on F1. Each column is one model under a given condition (Base: zero-shot, no rubric; Few: few-shot; R: rubric); the top row adds feedback-category guidance (G) and the bottom row omits it. Guidance improves F1 in every column. Darker cells indicate higher F1.
Figure 15: Effect of rubric and few-shot prompting on
Figure 16: F1 across condition (top header) and prompt variant (rows). V1 denotes the unconstrained prompt (§F); V2 denotes the default single-unit prompt used throughout the main experiments
Figure 17: Unconstrained prompt template (free-form paragraph-level feedback with no constraint on the number of
Figure 18: Default single-unit prompt template used throughout the main experiments (§
Original abstract

Effective writing feedback is among the strongest drivers of student learning, yet producing it at scale is labor-intensive. LLMs offer a natural path to scaling writing support, but two gaps stand in the way: few public corpora capture how instructors actually deliver feedback in real classrooms, and no reliable method measures whether generated feedback aligns with what an instructor would write. We address both. SEFORA is a public corpus pairing instructor inline feedback with assignment prompts, rubrics, scores, and multi-draft revisions across various college writing genres, comprising 564 drafts and 8,240 instructor annotations. UniMatch is a reference-based evaluation framework for open-ended generation: it segments feedback into feedback units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield interpretable precision, recall, and F1. Across 74 experimental configurations spanning multiple LLMs, no setting exceeds 0.4 F1. UniMatch reveals that models struggle to identify the feedback instructors would prioritize, and performance degrades as models generate more.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents SEFORA, a corpus of 564 student drafts with 8,240 instructor feedback annotations across college writing genres, and UniMatch, a reference-based evaluation framework that segments feedback units, scores semantic correspondence under instructor-derived criteria, and applies optimal matching to compute precision, recall, and F1. Across 74 LLM configurations, no setting exceeds 0.4 F1; the framework is used to argue that models struggle to identify prioritized instructor feedback and that performance degrades with more generated feedback.

Significance. If the SEFORA annotations prove reliable and UniMatch tracks human judgments of alignment, the work supplies a public resource for studying real instructor feedback and demonstrates a concrete performance ceiling for current LLMs on this task, which could usefully direct research toward better alignment with instructor priorities. The public corpus itself is a clear asset for the community.

major comments (3)
  1. [Abstract / Corpus Construction] Abstract and Corpus Construction section: the description of the 8,240 annotations supplies no collection protocol details or inter-annotator agreement statistics, which are load-bearing for treating the annotations as a reliable gold standard against which the 0.4 F1 ceiling is measured.
  2. [UniMatch Framework] UniMatch Framework section: no validation study is reported that correlates UniMatch scores (segmentation + semantic scoring + optimal matching) with human judgments of feedback alignment, leaving open whether the low F1 values reflect model behavior or a mismatch in the evaluation procedure itself.
  3. [Results] Results section (74 configurations): the claim that no setting exceeds 0.4 F1 and that performance degrades with more generated feedback lacks reported statistical significance tests or confidence intervals, making it impossible to assess whether the ceiling is robust or an artifact of sampling.
minor comments (1)
  1. [UniMatch Framework] Notation for feedback units and matching criteria could be clarified with an explicit example in the UniMatch section to aid reproducibility.
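
The check requested in major comment 2 could be run as a rank correlation between metric scores and human judgments. Below is a minimal Python sketch of Spearman's rho using only the standard library; the function names are illustrative and the score lists are made-up placeholders, not data from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: metric F1 per feedback text and an expert quality rank
# (1 = best), so a well-behaved metric should correlate negatively with rank.
metric_f1 = [0.0, 0.5, 1.0, 0.1, 0.6, 0.9]
expert_rank = [3, 2, 1, 3, 2, 1]
print(spearman(metric_f1, expert_rank))
```

A rank correlation is a natural fit when the human judgment is itself an ordering rather than a score on the metric's scale.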

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and indicate the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract / Corpus Construction] Abstract and Corpus Construction section: the description of the 8,240 annotations supplies no collection protocol details or inter-annotator agreement statistics, which are load-bearing for treating the annotations as a reliable gold standard against which the 0.4 F1 ceiling is measured.

    Authors: We agree that additional protocol details are required. The revised Corpus Construction section will include a complete description of the data collection process, instructor recruitment, course contexts, assignment types, and annotation guidelines. Because each draft received feedback from only its own course instructor as part of normal teaching, multiple independent annotations per item do not exist and conventional IAA statistics cannot be computed. We will state this limitation explicitly and discuss its implications for the gold-standard status of the annotations. revision: yes

  2. Referee: [UniMatch Framework] UniMatch Framework section: no validation study is reported that correlates UniMatch scores (segmentation + semantic scoring + optimal matching) with human judgments of feedback alignment, leaving open whether the low F1 values reflect model behavior or a mismatch in the evaluation procedure itself.

    Authors: We recognize that an explicit correlation study between UniMatch scores and human alignment judgments would increase confidence in the metric. The present work prioritizes introducing the framework and applying it at scale; a dedicated validation study was outside the original scope. In revision we will expand the UniMatch section with a detailed rationale for each component (segmentation rules, instructor-derived semantic criteria, and optimal matching) and will add an explicit limitations paragraph noting the absence of direct human correlation data. A small-scale validation pilot could be added if the editor considers it essential for acceptance. revision: partial

  3. Referee: [Results] Results section (74 configurations): the claim that no setting exceeds 0.4 F1 and that performance degrades with more generated feedback lacks reported statistical significance tests or confidence intervals, making it impossible to assess whether the ceiling is robust or an artifact of sampling.

    Authors: We agree that statistical support is needed. The revised Results section will report 95% bootstrap confidence intervals around all F1 scores and will include appropriate non-parametric tests (e.g., Wilcoxon signed-rank with correction) to evaluate whether observed differences across the 74 configurations, including the degradation trend with feedback volume, are statistically significant. revision: yes
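
The bootstrap intervals promised in response 3 could be computed along these lines. This is a minimal percentile-bootstrap sketch in Python, assuming per-paragraph F1 scores are available for a configuration; the function name, resample count, and score list are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-paragraph F1 scores for one configuration.
f1_scores = [0.10, 0.20, 0.30, 0.40, 0.25, 0.35, 0.15, 0.30]
mean_f1 = sum(f1_scores) / len(f1_scores)
print(mean_f1, bootstrap_ci(f1_scores))
```

Resampling paragraphs rather than individual feedback units keeps units from the same paragraph together, which matters if their scores are correlated.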

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper's central results consist of F1 scores obtained by running UniMatch on LLM outputs against the externally collected SEFORA corpus of 8,240 instructor annotations. UniMatch performs segmentation, semantic scoring under instructor-derived criteria, and optimal matching to produce precision/recall/F1; these steps are defined procedurally and applied to independent human data rather than being fitted to or defined in terms of the target LLM performance. No equations, self-citations, or ansatzes are shown that would reduce the reported maximum F1 ≤ 0.4 to a tautology or fitted input. The evaluation chain remains externally anchored to the human-annotated benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims rest on the representativeness of the new corpus and the validity of the UniMatch procedure; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • Domain assumption: Instructor annotations collected for SEFORA accurately reflect real classroom feedback priorities.
    The paper treats the corpus as the reference standard for what instructors would write.

pith-pipeline@v0.9.1-grok · 5737 in / 1222 out tokens · 33119 ms · 2026-07-02T18:45:57.371585+00:00 · methodology

