The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

Ali Keramati; Mark Warschauer

arxiv: 2606.10327 · v1 · pith:4TD6PZU7new · submitted 2026-06-09 · 💻 cs.CL · cs.LG

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

Ali Keramati , Mark Warschauer This is my paper

Pith reviewed 2026-06-27 13:22 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords automated essay scoringsequential fine-tuningLLaMAdiscourse elementscurriculum learningLoRAparameter-efficient fine-tuning

0 comments

The pith

Sequential fine-tuning of LLaMA along discourse order beats independent training and a 70B baseline on essay scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the order of fine-tuning tasks on interdependent discourse elements matters for automated essay scoring with LLaMA-3.1-8B. It compares a sequential curriculum that trains on lead, then position, then claim, then evidence, then conclusion against independent per-task models and a randomized multi-task schedule, all using LoRA. The sequential approach produces the strongest results overall, with notable gains on evidence and conclusion scoring that exceed those from a much larger general-purpose model. A sympathetic reader would care because most AES systems still score elements in isolation, which can reduce coherence, and the work shows a practical way to capture dependencies with modest compute. The findings point to curriculum design as a lever for making smaller models competitive in educational NLP tasks.

Core claim

The authors establish that sequential fine-tuning of LLaMA-3.1-8B on the PERSUADE 2.0 corpus, progressing through discourse elements in the order lead-position-claim-evidence-conclusion, yields the highest performance across tasks. This curriculum achieves F1 scores of 65% on evidence and 87% on conclusion with accuracies of 63% and 85%, surpassing both independent task-specific models and a randomized multi-task baseline. It also outperforms a general-purpose LLaMA-70B model on conclusion scoring despite using far fewer parameters. The work concludes that curriculum design aligned with discourse structure improves coherence and that small task-optimized models offer a scalable alternative t

What carries the argument

The sequential fine-tuning curriculum that progressively adapts LLaMA-3.1-8B via LoRA to discourse elements in their natural order to capture interdependencies for coherent AES.

If this is right

Sequential fine-tuning aligned with discourse structure produces the strongest overall AES results on the tested corpus.
Independent training underperforms the sequential curriculum on most discourse elements.
Randomized multi-task training improves position scoring but is less consistent on other elements.
Small task-optimized models can exceed a general-purpose 70B model on specific scoring tasks.
Releasing templates and implementation details supports reproduction and further curriculum experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ordering principle could be tested on other multi-step reasoning tasks where element dependencies are known in advance.
Alternative sequences or hybrid joint-sequential schedules might prove better on essay corpora with different rhetorical structures.
The released templates could enable direct comparison of curriculum effects across additional languages or assessment rubrics.

Load-bearing premise

The assumption that the specific sequence lead then position then claim then evidence then conclusion aligns with the true interdependencies that must be modeled to produce coherent AES judgments.

What would settle it

A head-to-head experiment on PERSUADE 2.0 showing that any other fixed order, the reverse order, or a joint multi-task model achieves equal or higher F1 and accuracy on evidence and conclusion would falsify the claim that this sequential curriculum is superior.

Figures

Figures reproduced from arXiv: 2606.10327 by Ali Keramati, Mark Warschauer.

**Figure 2.** Figure 2: Prompt Formatting Template for Lead Statement Evaluation [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Training loss for the Sequential Fine-Tuning method. The model is trained progressively [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Training loss for the Independent Fine-Tuning method. Each colored line represents a [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Training loss for the Randomized Fine-Tuning method. The single black line represents [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

read the original abstract

Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE~2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLM), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sequential curriculum beats independent and random on this AES task, but the results don't isolate whether the specific discourse order is what drives the gains.

read the letter

The main point is that fine-tuning LLaMA-3.1-8B sequentially on lead, position, claim, evidence, then conclusion produces higher F1 and accuracy than either separate per-task models or a shuffled multi-task run, and it even beats the 70B baseline on the conclusion element.

The work is straightforward: they run the three curricula with LoRA and 4-bit quantization on PERSUADE 2.0 and report the numbers. That head-to-head is new for discourse-element AES, and releasing the templates makes it easy for others to replicate or extend. The practical angle—that a small optimized model can compete with a much larger general one on some subtasks—is useful for anyone doing low-compute adaptation.

The limitation is exactly the one in the stress-test note. Only one fixed sequence is tried, so the paper cannot yet claim that aligning with discourse structure is the operative factor rather than sequential training in general. A reverse order or a different progression would be needed to pin that down. The abstract also gives no error bars, no statistical tests, and no dataset split details, which leaves the strength of the differences unclear from the text alone.

This is for people working on curriculum design in educational NLP or multi-task LLM fine-tuning. A reader who needs concrete numbers on AES subtasks or wants to try similar ordering experiments will find something to build on.

I would send it to peer review. The core comparison is simple enough that referees can evaluate the missing controls and ask for the extra ablations without much trouble.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates task-aware fine-tuning of LLaMA-3.1-8B with LoRA and 4-bit quantization for automated essay scoring (AES) on interdependent discourse elements in the PERSUADE 2.0 corpus. It compares three curricula—sequential fine-tuning in the fixed order lead→position→claim→evidence→conclusion, independent per-task models, and randomized multi-task training—claiming that the sequential approach yields the strongest results (e.g., 65% F1 and 63% accuracy on evidence; 87% F1 and 85% accuracy on conclusion), outperforming both the independent baseline and a general-purpose LLaMA-70B model on conclusion despite the latter's larger size. The work concludes that curriculum design aligned with discourse structure improves coherence and that small optimized models can compete with much larger LLMs.

Significance. If the central empirical findings hold, the paper would contribute to educational NLP by providing evidence that training order can materially affect performance on interdependent tasks and that parameter-efficient fine-tuning enables smaller models to rival larger general-purpose LLMs in domain-specific applications. The explicit release of templates and implementation details strengthens the work by supporting reproducibility and enabling follow-on research on curriculum design.

major comments (2)

[Abstract and Results] Abstract and Results section: The central claim that 'curriculum design aligned with discourse structure' is the operative factor is not supported by the experiments. Only independent and randomized (shuffled) baselines are compared; no alternative fixed orders (e.g., reverse order or claim-first) are tested. This prevents distinguishing the effect of the specific discourse-aligned sequence from the effect of any progressive sequential schedule.
[Experimental Setup and Results] Experimental Setup and Results: No error bars, statistical significance tests, run-to-run variance, or dataset split details are reported for the F1/accuracy numbers (e.g., 65% F1 on evidence, 87% F1 on conclusion). Without these, the claim that sequential training 'yields the strongest overall results' cannot be rigorously evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate to strengthen the work.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results section: The central claim that 'curriculum design aligned with discourse structure' is the operative factor is not supported by the experiments. Only independent and randomized (shuffled) baselines are compared; no alternative fixed orders (e.g., reverse order or claim-first) are tested. This prevents distinguishing the effect of the specific discourse-aligned sequence from the effect of any progressive sequential schedule.

Authors: We agree that the experiments do not isolate the specific discourse-aligned order from other possible fixed sequential schedules. The randomized baseline controls for order consistency in general, but alternative fixed orders (e.g., reverse) would be required to fully attribute gains to discourse alignment rather than any progressive curriculum. In the revised manuscript we will temper the abstract and results claims to emphasize benefits of fixed sequential training over independent and randomized approaches, while noting the discourse alignment as a motivated design choice rather than a fully isolated causal factor. We will also add this as an explicit limitation. revision: yes
Referee: [Experimental Setup and Results] Experimental Setup and Results: No error bars, statistical significance tests, run-to-run variance, or dataset split details are reported for the F1/accuracy numbers (e.g., 65% F1 on evidence, 87% F1 on conclusion). Without these, the claim that sequential training 'yields the strongest overall results' cannot be rigorously evaluated.

Authors: We acknowledge the absence of error bars, variance across runs, statistical tests, and detailed split information in the reported metrics. These omissions limit the ability to assess robustness. In the revision we will report results from multiple random seeds with standard deviations, include appropriate statistical significance tests between curricula, and expand the experimental setup section with explicit details on train/validation/test splits and any stratification used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical comparisons

full rationale

The paper reports standard empirical results from fine-tuning LLaMA-3.1-8B with LoRA on the PERSUADE 2.0 corpus, comparing three curricula (sequential, independent, randomized) via F1 and accuracy metrics. No equations, parameter fits, self-citations, or derivations are present that reduce any reported outcome to an input quantity defined by the authors. The central claim rests on held-out test performance rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or non-standard axioms are stated in the text. The method implicitly relies on the standard assumption that LoRA adapters with 4-bit quantization can adapt LLaMA to the AES tasks without catastrophic forgetting.

axioms (1)

domain assumption LoRA with 4-bit quantization is sufficient to adapt LLaMA-3.1-8B to the five AES subtasks while preserving task-specific performance
Stated as the training method in the abstract without further justification or ablation.

pith-pipeline@v0.9.1-grok · 5792 in / 1504 out tokens · 23841 ms · 2026-06-27T13:22:11.500054+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 4 canonical work pages

[1]

Unveiling the tapestry of automated essay scoring: a comprehensive investigation of accuracy, fairness, and generalizability , year =

Yang, Kaixun and Rakovi\'. Unveiling the tapestry of automated essay scoring: a comprehensive investigation of accuracy, fairness, and generalizability , year =. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Education...

work page doi:10.1609/aaai.v38i20.30254
[2]

Hill and Dan Jurafsky and Chris Piech , title =

Dorottya Demszky and Jing Liu and Heather C. Hill and Dan Jurafsky and Chris Piech , title =. Educational Evaluation and Policy Analysis , volume =. 2024 , doi =. https://doi.org/10.3102/01623737231169270 , abstract =

work page doi:10.3102/01623737231169270 2024
[3]

Can Large Language Models Automatically Score Proficiency of Written Essays?

Mansour, Watheq Ahmad and Albatarni, Salam and Eltanbouly, Sohaila and Elsayed, Tamer. Can Large Language Models Automatically Score Proficiency of Written Essays?. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

2024
[4]

Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability , journal =

Austin Pack and Alex Barrett and Juan Escalante , keywords =. Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.caeai.2024.100234 , url =

work page doi:10.1016/j.caeai.2024.100234 2024
[5]

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Stahl, Maja and Biermann, Leon and Nehring, Andreas and Wachsmuth, Henning. Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation. Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 2024

2024
[6]

Crossley and Y

S.A. Crossley and Y. Tian and P. Baffour and A. Franklin and M. Benner and U. Boser , keywords =. A large-scale corpus for assessing written argumentation: PERSUADE 2.0 , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.asw.2024.100865 , url =

work page doi:10.1016/j.asw.2024.100865 2024
[7]

2025 , eprint=

How well can LLMs Grade Essays in Arabic? , author=. 2025 , eprint=

2025
[8]

Proceedings of The Tenth Pan-Commonwealth Forum on Open Learning (PCF10) , year=

Automated essay scoring (AES) systems: Opportunities and challenges for open and distance education , author=. Proceedings of The Tenth Pan-Commonwealth Forum on Open Learning (PCF10) , year=
[9]

Journal of Learning Analytics , volume=

The effects of explanations in automated essay scoring systems on student trust and motivation , author=. Journal of Learning Analytics , volume=. 2023 , publisher=

2023
[10]

Research Methods in Applied Linguistics , volume=

Exploring the potential of using an AI language model for automated essay scoring , author=. Research Methods in Applied Linguistics , volume=. 2023 , publisher=

2023
[11]

Artificial Intelligence Review , volume=

A survey on deep learning-based automated essay scoring and feedback generation , author=. Artificial Intelligence Review , volume=. 2025 , publisher=

2025
[12]

arXiv preprint arXiv:2102.13136 , year=

Automated essay scoring using efficient transformer-based language models , author=. arXiv preprint arXiv:2102.13136 , year=

arXiv
[13]

International Conference on Artificial Intelligence in Education , pages=

Neural automated essay scoring considering logical structure , author=. International Conference on Artificial Intelligence in Education , pages=. 2023 , organization=

2023
[14]

Zeitschrift f

A hierarchical rater model approach for integrating automated essay scoring models , author=. Zeitschrift f. 2024 , publisher=

2024
[15]

Computers and Education: Artificial Intelligence , volume=

Can AI provide useful holistic essay scoring? , author=. Computers and Education: Artificial Intelligence , volume=. 2024 , publisher=

2024
[16]

arXiv preprint arXiv:2109.11728 , year=

AES systems are both overstable and oversensitive: Explaining why and proposing defenses , author=. arXiv preprint arXiv:2109.11728 , year=

arXiv
[17]

Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6--10, 2020, Proceedings, Part I 21 , pages=

Robust neural automated essay scoring using item response theory , author=. Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6--10, 2020, Proceedings, Part I 21 , pages=. 2020 , organization=

2020
[18]

arXiv preprint arXiv:2008.01441 , year=

Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring , author=. arXiv preprint arXiv:2008.01441 , year=

arXiv 2008
[19]

arXiv preprint arXiv:2502.08450 , year=

Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring , author=. arXiv preprint arXiv:2502.08450 , year=

arXiv
[20]

Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking , author=. Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

2020
[21]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021

[1] [1]

Unveiling the tapestry of automated essay scoring: a comprehensive investigation of accuracy, fairness, and generalizability , year =

Yang, Kaixun and Rakovi\'. Unveiling the tapestry of automated essay scoring: a comprehensive investigation of accuracy, fairness, and generalizability , year =. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Education...

work page doi:10.1609/aaai.v38i20.30254

[2] [2]

Hill and Dan Jurafsky and Chris Piech , title =

Dorottya Demszky and Jing Liu and Heather C. Hill and Dan Jurafsky and Chris Piech , title =. Educational Evaluation and Policy Analysis , volume =. 2024 , doi =. https://doi.org/10.3102/01623737231169270 , abstract =

work page doi:10.3102/01623737231169270 2024

[3] [3]

Can Large Language Models Automatically Score Proficiency of Written Essays?

Mansour, Watheq Ahmad and Albatarni, Salam and Eltanbouly, Sohaila and Elsayed, Tamer. Can Large Language Models Automatically Score Proficiency of Written Essays?. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

2024

[4] [4]

Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability , journal =

Austin Pack and Alex Barrett and Juan Escalante , keywords =. Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.caeai.2024.100234 , url =

work page doi:10.1016/j.caeai.2024.100234 2024

[5] [5]

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Stahl, Maja and Biermann, Leon and Nehring, Andreas and Wachsmuth, Henning. Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation. Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 2024

2024

[6] [6]

Crossley and Y

S.A. Crossley and Y. Tian and P. Baffour and A. Franklin and M. Benner and U. Boser , keywords =. A large-scale corpus for assessing written argumentation: PERSUADE 2.0 , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.asw.2024.100865 , url =

work page doi:10.1016/j.asw.2024.100865 2024

[7] [7]

2025 , eprint=

How well can LLMs Grade Essays in Arabic? , author=. 2025 , eprint=

2025

[8] [8]

Proceedings of The Tenth Pan-Commonwealth Forum on Open Learning (PCF10) , year=

Automated essay scoring (AES) systems: Opportunities and challenges for open and distance education , author=. Proceedings of The Tenth Pan-Commonwealth Forum on Open Learning (PCF10) , year=

[9] [9]

Journal of Learning Analytics , volume=

The effects of explanations in automated essay scoring systems on student trust and motivation , author=. Journal of Learning Analytics , volume=. 2023 , publisher=

2023

[10] [10]

Research Methods in Applied Linguistics , volume=

Exploring the potential of using an AI language model for automated essay scoring , author=. Research Methods in Applied Linguistics , volume=. 2023 , publisher=

2023

[11] [11]

Artificial Intelligence Review , volume=

A survey on deep learning-based automated essay scoring and feedback generation , author=. Artificial Intelligence Review , volume=. 2025 , publisher=

2025

[12] [12]

arXiv preprint arXiv:2102.13136 , year=

Automated essay scoring using efficient transformer-based language models , author=. arXiv preprint arXiv:2102.13136 , year=

arXiv

[13] [13]

International Conference on Artificial Intelligence in Education , pages=

Neural automated essay scoring considering logical structure , author=. International Conference on Artificial Intelligence in Education , pages=. 2023 , organization=

2023

[14] [14]

Zeitschrift f

A hierarchical rater model approach for integrating automated essay scoring models , author=. Zeitschrift f. 2024 , publisher=

2024

[15] [15]

Computers and Education: Artificial Intelligence , volume=

Can AI provide useful holistic essay scoring? , author=. Computers and Education: Artificial Intelligence , volume=. 2024 , publisher=

2024

[16] [16]

arXiv preprint arXiv:2109.11728 , year=

AES systems are both overstable and oversensitive: Explaining why and proposing defenses , author=. arXiv preprint arXiv:2109.11728 , year=

arXiv

[17] [17]

Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6--10, 2020, Proceedings, Part I 21 , pages=

Robust neural automated essay scoring using item response theory , author=. Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6--10, 2020, Proceedings, Part I 21 , pages=. 2020 , organization=

2020

[18] [18]

arXiv preprint arXiv:2008.01441 , year=

Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring , author=. arXiv preprint arXiv:2008.01441 , year=

arXiv 2008

[19] [19]

arXiv preprint arXiv:2502.08450 , year=

Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring , author=. arXiv preprint arXiv:2502.08450 , year=

arXiv

[20] [20]

Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking , author=. Findings of the Association for Computational Linguistics: EMNLP 2020 , pages=

2020

[21] [21]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021