
arxiv: 2606.06754 · v1 · pith:NMSB67DWnew · submitted 2026-06-04 · 💻 cs.MA · cs.CL

MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

Pith reviewed 2026-06-27 22:43 UTC · model grok-4.3

classification 💻 cs.MA cs.CL
keywords: essay scoring · multi-agent debate · retrieval-augmented generation · LLM evaluation · training-free · analytic assessment · rubric calibration

The pith

MADRAG decomposes essay scoring into an Advocate-Skeptic-Judge debate augmented with retrieved rubric examples to achieve training-free performance near supervised levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MADRAG as a training-free method that structures LLM evaluation of essays through a three-agent debate: an Advocate highlights strengths, a Skeptic identifies weaknesses, and a Judge aggregates the exchange into a score. The Judge is further grounded by retrieving scored examples that match the rubric criteria, allowing calibration by direct comparison rather than relying solely on internal model knowledge. A sympathetic reader would care because this setup targets the known problems of bias and inconsistency in plain prompting while avoiding the data and compute costs of training custom models. Results indicate clear gains over baselines, with retrieval driving consistency and debate aiding complex traits, suggesting a practical path to reliable automated analytic scoring.

Core claim

MADRAG decomposes evaluation into an interactive Advocate-Skeptic-Judge process augmented with rubric-aligned exemplar retrieval, enabling calibration and improved reasoning that significantly outperforms prompt-based baselines while approaching supervised systems without task-specific training.

What carries the argument

The Advocate-Skeptic-Judge debate where the Judge retrieves and compares against rubric-aligned scored exemplars to aggregate arguments into a final score.
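The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the `Exemplar` type, the `score_trait` function, the prompt wording, and the `rounds` parameter are all hypothetical names chosen for the example, and `stub_llm` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Exemplar:
    """A scored example essay for one rubric trait (assumed structure)."""
    essay: str
    score: int


def score_trait(essay: str, trait: str, rubric: str,
                exemplars: List[Exemplar], llm: LLM, rounds: int = 1) -> str:
    """Run Advocate and Skeptic turns, then ask the Judge for a verdict."""
    transcript: List[str] = []
    base = f"Trait: {trait}\nRubric: {rubric}\nEssay: {essay}\n"
    turns = (
        ("Advocate", "Argue for the essay's strengths on this trait."),
        ("Skeptic", "Critique the essay's weaknesses and rebut the Advocate."),
    )
    for _ in range(rounds):
        for role, task in turns:
            history = "\n".join(transcript)
            reply = llm(f"ROLE: {role}\n{base}Debate so far:\n{history}\n{task}")
            transcript.append(f"{role}: {reply}")
    # Only the Judge sees the retrieved, scored exemplars.
    shots = "\n".join(f"[score {ex.score}] {ex.essay}" for ex in exemplars)
    debate = "\n".join(transcript)
    return llm(
        f"ROLE: Judge\n{base}Scored exemplars:\n{shots}\nDebate:\n{debate}\n"
        "Compare the essay with the exemplars, weigh the debate, and give a score."
    )


def stub_llm(prompt: str) -> str:
    """Placeholder model that only echoes the role; swap in a real client."""
    role = prompt.split("\n", 1)[0].replace("ROLE: ", "")
    return f"<{role} reply>"


verdict = score_trait("An essay.", "Ideas", "1-6 scale",
                      [Exemplar("A sample essay.", 4)], stub_llm)
print(verdict)
```

Keeping the exemplars out of the Advocate and Skeptic prompts mirrors the summary's statement that the Judge is the agent grounded by retrieval; whether the other agents also see exemplars in the paper is not stated here.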

If this is right

  • Structured debate improves reasoning specifically on higher-level traits such as organization and argumentation.
  • Rubric-aligned retrieval supplies the external calibration needed for consistent scoring without fine-tuning.
  • Ablation results establish that both the multi-agent interaction and the retrieval step are required for the observed gains.
  • The same framework can be applied to new essay prompts or traits by swapping only the rubric and example pool.
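As a rough illustration of rubric-aligned retrieval and of swapping in a new example pool, the sketch below filters a pool of scored examples by trait and ranks them by similarity to the essay. It is an assumption-laden stand-in: the bag-of-words cosine similarity replaces whatever embedding model the paper uses, and `ScoredExample` and `retrieve_exemplars` are hypothetical names.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredExample:
    """One entry in the example pool (assumed structure)."""
    trait: str
    essay: str
    score: int


def _bow(text: str) -> Counter:
    # Bag-of-words vector; a stand-in for a learned sentence embedding.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_exemplars(essay: str, trait: str,
                       pool: List[ScoredExample], k: int = 3) -> List[ScoredExample]:
    """Return the k examples for this trait that are most similar to the essay."""
    query = _bow(essay)
    candidates = [ex for ex in pool if ex.trait == trait]
    candidates.sort(key=lambda ex: _cosine(query, _bow(ex.essay)), reverse=True)
    return candidates[:k]


pool = [
    ScoredExample("Ideas", "dogs are loyal pets", 4),
    ScoredExample("Ideas", "the economy grew fast", 2),
    ScoredExample("Organization", "dogs are loyal pets", 5),
]
top = retrieve_exemplars("dogs are friendly pets", "Ideas", pool, k=1)
print(top[0].score)
```

Under this design, moving to a new prompt or trait only requires a different `pool` and `trait` argument, which is the kind of swap the last bullet describes.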

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to scoring other open-ended student work such as short answers or project reports if suitable exemplars are available.
  • Retrieval quality and coverage of the example database will likely determine how well the system handles unusual or edge-case essays.
  • If the base LLM carries systematic biases into the agent roles, the debate format may reduce but not fully eliminate them.

Load-bearing premise

The interactive Advocate-Skeptic-Judge process together with rubric-aligned exemplar retrieval will produce stable, unbiased scores that generalize beyond the specific test conditions.

What would settle it

Testing MADRAG on a new essay corpus with fresh prompts and comparing its trait-level agreement metrics against human raters to see whether the reported gains over baselines hold or disappear.
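The paper's ablation figure reports quadratic weighted kappa (QWK), a standard agreement metric for ordinal scores, so a replication of this kind would likely compute it per trait. The sketch below is a plain-Python version of the usual definition, 1 minus the ratio of weighted observed disagreement to weighted expected disagreement with weights (i - j)^2 / (N - 1)^2; the function name and the convention of returning 1.0 in the degenerate single-score case are choices made for this example.

```python
from typing import Sequence


def quadratic_weighted_kappa(human: Sequence[int], model: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """QWK between two integer score lists on the scale [min_score, max_score]."""
    n = max_score - min_score + 1
    if n < 2 or not human or len(human) != len(model):
        raise ValueError("need equal-length, non-empty inputs and at least 2 levels")
    if any(not (min_score <= s <= max_score) for s in list(human) + list(model)):
        raise ValueError("score outside the declared scale")

    # Observed confusion matrix: rows are human scores, columns model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    total = len(human)
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_m[j] / total
    # den == 0 only when both raters used one identical score throughout;
    # returning 1.0 there is a convention chosen for this sketch.
    return 1.0 - num / den if den else 1.0


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
```

Values near 1 indicate close agreement with the human rater, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.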

Figures

Figures reproduced from arXiv: 2606.06754 by Ali Keramati, Mark Warschauer, Sharad Mehrotra, Shiyuan Zhou.

Figure 1: Overview of the MADRAG scoring pipeline. The Supervisor routes each rubric trait to a dedicated debate team, retrieves few-shot exemplars to augment the Judge, and the Judge aggregates the Advocate and Skeptic exchanges to produce the final trait score. The full sequence of agent messages is provided in Appendix A.8.
Baselines. We compare MADRAG against a diverse set of strong baselines spanning su…
Figure 2: Merged ablation results (QWK). Overlapping…
Figure 3: Advocate system prompt.
All text preceding the final score marker is stored as the Judge rationale. If the score cannot be parsed, the system records the rationale but flags the score as missing. A.6 Prompt Templates: All agent prompts are stored as external template files and rendered at runtime using a shared context dictionary. The context includes the trait name, the full rubric trait serialized as JSON…
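The score-handling and template-rendering behaviour described in the text accompanying Figure 3 can be sketched as follows. The marker string `FINAL_SCORE:`, the placeholder names `$trait_name` and `$rubric_json`, and the `JudgeResult`, `parse_judge_output`, and `render_prompt` names are all assumptions for illustration; the excerpt does not give the actual marker or template syntax, and the sketch takes the template as a string rather than reading an external file.

```python
import json
import re
from dataclasses import dataclass
from string import Template
from typing import Optional

# Hypothetical marker format; the real marker is not shown in the excerpt.
SCORE_MARKER = re.compile(r"FINAL_SCORE:\s*(-?\d+)")


@dataclass
class JudgeResult:
    rationale: str
    score: Optional[int]
    score_missing: bool


def parse_judge_output(text: str) -> JudgeResult:
    """Split Judge output into rationale and score at the last score marker."""
    matches = list(SCORE_MARKER.finditer(text))
    if not matches:
        # No parsable score: keep the rationale and flag the score as missing.
        return JudgeResult(rationale=text.strip(), score=None, score_missing=True)
    last = matches[-1]
    return JudgeResult(rationale=text[:last.start()].strip(),
                       score=int(last.group(1)), score_missing=False)


def render_prompt(template_text: str, trait_name: str, rubric_trait: dict) -> str:
    """Fill a prompt template from a shared context dictionary."""
    context = {
        "trait_name": trait_name,
        "rubric_json": json.dumps(rubric_trait, indent=2),
    }
    return Template(template_text).substitute(context)


result = parse_judge_output("Clear thesis, thin support.\nFINAL_SCORE: 4")
print(result.score, result.score_missing)
print(render_prompt("Score the $trait_name trait using:\n$rubric_json",
                    "Ideas", {"max": 6}))
```

Using the last marker occurrence means a Judge that quotes the marker earlier in its reasoning still yields the intended final score; this is a design choice of the sketch, not something the excerpt specifies.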
Figure 5: Judge system prompt.
…originally released as part of a Kaggle competition sponsored by the William and Flora Hewlett Foundation. The dataset consists of anonymized English essays written by students in grades 7–10 in response to eight distinct prompts, each defining a separate essay set. Essay sets vary substantially in genre (persuasive, narrative, and source-dependent response), length, grade level, and sco…
Figure 6: Example multi-agent debate for the Ideas trait (Part I): Advocate initiation and Skeptic rebuttal.
Figure 7: Example multi-agent debate for the Ideas trait (Part II): Judge synthesis and final score.
Original abstract

We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning (Advocate identifying strengths, Skeptic critiquing weaknesses, Judge aggregating arguments) with retrieval-augmented generation using rubric-aligned exemplars. The central claim is that MADRAG significantly outperforms prompt-based LLM baselines while approaching supervised system performance, with ablations attributing calibration gains to retrieval and improved reasoning on higher-level traits to debate.

Significance. If the results hold under broader conditions, this would represent a useful contribution to training-free LLM evaluation by showing how structured interaction and external memory can address bias and instability, offering a practical alternative to supervised methods in settings with limited labeled data.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts performance gains, ablation results, and generalization of the training-free advantage, but provides no metrics, datasets, statistical tests, or experimental details, preventing assessment of the central claim.
  2. [Experiments] Experiments section: Ablation studies are reported only for the specific test conditions described; the absence of cross-dataset, cross-rubric, or cross-LLM robustness numbers is load-bearing for the claim that the interactive Advocate-Skeptic-Judge process with rubric retrieval yields stable scores that generalize.
minor comments (1)
  1. [Method] Method section: The description of argument aggregation by the Judge and the precise retrieval mechanism would benefit from additional detail to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts performance gains, ablation results, and generalization of the training-free advantage, but provides no metrics, datasets, statistical tests, or experimental details, preventing assessment of the central claim.

    Authors: We agree that the abstract would be strengthened by including concrete details. In the revised manuscript we will expand the abstract to report key metrics (e.g., QWK on the ASAP dataset), name the primary datasets and rubrics used, and briefly note the statistical comparisons performed in the experiments section. This change directly addresses the concern while preserving the abstract's brevity. revision: yes

  2. Referee: [Experiments] Experiments section: Ablation studies are reported only for the specific test conditions described; the absence of cross-dataset, cross-rubric, or cross-LLM robustness numbers is load-bearing for the claim that the interactive Advocate-Skeptic-Judge process with rubric retrieval yields stable scores that generalize.

    Authors: The current experiments evaluate the framework on standard essay-scoring benchmarks (ASAP and a second dataset) using established rubrics and the primary LLM backbone. We acknowledge that explicit cross-dataset, cross-rubric, and cross-LLM results would provide stronger evidence for broad generalization. In revision we will (1) add an explicit limitations paragraph clarifying the tested conditions and (2) include a modest additional ablation on a second LLM where compute permits. Full cross-dataset experiments lie outside the present scope but can be noted as future work. revision: partial

Circularity Check

0 steps flagged

No derivation chain or mathematical claims present

full rationale

The paper introduces an empirical multi-agent framework (Advocate-Skeptic-Judge with rubric-aligned retrieval) and supports its claims solely through experimental results, ablation studies, and comparisons to baselines. No equations, first-principles derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. Performance assertions rest on external test conditions rather than any self-referential reduction, making the work self-contained against the circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5677 in / 1049 out tokens · 19537 ms · 2026-06-27T22:43:38.197148+00:00 · methodology

