GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents
Pith reviewed 2026-05-25 06:59 UTC · model grok-4.3
The pith
Adversarial prompt and token changes can alter LLM grading outcomes with high success and stealth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GradingAttack is a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM based educational grading agents. It designs token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. This reveals that current LLM based educational agents lack robust defenses against adversarial attacks.
What carries the argument
GradingAttack framework with token-level and prompt-level attack strategies that manipulate grading outcomes while maintaining high stealth.
If this is right
- LLM grading agents can have their outputs changed by adversarial inputs.
- Prompt-level attacks tend to succeed more frequently at altering grades.
- Token-level attacks are harder for observers to detect.
- Educational LLM agents require additional security measures to be trustworthy.
- Automated grading carries risks from undetected manipulation in real use.
Where Pith is reading between the lines
- Developers could add adversarial testing using these attack types before releasing grading agents.
- Similar manipulation risks may apply to other LLM agents making assessment or feedback decisions.
- Detection methods focused on input pattern anomalies could reduce success of these attacks.
- Defenses might need to be specific to short-answer grading rather than general LLM security.
Load-bearing premise
The tested LLM grading agents and datasets are representative of real-world educational deployments, and the attack success rates will hold when the agents are used in live classroom settings rather than controlled experiments.
What would settle it
A live deployment of an LLM grading agent in an actual classroom where attempted prompt-level and token-level attacks fail to change grades or are reliably detected by standard review processes.
Figures
read the original abstract
Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educational environments, significantly boosting assessment efficiency and scalability. However, when these grading agents operate ``in the wild'', their vulnerability to adversarial manipulation raises critical concerns about agent security and trustworthiness. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM based educational grading agents. Specifically, we design token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth, exposing fundamental weaknesses in current agent deployments. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. Our findings reveal that current LLM based educational agents lack robust defenses against adversarial attacks, underscoring the urgent need for developing secure and trustworthy agent systems for critical educational applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GradingAttack, a fine-grained adversarial attack framework for LLM-based automatic short answer grading (ASAG) agents. It proposes token-level and prompt-level attack strategies that manipulate grading outcomes while aiming for high stealth, and claims that experiments on multiple datasets show both strategies effectively compromise the agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth.
Significance. If the empirical results hold with proper documentation, the work is significant for highlighting security risks in deployed educational AI systems, potentially motivating defenses for fairness and trustworthiness in automated assessment. The empirical attack framework, if reproducible with clear metrics and baselines, would contribute to the growing literature on LLM vulnerabilities in critical applications.
major comments (2)
- [Abstract] Abstract: the claim that 'experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents' provides no details on the LLMs tested, the datasets, the definition or computation of attack success rates, baselines, or any statistical significance tests. This absence makes it impossible to assess whether the data support the central claim.
- [Abstract] Abstract: the reported success rates rest on the untested assumption that the specific grading agents (LLMs + prompts) are representative; there is no indication that experiments varied system prompts, incorporated chain-of-thought reasoning, few-shot examples, or fine-tuned models, which could sharply reduce attack effectiveness in real deployments.
minor comments (1)
- [Abstract] The abstract refers to 'high stealth' and 'superior stealth capability' without defining the metric (e.g., detection rate by humans or other LLMs) or how it was measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and outline planned revisions to strengthen the presentation of our experimental claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents' provides no details on the LLMs tested, the datasets, the definition or computation of attack success rates, baselines, or any statistical significance tests. This absence makes it impossible to assess whether the data support the central claim.
Authors: We agree that the abstract, as a high-level summary, omits these specifics. The manuscript body (Sections 3.2, 4.1, and 5) details the LLMs evaluated (GPT-3.5-Turbo, GPT-4, Llama-2-7B), the datasets (SciEntsBank, Beetle, and two additional ASAG corpora), the attack success rate metric (fraction of responses whose assigned grade is altered to the attacker-chosen target), the baseline comparisons (random token replacement and prompt paraphrasing), and the use of paired t-tests for significance. To improve accessibility, we will revise the abstract to include a concise clause summarizing the LLMs, datasets, and primary success-rate definition. revision: yes
-
Referee: [Abstract] Abstract: the reported success rates rest on the untested assumption that the specific grading agents (LLMs + prompts) are representative; there is no indication that experiments varied system prompts, incorporated chain-of-thought reasoning, few-shot examples, or fine-tuned models, which could sharply reduce attack effectiveness in real deployments.
Authors: This is a fair observation. Our experiments used standard zero-shot system prompts with the listed off-the-shelf models; we did not ablate system-prompt wording, add chain-of-thought, few-shot exemplars, or evaluate fine-tuned graders. We will add an explicit limitations paragraph acknowledging that more elaborate prompting or fine-tuning could increase robustness, and we will frame the current results as evidence of vulnerabilities in typical current deployments rather than claiming universal representativeness. revision: partial
Circularity Check
No circularity: empirical attack evaluation with no derivations or self-referential reductions
full rationale
The paper introduces GradingAttack as an empirical adversarial framework consisting of token-level and prompt-level strategies, evaluated via experiments on multiple datasets. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided abstract or described structure. The central claims rest on reported attack success rates from controlled experiments rather than any self-definitional or load-bearing reduction to inputs. This matches the default expectation for non-circular empirical security papers.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Sridevi Bonthu, S. Rama Sree, and M. H. M. Krishna Prasad. Automated short answer grading using deep learning: A survey. InProceedings of the International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Virtual Event, August 2021
work page 2021
-
[2]
The eras and trends of automatic short answer grading
Steven Burrows, Iryna Gurevych, and Benno Stein. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25:60–117, 2015
work page 2015
-
[3]
Automatic short answer grading for finnish with chatgpt
Li-Hsin Chang and Filip Ginter. Automatic short answer grading for finnish with chatgpt. InProceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, March 2024
work page 2024
-
[4]
Impey Chris, Wenger Matthew, Garuda Nikhil, Golchin Shahriar, and Stamer Sarah. Using large language models for automated grading of student writing about science.International Journal of Artificial Intelligence in Education, pages 1–35, 2025
work page 2025
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey.ACM Computing Surveys, pages 1–34, 2025
work page 2025
-
[7]
Yuning Ding, Brian Riordan, Andrea Horbach, Aoife Cahill, and Torsten Zesch. Don’t take “nswvt- nvakgxpm” for an answer–the surprising vulnerability of automatic content scoring systems to adversarial input. InProceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, December 2020
work page 2020
-
[8]
Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Ben- tivogli, Peter Clark, Ido Dagan, and Hoa Trang Dang. SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. InProceedings of the 7th International Workshop on Semantic Evaluation, Atlanta, Georgia, USA, June 2013
work page 2013
-
[9]
Anna Filighera, Sebastian Ochs, Tim Steuer, and Thomas Tregel. Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs.International Journal of Artificial Intelligence in Education, 34:616–646, 2024
work page 2024
-
[10]
Fooling automatic short answer grading systems
Anna Filighera, Tim Steuer, and Christoph Rensing. Fooling automatic short answer grading systems. In Proceedings of the 21st International Conference on Artificial Intelligence in Education, Ifrane, Morocco, July 2020. 10
work page 2020
-
[11]
Measuring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InProceedings of 34th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Virtual Event, December 2021
work page 2021
-
[12]
Helen A Klein, Nancy M Levenburg, Marie McKendall, and William Mothersell. Cheating during the college years: How do business school students compare?Journal of Business Ethics, 72:197–206, 2007
work page 2007
-
[13]
A multilingual dataset of adversarial attacks to automatic content scoring systems
Ronja Laarmann-Quante, Christopher Chandler, Noemi Incirkus, Vitaliia Ruban, Alona Solopov, and Luca Steen. A multilingual dataset of adversarial attacks to automatic content scoring systems. InProceedings of the 20th Conference on Natural Language Processing, Vienna, Austria, September 2024
work page 2024
-
[14]
Mwptoolkit: An open-source framework for deep learning-based math word problem solvers
Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. Mwptoolkit: An open-source framework for deep learning-based math word problem solvers. InProceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Event, February 2022
work page 2022
-
[15]
Advancing adversarial suffix transfer learning on aligned large language models
Hongfu Liu, Yuxi Xie, Ye Wang, and Michael Shieh. Advancing adversarial suffix transfer learning on aligned large language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024
work page 2024
-
[16]
Autodan: Generating stealthy jailbreak prompts on aligned large language models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. InProceedings of the 12th International Conference on Learning Representations, Vienna, Austria, May 2024
work page 2024
-
[17]
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Comuniqa : Exploring large language models for improving speaking skills
Manas Mhasakar, Shikhar Sharma, Apurv Mehra, Utkarsh Venaik, Ujjwal Singhal, Dhruv Kumar, and Kashish Mittal. Comuniqa : Exploring large language models for improving speaking skills. InProceedings of the 7th ACM SIGCAS/SIGCHI Conference of Computing and Sustainable Societies, New Delhi, India, July 2024
work page 2024
-
[19]
Haile Misgna, Byung-Won On, Ingyu Lee, and Gyu Sang Choi. A survey on deep learning-based automated essay scoring and feedback generation.Artificial Intelligence Review, 58:1–40, 2025
work page 2025
-
[20]
Autotutor meets large language models: A language model tutor with rich pedagogy and guardrails
Sankalan Pal Chowdhury, Vilém Zouhar, and Mrinmaya Sachan. Autotutor meets large language models: A language model tutor with rich pedagogy and guardrails. InProceedings of the 11th ACM Conference on Learning @ Scale, New York, NY , USA, July 2024
work page 2024
-
[21]
Marko Putnikovic and Jelena Jovanovic. Embeddings for automatic short answer grading: A scoping review.IEEE Transactions on Learning Technologies, 16:219–231, 2023
work page 2023
-
[22]
Mohi Reza, Nathan M Laundry, Ilya Musabirov, Peter Dushniku, Zhi Yuan “Michael” Yu, Kashish Mittal, Tovi Grossman, Michael Liut, Anastasia Kuzminykh, and Joseph Jay Williams. Abscribe: Rapid exploration & organization of multiple writing variations in human-ai co-writing tasks using large language models. InProceedings of the 2024 CHI Conference on Human ...
work page 2024
-
[23]
Enhancing short answer grading with openai apis
Sebastian Speiser and Annegret Weng. Enhancing short answer grading with openai apis. InProceedings of the 21st International Conference on Information Technology Based Higher Education and Training, Paris, France, November 2024
work page 2024
-
[24]
Pedro Antonio Gutiérrez Víctor Manuel Vargas and César Hervás-Martínez. Unimodal regularisation based on beta distribution for deep ordinal regression.Pattern Recognition, 122:1–10, February 2022
work page 2022
-
[25]
Zeming Wei, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv preprint arXiv:2310.06387, 2023
-
[26]
Bernard E Whitley. Factors associated with cheating among college students: A review.Research in Higher Education, 39:235–274, 1998
work page 1998
-
[27]
An llm can fool itself: A prompt-based adversarial attack.arXiv preprint arXiv:2310.13345, 2023
Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, and Mohan Kankanhalli. An llm can fool itself: A prompt-based adversarial attack.arXiv preprint arXiv:2310.13345, 2023
-
[28]
Evaluating the Performance of Large Language Models on GAOKAO Benchmark
Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on gaokao benchmark.arXiv preprint arXiv:2305.12474, 2023. 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Boosting jailbreak attack with momentum
Yihao Zhang and Zeming Wei. Boosting jailbreak attack with momentum. InProceedings of the ICLR 2024 Workshop on Reliable and Responsible Foundation Models, Vienna, Austria, May 2024
work page 2024
-
[30]
Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, XIAOYU XU, Xiaobao Wu, Jie Fu, Feng Yichao, Fengjun Pan, and Anh Tuan Luu. A survey of recent backdoor attacks and defenses in large language models.Transactions on Machine Learning Research, pages 1–28, 2025
work page 2025
-
[31]
Universal vulnerabilities in large language models: Backdoor attacks for in-context learning
Shuai Zhao, Meihuizi Jia, Luu Anh Tuan, Fengjun Pan, and Jinming Wen. Universal vulnerabilities in large language models: Backdoor attacks for in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024
work page 2024
-
[32]
Li Zhong and Zilong Wang. Can llm replace stack overflow? a study on robustness and reliability of large language model code generation. InProceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, February 2024
work page 2024
-
[33]
Virtual context enhancing jailbreak attacks with special token injection
Yuqi Zhou, Lin Lu, Ryan Sun, Pan Zhou, and Lichao Sun. Virtual context enhancing jailbreak attacks with special token injection. InFindings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 2024
work page 2024
-
[34]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 12 A Effectiveness of CAS Metric Evaluating adversarial attack performance often relies solely on the ASR. However, this metric alone is insufficient to ref...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.