arxiv: 2604.18660 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.AI

Recognition: unknown

Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks

Jin Zhao , Marta Kne\v{z}evi\'c , Tanja K\"aser

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM tutorsanswer leakageadversarial attacksjailbreakingrobustness benchmarkeducational AIstudent misusedefensive strategies

0 comments

The pith

A fine-tuned adversarial student agent can jailbreak LLM tutors to extract complete answers, serving as the core of a new benchmark for tutor robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how LLM tutors designed for scaffolding instead leak full solutions when students try to extract answers through adversarial means. It tests multiple tutor models and families against adapted persuasive and jailbreaking techniques, showing that standard in-context adversarial agents often prove ineffective. The authors therefore develop and fine-tune a dedicated adversarial student agent that succeeds at eliciting answers more reliably. They position this agent as the foundation for a standardized benchmark that measures how robust educational LLMs remain under misuse. The work also identifies basic defense approaches that lower leakage rates in these scenarios.

Core claim

We introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors by eliciting complete solutions rather than guided learning steps, and we propose this agent as the central element of a standardized benchmark for evaluating the robustness of LLM tutors against adversarial student attacks.

What carries the argument

The fine-tuned adversarial student agent, which adapts jailbreaking and persuasive techniques to the tutoring context to probe for answer leakage in model responses.

If this is right

Tutor models from different families and with different alignment methods show varying degrees of susceptibility to answer leakage under these attacks.
Simple defense strategies can be applied to reduce the rate at which tutors reveal final answers.
Pedagogically aligned and multi-agent tutor designs may provide partial resistance compared to base models.
A standardized benchmark built around the fine-tuned agent allows consistent comparison of robustness across future tutor systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be adapted to test other interactive AI systems where users might seek to bypass intended constraints.
Widespread adoption might push developers to prioritize leakage prevention when deploying tutors in real classrooms.
The finding that fine-tuned agents outperform in-context ones suggests similar attacker fine-tuning could improve safety testing in other conversational domains.

Load-bearing premise

The fine-tuned adversarial student agent and the chosen attack techniques faithfully represent how real students would attempt to misuse LLM tutors, and the evaluated scenarios cover the main forms of such misuse.

What would settle it

Measure answer leakage rates in live tutoring sessions where students are given access to the same prompts and strategies produced by the fine-tuned agent and compare those rates to the benchmark results.

Figures

Figures reproduced from arXiv: 2604.18660 by Jin Zhao, Marta Kne\v{z}evi\'c, Tanja K\"aser.

**Figure 2.** Figure 2: Proposed framework, in which tutor robustness is evaluated through multi-turn dialogues between a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Average tutor answer leakage rate (left) and average number of turns before answer disclosure (right) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Student leakage rate across difficulty levels, [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: Student leakage rate across difficulty levels, [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 5.** Figure 5: Comparison between manually defined prompts and LLM-generated adversarial attacks across different adversarial attack techniques and tutor models. Manually defined prompts consistently induce higher leakage than LLM-generated attacks across most models, and persuasive techniques (contextual manipulation, interpersonal influence, request shaping) outperform adversarial ones (intentional wrong answer, dire… view at source ↗

**Figure 8.** Figure 8: Leakage rate and technique usage distribu [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly used in education, yet their default helpfulness often conflicts with pedagogical principles. Prior work evaluates pedagogical quality via answer leakage-the disclosure of complete solutions instead of scaffolding-but typically assumes well-intentioned learners, leaving tutor robustness under student misuse largely unexplored. In this paper, we study scenarios where students behave adversarially and aim to obtain the correct answer from the tutor. We evaluate a broad set of LLM-based tutor models, including different model families, pedagogically aligned models, and a multi-agent design, under a range of adversarial student attacks. We adapt six groups of adversarial and persuasive techniques to the educational setting and use them to probe how likely a tutor is to reveal the final answer. We evaluate answer leakage robustness using different types of in-context adversarial student agents, finding that they often fail to carry out effective attacks. We therefore introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors, which we propose as the core of a standardized benchmark for evaluating tutor robustness. Finally, we present simple but effective defense strategies that reduce answer leakage and strengthen the robustness of LLM-based tutors in adversarial scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates LLM-based tutors for answer leakage robustness under adversarial student attacks, adapting six groups of adversarial/persuasive techniques to the educational domain. It finds in-context adversarial student agents largely ineffective, introduces a fine-tuned adversarial student agent as the core of a proposed standardized benchmark for tutor robustness, and outlines simple defense strategies to reduce leakage.

Significance. If the fine-tuned agent's attacks prove representative of real student misuse and the evaluation provides reproducible metrics, the work could establish a useful benchmark for hardening educational LLMs against answer leakage, addressing a gap in prior pedagogical evaluations that assume benign users. The adaptation of existing attack methods to tutoring scenarios and the identification of in-context agent limitations are constructive contributions, though the benchmark proposal's impact hinges on external validation.

major comments (2)

[Abstract / Evaluation Setup] Abstract and evaluation setup: the claims of attack effectiveness, in-context agent failure, and fine-tuned agent superiority are presented without any quantitative results, error bars, specific metrics (e.g., leakage rates per tutor/model), model details, or data on how attacks were measured and scored. This absence is load-bearing for the central claims about robustness differences and the benchmark proposal.
[Benchmark Proposal] Benchmark proposal (final sections): proposing the fine-tuned adversarial student agent as the core of a standardized benchmark requires evidence that it captures authentic real-world adversarial student strategies. No comparison to human adversarial interactions, ecological validity tests, or coverage analysis of misuse scenarios is provided, leaving the benchmark's representativeness unverified.

minor comments (2)

[Methods] Clarify the exact definition and measurement protocol for 'answer leakage' (e.g., binary disclosure vs. partial hints) and how it is distinguished from legitimate scaffolding in the evaluation.
[Agent Fine-Tuning] Provide details on the fine-tuning dataset, hyperparameters, and base model for the adversarial agent to enable reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the presentation of our quantitative findings and the proposed benchmark. We address each major comment below.

read point-by-point responses

Referee: [Abstract / Evaluation Setup] Abstract and evaluation setup: the claims of attack effectiveness, in-context agent failure, and fine-tuned agent superiority are presented without any quantitative results, error bars, specific metrics (e.g., leakage rates per tutor/model), model details, or data on how attacks were measured and scored. This absence is load-bearing for the central claims about robustness differences and the benchmark proposal.

Authors: We acknowledge that the abstract presents findings at a high level without specific numbers. The full manuscript (Sections 3-5) contains the requested quantitative details: leakage rates per tutor and model (with standard deviations), explicit comparisons showing in-context agents' low success rates versus the fine-tuned agent's higher effectiveness, model specifications, and the leakage scoring protocol (binary detection of final-answer disclosure plus partial-credit variants). To address the concern directly, we will revise the abstract to incorporate key metrics (e.g., average leakage reduction and agent-type deltas) and add a concise paragraph in the evaluation setup that summarizes the measurement methodology with an example. revision: yes
Referee: [Benchmark Proposal] Benchmark proposal (final sections): proposing the fine-tuned adversarial student agent as the core of a standardized benchmark requires evidence that it captures authentic real-world adversarial student strategies. No comparison to human adversarial interactions, ecological validity tests, or coverage analysis of misuse scenarios is provided, leaving the benchmark's representativeness unverified.

Authors: We agree that external validation against human data would strengthen the benchmark proposal. The current work systematically adapts six groups of established adversarial techniques to tutoring dialogues and trains the agent on successful trajectories from those adaptations; we also provide a qualitative mapping of these techniques to documented student misuse patterns. Direct human comparisons and ecological tests are outside the scope of this study. In revision we will add an explicit coverage analysis of the adapted techniques, a dedicated limitations subsection on representativeness, and a concrete plan for future human validation studies, thereby framing the benchmark as a reproducible starting point rather than a fully validated standard. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical adaptation and fine-tuning remain independent of inputs

full rationale

The paper adapts existing adversarial techniques to the tutoring domain, observes that in-context agents are ineffective, then fine-tunes a new agent and evaluates leakage rates across tutor models. This chain relies on external data and standard fine-tuning rather than defining the benchmark via its own outputs or renaming fitted parameters as predictions. No self-citations are load-bearing for the central claim, no uniqueness theorems are imported, and no ansatz or renaming patterns appear. The proposal of the fine-tuned agent as a benchmark core follows directly from the reported empirical results without reducing to constructional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that adapted adversarial techniques and the fine-tuned agent are representative of student misuse; no free parameters or invented entities beyond the agent are evident from the abstract.

axioms (1)

domain assumption Adversarial and persuasive techniques from general domains can be effectively adapted to probe educational LLM tutors for answer leakage.
The paper adapts six groups of such techniques to the educational setting.

invented entities (1)

fine-tuned adversarial student agent no independent evidence
purpose: To serve as an effective jailbreak method and core of a standardized benchmark for tutor robustness.
Introduced as a new component after standard attacks failed; no independent evidence of its representativeness provided in abstract.

pith-pipeline@v0.9.0 · 5510 in / 1185 out tokens · 65190 ms · 2026-05-10T04:18:12.028248+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 30 canonical work pages · 4 internal anchors

[2]

Thomas and Jionghao Lin and Kenneth R

Dollaya Hirunyasiri and Danielle R. Thomas and Jionghao Lin and Kenneth R. Koedinger and Vincent Aleven , year=
[6]

Opportunities and Challenges in Neural Dialog Tutoring

Macina, Jakub and Daheim, Nico and Wang, Lingzhi and Sinha, Tanmay and Kapur, Manu and Gurevych, Iryna and Sachan, Mrinmaya. Opportunities and Challenges in Neural Dialog Tutoring. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.173

work page doi:10.18653/v1/2023.eacl-main.173 2023
[9]

2025 , eprint=

Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation , author=. 2025 , eprint=

2025
[11]

2025 , eprint=

TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models , author=. 2025 , eprint=

2025
[12]

2025 , eprint=

Detecting LLM-Generated Short Answers and Effects on Learner Performance , author=. 2025 , eprint=

2025
[13]

Journal of Applied Learning & Teaching , volume=

The role of ChatGPT in higher education: Benefits, challenges, and future research directions , author=. Journal of Applied Learning & Teaching , volume=. 2023 , publisher=

2023
[15]

Advances in Neural Information Processing Systems , volume=

SocraticLM: Exploring socratic personalized teaching with large language models , author=. Advances in Neural Information Processing Systems , volume=
[16]

Proceedings of the 26th Australasian Computing Education Conference , pages=

Next-step hint generation for introductory programming using large language models , author=. Proceedings of the 26th Australasian Computing Education Conference , pages=
[19]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[21]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[22]

and Wong, Eric , booktitle=

Chao, Patrick and Robey, Alexander and Dobriban, Edgar and Hassani, Hamed and Pappas, George J. and Wong, Eric , booktitle=. Jailbreaking Black Box Large Language Models in Twenty Queries , year=
[26]

International Conference on Artificial Intelligence in Education , pages=

Beyond final answers: Evaluating large language models for math tutoring , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=

2025
[29]

2025 , eprint=

Small Models, Big Support: A Local LLM Framework for Educator-Centric Content Creation and Assessment with RAG and CAG , author=. 2025 , eprint=

2025
[30]

2025 , eprint=

Easy Come, Easy Go? Examining the Perceptions and Learning Effects of LLM-based Chatbot in the Context of Search-as-Learning , author=. 2025 , eprint=

2025
[31]

Psychology and the real world: Essays illustrating fundamental contributions to society , volume=

Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning , author=. Psychology and the real world: Essays illustrating fundamental contributions to society , volume=
[32]

, author=

Desirable difficulties in theory and practice. , author=. Journal of Applied research in Memory and Cognition , volume=. 2020 , publisher=

2020
[33]

2025 , eprint=

CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation , author=. 2025 , eprint=

2025
[35]

2025 , eprint=

TeachLM: Post-Training LLMs for Education Using Authentic Learning Data , author=. 2025 , eprint=

2025
[36]

Ad- vPrompter: Fast Adaptive Adversarial Prompting for LLMs

Advprompter: Fast adaptive adversarial prompting for llms , author=. arXiv preprint arXiv:2404.16873 , year=

work page arXiv
[37]

AgentHarm: A Benchmark for Measuring Harmfulness of

Maksym Andriushchenko and Alexandra Souly and Mateusz Dziemian and Derek Duenas and Maxwell Lin and Justin Wang and Dan Hendrycks and Andy Zou and J Zico Kolter and Matt Fredrikson and Yarin Gal and Xander Davies , booktitle=. AgentHarm: A Benchmark for Measuring Harmfulness of. 2025 , url=

2025
[38]

Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) , pages=

Llm-as-a-tutor in efl writing education: Focusing on evaluation of student-llm interaction , author=. Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) , pages=
[40]

Simulating classroom education with llm-empowered agents , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[41]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[43]

Gonzalez and Ion Stoica , booktitle=

Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging. 2023 , url=

2023
[44]

International Conference on Artificial Intelligence in Education , pages=

Student perceptions of adaptive goal setting recommendations: a design prototyping study , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=

2025
[45]

European conference on technology enhanced learning , pages=

AI or human? Evaluating student feedback perceptions in higher education , author=. European conference on technology enhanced learning , pages=. 2024 , organization=

2024
[46]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[48]

2025 , eprint=

Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors , author=. 2025 , eprint=

2025
[51]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
[52]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

2021
[53]

Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and 1 others. 2025. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387

work page arXiv 2025
[54]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. 2025. https://openreview.net/forum?id=AC5n7xHuR1 Agentharm: A benchmark for measuring harmfulness of LLM agents . In The Thirteenth International Conference on Learning Rep...

2025
[55]

Elizabeth L Bjork and Robert A Bjork. 2011. Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. Psychology and the real world: Essays illustrating fundamental contributions to society, 2(59-68):56--64

2011
[56]

Robert A Bjork and Elizabeth L Bjork. 2020. Desirable difficulties in theory and practice. Journal of Applied research in Memory and Cognition, 9(4):475

2020
[57]

Conrad Borchers, Cindy Peng, Qianru Lyu, Paulo F Carvalho, Kenneth R Koedinger, and Vincent Aleven. 2025. Student perceptions of adaptive goal setting recommendations: a design prototyping study. In International Conference on Artificial Intelligence in Education, pages 244--251. Springer

2025
[58]

Pappas, and Eric Wong

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2025. https://doi.org/10.1109/SaTML64287.2025.00010 Jailbreaking black box large language models in twenty queries . In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23--42

work page doi:10.1109/satml64287.2025.00010 2025
[59]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others. 2021. https://arxiv.org/abs/2107.03374 Evaluating large lang...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[60]

Yulin Chen, Ning Ding, Hai-Tao Zheng, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2024. https://doi.org/10.1145/3627673.3679665 Empowering private tutoring by chaining large language models . In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM '24, page 354–364, New York, NY, USA. Association for Computing Machinery

work page doi:10.1145/3627673.3679665 2024
[61]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[62]

Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.478 Stepwise verification and remediation of student reasoning errors with large language model tutors . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8386--8411, Miami, Florida, U...

work page doi:10.18653/v1/2024.emnlp-main.478 2024
[63]

David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, and Mrinmaya Sachan. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.15 From problem-solving to teaching problem-solving: Aligning LLM s with pedagogy using reinforcement learning . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2...

work page doi:10.18653/v1/2025.emnlp-main.15 2025
[64]

Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, and Tanja K \"a ser. 2025. Scribe: Structured chain reasoning for interactive behaviour explanations using tool calling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29285--29310

2025
[65]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, and Christopher J MacLellan. 2025. Beyond final answers: Evaluating large language models for math tutoring. In International Conference on Artificial Intelligence in Education, pages 323--337. Springer

2025
[67]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In International Conference on Learning Representations

2021
[68]

Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, and Xiangjie Kong. 2026. https://doi.org/10.1609/aaai.v40i37.40399 Eduguardbench: A holistic benchmark for evaluating the pedagogical fidelity and adversarial safety of llms as simulated teachers . Proceedings of the AAAI Conference on Artificial Intelligence, 40(37):31356--31364

work page doi:10.1609/aaai.v40i37.40399 2026
[69]

Irina Jurenka, Markus Kunesch, Kevin R McKee, Daniel Gillick, Shaojian Zhu, Sara Wiltberger, Shubham Milind Phal, Katherine Hermann, Daniel Kasenberg, Avishkar Bhoopchand, and 1 others. 2024. Towards responsible development of generative ai for education: An evaluation-driven approach. arXiv preprint arXiv:2407.12687

work page arXiv 2024
[70]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with pagedattention . In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, page 611–626, New York, ...

work page doi:10.1145/3600006.3613165 2023
[71]

Jiayu Liu, Zhenya Huang, Tong Xiao, Jing Sha, Jinze Wu, Qi Liu, Shijin Wang, and Enhong Chen. 2024. Socraticlm: Exploring socratic personalized teaching with large language models. Advances in Neural Information Processing Systems, 37:85693--85721

2024
[72]

Jakub Macina, Nico Daheim, Sankalan Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.372 M ath D ial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems . In Findings of the Association for Computational Linguistics: EMNLP 2023, pag...

work page doi:10.18653/v1/2023.findings-emnlp.372 2023
[73]

Kaushal Kumar Maurya, Kv Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. https://doi.org/10.18653/v1/2025.naacl-long.57 Unifying AI tutor evaluation: An evaluation taxonomy for pedagogical ability assessment of LLM -powered AI tutors . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comp...

work page doi:10.18653/v1/2025.naacl-long.57 2025
[74]

Qurat-ul-Ann Mirza and N Sanjay Rebello. 2025. Help or hype? students' engagement and perception of using ai to solve physics problems. arXiv preprint arXiv:2509.25642

work page arXiv 2025
[75]

Tanya Nazaretsky, Paola Mejia-Domenzain, Vinitra Swamy, Jibril Frej, and Tanja K \"a ser. 2024. Ai or human? evaluating student feedback perceptions in higher education. In European conference on technology enhanced learning, pages 284--298. Springer

2024
[76]

Zachary A Pardos and Shreya Bhandari. 2023. Learning gain differences between chatgpt and human tutor generated algebra hints. arXiv preprint arXiv:2302.06871

work page arXiv 2023
[77]

Minju Park, Sojung Kim, Seunghyun Lee, Soonwoo Kwon, and Kyuseok Kim. 2024. https://doi.org/10.1145/3613905.3651122 Empowering personalized learning through a conversation-based tutoring system with student modeling . In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI ’24, page 1–10. ACM

work page doi:10.1145/3613905.3651122 2024
[78]

Janos Perczel, Jin Chow, and Dorottya Demszky. 2025. https://arxiv.org/abs/2510.05087 Teachlm: Post-training llms for education using authentic learning data . Preprint, arXiv:2510.05087

work page arXiv 2025
[79]

Romain Puech, Jakub Macina, Julia Chatain, Mrinmaya Sachan, and Manu Kapur. 2025. https://doi.org/10.18653/v1/2025.findings-acl.1348 Towards the pedagogical steering of large language models for tutoring: A case study with modeling productive failure . In Findings of the Association for Computational Linguistics: ACL 2025, pages 26291--26311, Vienna, Aust...

work page doi:10.18653/v1/2025.findings-acl.1348 2025
[80]

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24763--24785

2025
[81]

Lianne Roest, Hieke Keuning, and Johan Jeuring. 2024. Next-step hint generation for introductory programming using large language models. In Proceedings of the 26th Australasian Computing Education Conference, pages 144--153

2024
[82]

Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, and Andrew Lan. 2025. https://doi.org/10.1007/978-3-031-98414-3_18 Training llm-based tutors to improve student learning outcomes in dialogues . In Artificial Intelligence in Education: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part I, page 251–...

work page doi:10.1007/978-3-031-98414-3_18 2025
[83]

Yao Shi, Rongkeng Liang, and Yong Xu. 2025. https://doi.org/10.18653/v1/2025.acl-long.1576 Educationq: Evaluating llms’ teaching capabilities through multi-agent dialogue framework . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 32799–32828. Association for Computational Linguistics

work page doi:10.18653/v1/2025.acl-long.1576 2025
[84]

Shashank Sonkar, Naiming Liu, Debshila Mallick, and Richard Baraniuk. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.130 CLASS : A design framework for building intelligent tutoring systems based on learning science principles . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1941--1961, Singapore. Association for Co...

work page doi:10.18653/v1/2023.findings-emnlp.130 2023
[85]

Shashank Sonkar, Kangqi Ni, Sapana Chaudhary, and Richard Baraniuk. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.797 Pedagogical alignment of large language models . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13641--13650, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.findings-emnlp.797 2024
[86]

Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, and Chen Xing. 2025. https://arxiv.org/abs/2510.02663 Tutorbench: A benchmark to assess tutoring capabilities of large language models . Prepri...

work page arXiv 2025
[87]

Anaïs Tack and Chris Piech. 2022. https://doi.org/10.5281/zenodo.6853187 The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues . In Proceedings of the 15th International Conference on Educational Data Mining, pages 522--529, Durham, United Kingdom. International Educational Data Mining Society

work page doi:10.5281/zenodo.6853187 2022
[88]

Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. https://doi.org/10.18653/v1/2024.naacl-long.120 Bridging the novice-expert gap via models of decision-making: A case study on remediating math mistakes . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: ...

work page doi:10.18653/v1/2024.naacl-long.120 2024
[89]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

2022
[90]

Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2025. https://arxiv.org/abs/2412.15115 Qwen2.5 technical report . Preprint, arXiv:2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2025
[91]

Shuzhou Yuan, William LaCroix, Hardik Ghoshal, Ercong Nie, and Michael Färber. 2025. https://arxiv.org/abs/2508.08386 Codae: Adapting large language models for education via chain-of-thought data augmentation . Preprint, arXiv:2508.08386

work page arXiv 2025
[92]

Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024. https://doi.org/10.18653/v1/2024.acl-long.773 How johnny can persuade LLM s to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLM s . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...

work page doi:10.18653/v1/2024.acl-long.773 2024
[93]

Gonzalez, and Ion Stoica

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. https://openreview.net/forum?id=uccHPGDlao Judging LLM -as-a-judge with MT -bench and chatbot arena . In Thirty-seventh Conference on Neural Information Processing Systems Da...

2023