Recognition: unknown
Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks
Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3
The pith
A fine-tuned adversarial student agent can jailbreak LLM tutors to extract complete answers, serving as the core of a new benchmark for tutor robustness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors by eliciting complete solutions rather than guided learning steps, and we propose this agent as the central element of a standardized benchmark for evaluating the robustness of LLM tutors against adversarial student attacks.
What carries the argument
The fine-tuned adversarial student agent, which adapts jailbreaking and persuasive techniques to the tutoring context to probe for answer leakage in model responses.
If this is right
- Tutor models from different families and with different alignment methods show varying degrees of susceptibility to answer leakage under these attacks.
- Simple defense strategies can be applied to reduce the rate at which tutors reveal final answers.
- Pedagogically aligned and multi-agent tutor designs may provide partial resistance compared to base models.
- A standardized benchmark built around the fine-tuned agent allows consistent comparison of robustness across future tutor systems.
Where Pith is reading between the lines
- The benchmark could be adapted to test other interactive AI systems where users might seek to bypass intended constraints.
- Widespread adoption might push developers to prioritize leakage prevention when deploying tutors in real classrooms.
- The finding that fine-tuned agents outperform in-context ones suggests similar attacker fine-tuning could improve safety testing in other conversational domains.
Load-bearing premise
The fine-tuned adversarial student agent and the chosen attack techniques faithfully represent how real students would attempt to misuse LLM tutors, and the evaluated scenarios cover the main forms of such misuse.
What would settle it
Measure answer leakage rates in live tutoring sessions where students are given access to the same prompts and strategies produced by the fine-tuned agent and compare those rates to the benchmark results.
Figures
read the original abstract
Large Language Models (LLMs) are increasingly used in education, yet their default helpfulness often conflicts with pedagogical principles. Prior work evaluates pedagogical quality via answer leakage-the disclosure of complete solutions instead of scaffolding-but typically assumes well-intentioned learners, leaving tutor robustness under student misuse largely unexplored. In this paper, we study scenarios where students behave adversarially and aim to obtain the correct answer from the tutor. We evaluate a broad set of LLM-based tutor models, including different model families, pedagogically aligned models, and a multi-agent design, under a range of adversarial student attacks. We adapt six groups of adversarial and persuasive techniques to the educational setting and use them to probe how likely a tutor is to reveal the final answer. We evaluate answer leakage robustness using different types of in-context adversarial student agents, finding that they often fail to carry out effective attacks. We therefore introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors, which we propose as the core of a standardized benchmark for evaluating tutor robustness. Finally, we present simple but effective defense strategies that reduce answer leakage and strengthen the robustness of LLM-based tutors in adversarial scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates LLM-based tutors for answer leakage robustness under adversarial student attacks, adapting six groups of adversarial/persuasive techniques to the educational domain. It finds in-context adversarial student agents largely ineffective, introduces a fine-tuned adversarial student agent as the core of a proposed standardized benchmark for tutor robustness, and outlines simple defense strategies to reduce leakage.
Significance. If the fine-tuned agent's attacks prove representative of real student misuse and the evaluation provides reproducible metrics, the work could establish a useful benchmark for hardening educational LLMs against answer leakage, addressing a gap in prior pedagogical evaluations that assume benign users. The adaptation of existing attack methods to tutoring scenarios and the identification of in-context agent limitations are constructive contributions, though the benchmark proposal's impact hinges on external validation.
major comments (2)
- [Abstract / Evaluation Setup] Abstract and evaluation setup: the claims of attack effectiveness, in-context agent failure, and fine-tuned agent superiority are presented without any quantitative results, error bars, specific metrics (e.g., leakage rates per tutor/model), model details, or data on how attacks were measured and scored. This absence is load-bearing for the central claims about robustness differences and the benchmark proposal.
- [Benchmark Proposal] Benchmark proposal (final sections): proposing the fine-tuned adversarial student agent as the core of a standardized benchmark requires evidence that it captures authentic real-world adversarial student strategies. No comparison to human adversarial interactions, ecological validity tests, or coverage analysis of misuse scenarios is provided, leaving the benchmark's representativeness unverified.
minor comments (2)
- [Methods] Clarify the exact definition and measurement protocol for 'answer leakage' (e.g., binary disclosure vs. partial hints) and how it is distinguished from legitimate scaffolding in the evaluation.
- [Agent Fine-Tuning] Provide details on the fine-tuning dataset, hyperparameters, and base model for the adversarial agent to enable reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights opportunities to strengthen the presentation of our quantitative findings and the proposed benchmark. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract / Evaluation Setup] Abstract and evaluation setup: the claims of attack effectiveness, in-context agent failure, and fine-tuned agent superiority are presented without any quantitative results, error bars, specific metrics (e.g., leakage rates per tutor/model), model details, or data on how attacks were measured and scored. This absence is load-bearing for the central claims about robustness differences and the benchmark proposal.
Authors: We acknowledge that the abstract presents findings at a high level without specific numbers. The full manuscript (Sections 3-5) contains the requested quantitative details: leakage rates per tutor and model (with standard deviations), explicit comparisons showing in-context agents' low success rates versus the fine-tuned agent's higher effectiveness, model specifications, and the leakage scoring protocol (binary detection of final-answer disclosure plus partial-credit variants). To address the concern directly, we will revise the abstract to incorporate key metrics (e.g., average leakage reduction and agent-type deltas) and add a concise paragraph in the evaluation setup that summarizes the measurement methodology with an example. revision: yes
-
Referee: [Benchmark Proposal] Benchmark proposal (final sections): proposing the fine-tuned adversarial student agent as the core of a standardized benchmark requires evidence that it captures authentic real-world adversarial student strategies. No comparison to human adversarial interactions, ecological validity tests, or coverage analysis of misuse scenarios is provided, leaving the benchmark's representativeness unverified.
Authors: We agree that external validation against human data would strengthen the benchmark proposal. The current work systematically adapts six groups of established adversarial techniques to tutoring dialogues and trains the agent on successful trajectories from those adaptations; we also provide a qualitative mapping of these techniques to documented student misuse patterns. Direct human comparisons and ecological tests are outside the scope of this study. In revision we will add an explicit coverage analysis of the adapted techniques, a dedicated limitations subsection on representativeness, and a concrete plan for future human validation studies, thereby framing the benchmark as a reproducible starting point rather than a fully validated standard. revision: partial
Circularity Check
No circularity: empirical adaptation and fine-tuning remain independent of inputs
full rationale
The paper adapts existing adversarial techniques to the tutoring domain, observes that in-context agents are ineffective, then fine-tunes a new agent and evaluates leakage rates across tutor models. This chain relies on external data and standard fine-tuning rather than defining the benchmark via its own outputs or renaming fitted parameters as predictions. No self-citations are load-bearing for the central claim, no uniqueness theorems are imported, and no ansatz or renaming patterns appear. The proposal of the fine-tuned agent as a benchmark core follows directly from the reported empirical results without reducing to constructional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Adversarial and persuasive techniques from general domains can be effectively adapted to probe educational LLM tutors for answer leakage.
invented entities (1)
-
fine-tuned adversarial student agent
no independent evidence
Reference graph
Works this paper leans on
-
[2]
Thomas and Jionghao Lin and Kenneth R
Dollaya Hirunyasiri and Danielle R. Thomas and Jionghao Lin and Kenneth R. Koedinger and Vincent Aleven , year=
-
[6]
Opportunities and Challenges in Neural Dialog Tutoring
Macina, Jakub and Daheim, Nico and Wang, Lingzhi and Sinha, Tanmay and Kapur, Manu and Gurevych, Iryna and Sachan, Mrinmaya. Opportunities and Challenges in Neural Dialog Tutoring. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.173
-
[9]
2025 , eprint=
Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation , author=. 2025 , eprint=
2025
-
[11]
2025 , eprint=
TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models , author=. 2025 , eprint=
2025
-
[12]
2025 , eprint=
Detecting LLM-Generated Short Answers and Effects on Learner Performance , author=. 2025 , eprint=
2025
-
[13]
Journal of Applied Learning & Teaching , volume=
The role of ChatGPT in higher education: Benefits, challenges, and future research directions , author=. Journal of Applied Learning & Teaching , volume=. 2023 , publisher=
2023
-
[15]
Advances in Neural Information Processing Systems , volume=
SocraticLM: Exploring socratic personalized teaching with large language models , author=. Advances in Neural Information Processing Systems , volume=
-
[16]
Proceedings of the 26th Australasian Computing Education Conference , pages=
Next-step hint generation for introductory programming using large language models , author=. Proceedings of the 26th Australasian Computing Education Conference , pages=
-
[19]
2025 , eprint=
Qwen2.5 Technical Report , author=. 2025 , eprint=
2025
-
[21]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[22]
and Wong, Eric , booktitle=
Chao, Patrick and Robey, Alexander and Dobriban, Edgar and Hassani, Hamed and Pappas, George J. and Wong, Eric , booktitle=. Jailbreaking Black Box Large Language Models in Twenty Queries , year=
-
[26]
International Conference on Artificial Intelligence in Education , pages=
Beyond final answers: Evaluating large language models for math tutoring , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=
2025
-
[29]
2025 , eprint=
Small Models, Big Support: A Local LLM Framework for Educator-Centric Content Creation and Assessment with RAG and CAG , author=. 2025 , eprint=
2025
-
[30]
2025 , eprint=
Easy Come, Easy Go? Examining the Perceptions and Learning Effects of LLM-based Chatbot in the Context of Search-as-Learning , author=. 2025 , eprint=
2025
-
[31]
Psychology and the real world: Essays illustrating fundamental contributions to society , volume=
Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning , author=. Psychology and the real world: Essays illustrating fundamental contributions to society , volume=
-
[32]
, author=
Desirable difficulties in theory and practice. , author=. Journal of Applied research in Memory and Cognition , volume=. 2020 , publisher=
2020
-
[33]
2025 , eprint=
CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation , author=. 2025 , eprint=
2025
-
[35]
2025 , eprint=
TeachLM: Post-Training LLMs for Education Using Authentic Learning Data , author=. 2025 , eprint=
2025
-
[36]
Ad- vPrompter: Fast Adaptive Adversarial Prompting for LLMs
Advprompter: Fast adaptive adversarial prompting for llms , author=. arXiv preprint arXiv:2404.16873 , year=
-
[37]
AgentHarm: A Benchmark for Measuring Harmfulness of
Maksym Andriushchenko and Alexandra Souly and Mateusz Dziemian and Derek Duenas and Maxwell Lin and Justin Wang and Dan Hendrycks and Andy Zou and J Zico Kolter and Matt Fredrikson and Yarin Gal and Xander Davies , booktitle=. AgentHarm: A Benchmark for Measuring Harmfulness of. 2025 , url=
2025
-
[38]
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) , pages=
Llm-as-a-tutor in efl writing education: Focusing on evaluation of student-llm interaction , author=. Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) , pages=
-
[40]
Simulating classroom education with llm-empowered agents , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
2025
-
[41]
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[43]
Gonzalez and Ion Stoica , booktitle=
Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging. 2023 , url=
2023
-
[44]
International Conference on Artificial Intelligence in Education , pages=
Student perceptions of adaptive goal setting recommendations: a design prototyping study , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=
2025
-
[45]
European conference on technology enhanced learning , pages=
AI or human? Evaluating student feedback perceptions in higher education , author=. European conference on technology enhanced learning , pages=. 2024 , organization=
2024
-
[46]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[48]
2025 , eprint=
Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors , author=. 2025 , eprint=
2025
-
[51]
International Conference on Learning Representations , year=
Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
-
[52]
2021 , eprint=
Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=
2021
-
[53]
Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and 1 others. 2025. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387
-
[54]
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. 2025. https://openreview.net/forum?id=AC5n7xHuR1 Agentharm: A benchmark for measuring harmfulness of LLM agents . In The Thirteenth International Conference on Learning Rep...
2025
-
[55]
Elizabeth L Bjork and Robert A Bjork. 2011. Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. Psychology and the real world: Essays illustrating fundamental contributions to society, 2(59-68):56--64
2011
-
[56]
Robert A Bjork and Elizabeth L Bjork. 2020. Desirable difficulties in theory and practice. Journal of Applied research in Memory and Cognition, 9(4):475
2020
-
[57]
Conrad Borchers, Cindy Peng, Qianru Lyu, Paulo F Carvalho, Kenneth R Koedinger, and Vincent Aleven. 2025. Student perceptions of adaptive goal setting recommendations: a design prototyping study. In International Conference on Artificial Intelligence in Education, pages 244--251. Springer
2025
-
[58]
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2025. https://doi.org/10.1109/SaTML64287.2025.00010 Jailbreaking black box large language models in twenty queries . In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23--42
-
[59]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others. 2021. https://arxiv.org/abs/2107.03374 Evaluating large lang...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[60]
Yulin Chen, Ning Ding, Hai-Tao Zheng, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2024. https://doi.org/10.1145/3627673.3679665 Empowering private tutoring by chaining large language models . In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM '24, page 354–364, New York, NY, USA. Association for Computing Machinery
-
[61]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[62]
Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.478 Stepwise verification and remediation of student reasoning errors with large language model tutors . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8386--8411, Miami, Florida, U...
-
[63]
David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, and Mrinmaya Sachan. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.15 From problem-solving to teaching problem-solving: Aligning LLM s with pedagogy using reinforcement learning . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2...
-
[64]
Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, and Tanja K \"a ser. 2025. Scribe: Structured chain reasoning for interactive behaviour explanations using tool calling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29285--29310
2025
-
[65]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, and Christopher J MacLellan. 2025. Beyond final answers: Evaluating large language models for math tutoring. In International Conference on Artificial Intelligence in Education, pages 323--337. Springer
2025
-
[67]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In International Conference on Learning Representations
2021
-
[68]
Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, and Xiangjie Kong. 2026. https://doi.org/10.1609/aaai.v40i37.40399 Eduguardbench: A holistic benchmark for evaluating the pedagogical fidelity and adversarial safety of llms as simulated teachers . Proceedings of the AAAI Conference on Artificial Intelligence, 40(37):31356--31364
-
[69]
Irina Jurenka, Markus Kunesch, Kevin R McKee, Daniel Gillick, Shaojian Zhu, Sara Wiltberger, Shubham Milind Phal, Katherine Hermann, Daniel Kasenberg, Avishkar Bhoopchand, and 1 others. 2024. Towards responsible development of generative ai for education: An evaluation-driven approach. arXiv preprint arXiv:2407.12687
-
[70]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with pagedattention . In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, page 611–626, New York, ...
-
[71]
Jiayu Liu, Zhenya Huang, Tong Xiao, Jing Sha, Jinze Wu, Qi Liu, Shijin Wang, and Enhong Chen. 2024. Socraticlm: Exploring socratic personalized teaching with large language models. Advances in Neural Information Processing Systems, 37:85693--85721
2024
-
[72]
Jakub Macina, Nico Daheim, Sankalan Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.372 M ath D ial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems . In Findings of the Association for Computational Linguistics: EMNLP 2023, pag...
-
[73]
Kaushal Kumar Maurya, Kv Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. https://doi.org/10.18653/v1/2025.naacl-long.57 Unifying AI tutor evaluation: An evaluation taxonomy for pedagogical ability assessment of LLM -powered AI tutors . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comp...
- [74]
-
[75]
Tanya Nazaretsky, Paola Mejia-Domenzain, Vinitra Swamy, Jibril Frej, and Tanja K \"a ser. 2024. Ai or human? evaluating student feedback perceptions in higher education. In European conference on technology enhanced learning, pages 284--298. Springer
2024
- [76]
-
[77]
Minju Park, Sojung Kim, Seunghyun Lee, Soonwoo Kwon, and Kyuseok Kim. 2024. https://doi.org/10.1145/3613905.3651122 Empowering personalized learning through a conversation-based tutoring system with student modeling . In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI ’24, page 1–10. ACM
- [78]
-
[79]
Romain Puech, Jakub Macina, Julia Chatain, Mrinmaya Sachan, and Manu Kapur. 2025. https://doi.org/10.18653/v1/2025.findings-acl.1348 Towards the pedagogical steering of large language models for tutoring: A case study with modeling productive failure . In Findings of the Association for Computational Linguistics: ACL 2025, pages 26291--26311, Vienna, Aust...
-
[80]
Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24763--24785
2025
-
[81]
Lianne Roest, Hieke Keuning, and Johan Jeuring. 2024. Next-step hint generation for introductory programming using large language models. In Proceedings of the 26th Australasian Computing Education Conference, pages 144--153
2024
-
[82]
Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, and Andrew Lan. 2025. https://doi.org/10.1007/978-3-031-98414-3_18 Training llm-based tutors to improve student learning outcomes in dialogues . In Artificial Intelligence in Education: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part I, page 251–...
-
[83]
Yao Shi, Rongkeng Liang, and Yong Xu. 2025. https://doi.org/10.18653/v1/2025.acl-long.1576 Educationq: Evaluating llms’ teaching capabilities through multi-agent dialogue framework . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 32799–32828. Association for Computational Linguistics
-
[84]
Shashank Sonkar, Naiming Liu, Debshila Mallick, and Richard Baraniuk. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.130 CLASS : A design framework for building intelligent tutoring systems based on learning science principles . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1941--1961, Singapore. Association for Co...
-
[85]
Shashank Sonkar, Kangqi Ni, Sapana Chaudhary, and Richard Baraniuk. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.797 Pedagogical alignment of large language models . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13641--13650, Miami, Florida, USA. Association for Computational Linguistics
-
[86]
Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, and Chen Xing. 2025. https://arxiv.org/abs/2510.02663 Tutorbench: A benchmark to assess tutoring capabilities of large language models . Prepri...
-
[87]
Anaïs Tack and Chris Piech. 2022. https://doi.org/10.5281/zenodo.6853187 The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues . In Proceedings of the 15th International Conference on Educational Data Mining, pages 522--529, Durham, United Kingdom. International Educational Data Mining Society
-
[88]
Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. https://doi.org/10.18653/v1/2024.naacl-long.120 Bridging the novice-expert gap via models of decision-making: A case study on remediating math mistakes . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: ...
-
[89]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837
2022
-
[90]
Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2025. https://arxiv.org/abs/2412.15115 Qwen2.5 technical report . Preprint, arXiv:2412.15115
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [91]
-
[92]
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024. https://doi.org/10.18653/v1/2024.acl-long.773 How johnny can persuade LLM s to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLM s . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...
-
[93]
Gonzalez, and Ion Stoica
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. https://openreview.net/forum?id=uccHPGDlao Judging LLM -as-a-judge with MT -bench and chatbot arena . In Thirty-seventh Conference on Neural Information Processing Systems Da...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.