pith. machine review for the scientific record.

arxiv: 2604.12066 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.CY


Mathematics Teachers' Interactions with a Multi-Agent System for Personalized Problem Generation


Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CY
keywords multi-agent systems · personalized learning · mathematics education · large language models · teacher-in-the-loop · problem generation · authenticity evaluation · educational AI

The pith

Teachers and students want to tweak fine-grained real-world details in AI-generated math problems despite agent reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests a multi-agent system in which a teacher enters a base middle-school math problem and a topic, an LLM creates a personalized version, and four specialized agents then evaluate it for mathematical accuracy, authenticity, readability, and realism. Eight teachers used the system to build 212 problems in ASSISTments and assigned them to students. Both teachers and students commonly asked to change specific real-world context elements, which the authors interpret as persistent problems with authenticity and fit. The agents flagged many realism issues during generation, yet users reported few such issues in the final problems, and readability problems and mathematical errors were uncommon.

Core claim

Eight middle school mathematics teachers created 212 problems with a teacher-in-the-loop multi-agent system in which an LLM generates personalized problems and four agents review them for accuracy, authenticity, readability, and realism. Teachers and students wanted to modify fine-grained real-world context elements, signaling issues with authenticity and fit. Although the agents detected many realism issues while problems were being written, few realism issues appeared in the final versions that teachers and students reviewed. Issues with readability and mathematical hallucinations were rare.

What carries the argument

A teacher-in-the-loop multi-agent system that uses an LLM to generate personalized math problems from a teacher-supplied base problem and topic, followed by four AI agents that each specialize in one evaluation criterion: mathematical accuracy, authenticity, readability, or realism.
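The pipeline described above can be sketched in code. This is a hypothetical illustration of the generate-then-review loop, not the authors' implementation (the paper does not publish its code); the function names, prompts, and the revise-on-critique loop are all assumptions, and `llm` stands in for any chat-completion callable.

```python
# Hypothetical sketch of the teacher-in-the-loop pipeline: an LLM drafts a
# personalized problem, four specialist agents each review one criterion,
# and flagged drafts are revised before the teacher sees the result.
CRITERIA = ("mathematical accuracy", "authenticity", "readability", "realism")

def generate(llm, base_problem, topic):
    """LLM drafts a personalized version of the teacher's base problem."""
    return llm(f"Rewrite this problem around the topic '{topic}': {base_problem}")

def review(llm, problem, criterion):
    """One specialist agent checks its single criterion; returns (ok, critique)."""
    verdict = llm(f"Check the {criterion} of this problem: {problem}. "
                  f"Reply 'OK' or give a critique.")
    return (verdict.strip() == "OK", verdict)

def personalize(llm, base_problem, topic, max_rounds=3):
    """Generate, then loop: collect agent critiques and revise until clean."""
    draft = generate(llm, base_problem, topic)
    for _ in range(max_rounds):
        critiques = [c for crit in CRITERIA
                     for ok, c in [review(llm, draft, crit)] if not ok]
        if not critiques:          # all four agents approved
            break
        draft = llm(f"Revise to address: {critiques}\nProblem: {draft}")
    return draft  # the teacher still edits and approves before assignment
```

The key design point the paper emphasizes is the last line: agent review filters drafts, but the teacher retains final control before a problem reaches students.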

If this is right

  • The system enables teachers to produce large sets of personalized problems while retaining final control over content.
  • Agent reviews during generation can remove most realism problems before problems reach students.
  • Readability and mathematical accuracy problems remain infrequent in the output of the current system.
  • Fine-grained personalization of real-world contexts still requires human oversight for acceptable authenticity.
  • Multi-agent designs can support teacher control in educational personalization workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Strengthening the realism agent's sensitivity to fine details of everyday contexts could reduce the need for later edits.
  • The pattern of requested changes could be used as training data to improve future versions of the agents.
  • The same multi-agent structure might be adapted to generate and review problems in subjects beyond middle-school mathematics.

Load-bearing premise

That teachers' and students' self-reported desires to modify specific real-world context elements reliably indicate shortcomings in authenticity and fit that the four-agent system failed to resolve.

What would settle it

A controlled comparison that measures the rate of context-modification requests when the same base problems are generated with the full four-agent system versus a version that omits the realism agent.

read the original abstract

Large language models can increasingly adapt educational tasks to learners' characteristics. In the present study, we examine a multi-agent teacher-in-the-loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine-grained personalized elements of the real-world context of the problems, signaling issues with authenticity and fit. Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi-agent systems for personalization that support teacher control are given.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes a multi-agent LLM-based system for personalizing middle school math problems in which a teacher supplies a base problem and topic, the LLM generates a version, and four specialized agents then evaluate it for mathematical accuracy, authenticity, readability, and realism. Eight teachers used the system to create 212 problems in ASSISTments that were assigned to students; the authors report that both teachers and students frequently wanted to modify fine-grained real-world context elements (signaling authenticity/fit problems) while noting few realism issues in the final versions, even though the agents had flagged many realism problems during generation. Readability and mathematical hallucination issues were also described as rare.

Significance. If the central empirical claims can be substantiated with the missing methodological and revision data, the work would offer a useful descriptive account of how teachers interact with a teacher-in-the-loop multi-agent personalization pipeline in an authentic classroom setting. It would highlight both the value of agent-based critique for catching certain classes of problems and the persistent difficulty of generating authentic real-world contexts that satisfy teachers and students without further editing.

major comments (2)
  1. [Abstract] Abstract and study description: the central claim that 'although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions' cannot be evaluated because the manuscript supplies no counts of how many problems were revised after agent feedback, no breakdown of which agent critiques prompted changes, and no baseline comparison without the agents. Without this linkage, the drop in reported realism issues cannot be attributed to the four-agent pipeline rather than teacher selection or independent editing.
  2. [Abstract] Abstract and methods description: no information is provided on data collection procedures for teacher and student feedback, how open-ended comments were coded for 'realism issues' versus 'authenticity/fit' versus other categories, inter-rater reliability, or sample characteristics beyond the count of eight teachers. These omissions leave the quantitative and qualitative findings without visible evidential support.
minor comments (2)
  1. [Abstract] The abstract states that 'issues with readability and mathematical hallucinations were also somewhat rare' but does not define the thresholds or coding scheme used to classify an issue as 'rare.'
  2. The manuscript would benefit from a brief description of the exact prompts or criteria given to each of the four agents so that readers can assess how 'realism' and 'authenticity' were operationalized.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have revised the paper to address the concerns about evidential support and methodological transparency. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract and study description: the central claim that 'although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions' cannot be evaluated because the manuscript supplies no counts of how many problems were revised after agent feedback, no breakdown of which agent critiques prompted changes, and no baseline comparison without the agents. Without this linkage, the drop in reported realism issues cannot be attributed to the four-agent pipeline rather than teacher selection or independent editing.

    Authors: We agree that the abstract claim benefits from explicit linkage to the underlying data. In the revised manuscript we have added a summary table reporting the total number of problems revised after agent feedback and a breakdown by agent (including how many realism critiques resulted in changes). We also include illustrative examples showing specific agent flags and the corresponding teacher edits. As this was an observational field study of the deployed system in authentic classrooms rather than a controlled experiment, we did not collect data under a no-agent baseline condition; we have now explicitly noted this as a limitation in the Discussion and clarified that the reported pattern is descriptive (high agent flagging during generation versus low incidence of realism issues in final teacher- and student-reported versions). We maintain that the observed discrepancy still offers useful evidence of the pipeline's practical value in prompting teacher attention to realism concerns. revision: yes

  2. Referee: [Abstract] Abstract and methods description: no information is provided on data collection procedures for teacher and student feedback, how open-ended comments were coded for 'realism issues' versus 'authenticity/fit' versus other categories, inter-rater reliability, or sample characteristics beyond the count of eight teachers. These omissions leave the quantitative and qualitative findings without visible evidential support.

    Authors: We acknowledge these omissions in the original submission. The revised Methods section now provides a complete account of data collection: teacher feedback was logged via the system interface and supplemented by post-session surveys, while student feedback was obtained through optional free-text comments within the ASSISTments platform. We describe the coding protocol used to categorize comments into realism issues, authenticity/fit concerns, readability, mathematical accuracy, and other categories, with example excerpts for each. We report the inter-rater reliability achieved by the two coders on a sampled subset of comments. Sample characteristics have been expanded to include teachers' years of experience, grade levels taught, and school contexts. These additions make the quantitative and qualitative results traceable to the raw data. revision: yes

standing simulated objections not resolved
  • The study design does not include a no-agent baseline condition, so direct causal attribution of the reduction in realism issues to the four-agent pipeline cannot be provided.

Circularity Check

0 steps flagged

No circularity: purely descriptive empirical report with no derivations or self-referential claims

full rationale

The paper reports on an empirical deployment of a multi-agent LLM system for math problem generation, describing teacher creation of 212 problems and qualitative feedback on desired modifications. No equations, fitted parameters, predictions, uniqueness theorems, or first-principles derivations are present. Central observations (agents flagged realism issues during generation; final versions showed few such issues per teachers/students) are presented as direct outcomes of the user study data rather than any chain that reduces to its own inputs by definition or self-citation. The work is self-contained as an observational report and contains no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical user study with no mathematical models, fitted parameters, or new postulated entities; it rests on standard domain assumptions about the validity of teacher and student self-reports in educational research.

axioms (1)
  • Domain assumption: teacher and student self-reports about desired modifications accurately reflect underlying issues with problem authenticity and fit.
    The central finding that authenticity and fit remain problematic is drawn directly from these reports.

pith-pipeline@v0.9.0 · 5491 in / 1261 out tokens · 39058 ms · 2026-05-10T15:38:43.280845+00:00 · methodology

