pith. the verified trust layer for science.

arxiv: 2602.13280 · v2 · submitted 2026-02-06 · 💻 cs.AI

BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation

Pith reviewed 2026-05-16 07:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords student simulation · self-regulated learning · Bayesian knowledge tracing · neuro-symbolic agents · programming education · learner emulation · Turing test

The pith

BEAGLE creates simulated student learning behaviors in programming that are indistinguishable from real novice data in human tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEAGLE as a framework to simulate how novice students tackle open-ended programming tasks. It builds on self-regulated learning theory by using a semi-Markov model to time cognitive and metacognitive steps, injecting explicit flaws into knowledge tracking to create gaps and unknowns, and decoupling high-level strategy from code generation so errors remain intentional. This counters the tendency of large language models to produce overly competent and efficient solutions instead of the messy, iterative process of real learners. If the approach holds, researchers could generate large volumes of realistic learning data without the costs or privacy issues of collecting authentic student traces. The central evidence is a human Turing test in which participants classified the simulated traces at chance levels.

Core claim

BEAGLE integrates a semi-Markov model for behavior timing and transitions, Bayesian Knowledge Tracing with flaw injection to enforce realistic knowledge gaps, and a decoupled agent design that separates strategy use from code actions. On Python programming tasks, this produces trajectories that participants in a human Turing test could not reliably distinguish from real student data: classification accuracy was 52.8 percent, statistically equivalent to chance.

What carries the argument

The neuro-symbolic architecture that combines a semi-Markov model for cognitive and metacognitive behavior transitions, flaw-injected Bayesian Knowledge Tracing for knowledge gaps, and a decoupled agent that separates high-level strategy from code generation.
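To make the semi-Markov component more concrete, the following is a minimal Python sketch of sampling a behavior trace in which each state has its own dwell-time distribution in addition to a transition distribution. The state names follow the behavior labels in the paper's figure captions (Planning, Enacting, Monitoring, Reflecting). The transition probabilities, the dwell-time tables, and the names `sample_trace` and `_draw` are illustrative assumptions, not the paper's fitted parameters or implementation.

```python
import random

# State labels taken from the figure captions; everything numeric below is assumed.
STATES = ["PLANNING", "ENACTING", "MONITORING", "REFLECTING"]

# Assumed transition probabilities between *different* states. There are no
# self-loops: in a semi-Markov chain, time spent in a state is governed by the
# dwell distribution rather than by repeated self-transitions.
TRANSITIONS = {
    "PLANNING":   {"ENACTING": 0.7, "MONITORING": 0.2, "REFLECTING": 0.1},
    "ENACTING":   {"PLANNING": 0.3, "MONITORING": 0.6, "REFLECTING": 0.1},
    "MONITORING": {"PLANNING": 0.3, "ENACTING": 0.5, "REFLECTING": 0.2},
    "REFLECTING": {"PLANNING": 0.6, "ENACTING": 0.3, "MONITORING": 0.1},
}

# Assumed dwell-time distributions per state: {number of steps: probability}.
DWELL = {
    "PLANNING":   {1: 0.5, 2: 0.3, 3: 0.2},
    "ENACTING":   {1: 0.2, 2: 0.3, 4: 0.3, 6: 0.2},
    "MONITORING": {1: 0.4, 2: 0.4, 3: 0.2},
    "REFLECTING": {1: 0.8, 2: 0.2},
}


def _draw(dist, rng):
    """Draw one key from a {value: probability} mapping."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_trace(total_steps, start="PLANNING", seed=None):
    """Return (state, dwell) segments whose dwell times sum to total_steps."""
    rng = random.Random(seed)
    segments, state, used = [], start, 0
    while used < total_steps:
        # Clamp the last segment so the trace ends exactly at total_steps.
        dwell = min(_draw(DWELL[state], rng), total_steps - used)
        segments.append((state, dwell))
        used += dwell
        state = _draw(TRANSITIONS[state], rng)
    return segments


trace = sample_trace(total_steps=30, seed=0)
for state, dwell in trace:
    print(f"{state:<11} x{dwell}")
```

Explicit dwell tables are used here so that segment lengths need not follow the geometric distribution that an ordinary Markov chain with self-loops would imply; any other duration model could be substituted.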

If this is right

  • Adaptive tutoring systems can be trained and evaluated using generated trajectories instead of scarce real student logs.
  • Pedagogical interventions can be stress-tested across many simulated learner paths that exhibit realistic error patterns.
  • Privacy risks in education research decrease because authentic longitudinal data collection can be replaced by synthetic equivalents.
  • The framework demonstrates that enforcing iterative struggle and intentional mistakes improves fidelity over standard language-model simulations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same behavior-enforcement approach could be adapted to simulate novice learners in non-programming domains such as mathematics problem solving.
  • Integrating the simulator into live tutoring loops might allow it to refine its own knowledge-gap modeling from observed student responses over time.
  • The generated traces could serve as a benchmark for measuring how well other AI systems capture the timing and sequence of metacognitive shifts in learning.

Load-bearing premise

That operationalizing self-regulated learning theory through the semi-Markov model, flaw-injected Bayesian knowledge tracing, and decoupled agent design will produce trajectories that are distributionally and behaviorally indistinguishable from authentic novice learners.

What would settle it

Running the same Turing test on additional independent sets of real and simulated Python programming traces with new participant groups to determine whether classification accuracy remains at chance levels.
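A replication of this kind would recompute the same two summary statistics the paper reports: raw classification accuracy and the signal-detection sensitivity d'. The sketch below shows one way to do that with the Python standard library. The response counts are made up purely to produce numbers of a plausible magnitude; they are not the paper's data, and the helper names `d_prime` and `binom_two_sided_p` are hypothetical.

```python
from math import comb
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index: z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


def binom_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Made-up counts, with "simulated" treated as the signal class.
hits, misses = 55, 45                      # simulated traces judged simulated / real
false_alarms, correct_rejections = 49, 51  # real traces judged simulated / real

trials = hits + misses + false_alarms + correct_rejections
accuracy = (hits + correct_rejections) / trials
sensitivity = d_prime(hits, misses, false_alarms, correct_rejections)
p_value = binom_two_sided_p(hits + correct_rejections, trials)
print(f"accuracy={accuracy:.3f}  d'={sensitivity:.2f}  p={p_value:.3f}")
```

A non-significant binomial test does not by itself demonstrate equivalence to chance. That stronger claim needs an equivalence test, for example two one-sided tests against a margin chosen in advance, and pooled counts like these also ignore the fact that responses from the same participant are not independent.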

Figures

Figures reproduced from arXiv: 2602.13280 by Clayton Cohn, Gautam Biswas, Hanchen David Wang, Meiyi Ma, Siyuan Guo, Zifan Xu.

Figure 1: Competency bias: real students debug persistently (D→D: 28%) while LLMs construct linearly. C (Constructing), D (Debugging), A (Assessing). Self-Regulated Learning (SRL) theory [55] provides a principled framework for this view, positing that learning unfolds as a cyclic process involving cognition (attempting tasks), metacognition (monitoring progress), and motivation and affect (engagement and seeking …
Figure 2: Overview of the BEAGLE agent architecture. Symbolic Control (left) governs high-level behavior via semi-Markov metacognitive and cognitive behavior state machines, with BKT-based knowledge tracking. Neural Action (right) factors generation into a two-stage pipeline: a Strategist that emits a goal, mindset, and directive, and an Executor that produces code and monologue conditioned on that directive (prompt…
Figure 3: Cognitive behavior transitions. BEAGLE (blue) closely matches real data. [Plot: frequency (%) of the transitions P→P, E→E, P→E, E→P, P→M, M→P, P→R, R→P for Real, BEAGLE, Vanilla+M, CoT+M, and FS+M.]
Figure 5: Behavioral DNA of cognitive behaviors. Longer contiguous blocks = closer to authentic behavior; bolded segments are debugging loops. The linear construction trap manifests clearly in our baselines. Vanilla solves problems in ≈6 steps at 100% accuracy (…
Figure 7: BKT mastery growth conditioned on dominant strategy (Particle Sim., pooled across six backbones). Orange = low performer, blue = high performer; shaded halos show ±1σ across runs. Green capsules: rubric mastery from the Cohn block-based study (Inter-Quartile Range with median dot).
Figure 6: Metacognitive distribution. HIGH succeed via PLANNING; LOW trap in ENACTING. Performance Differentiation. Beyond aggregate fidelity, BEAGLE separates HIGH and LOW profiles by strategy: HIGH performers allocate 72% of steps to PLANNING and MONITORING, while LOW performers spend 50.8% trapped in ENACTING (…
Figure 8: Turing test SDT and participant composition (%); N=71. In the LLM-as-Judge evaluation, BEAGLE traces achieve the highest realism score (2.44/3.00, …
Figure 9: Cross-task generalization (BEAGLE vs. best non-BEAGLE baseline, per task). • Best baseline (CoT+M or CoderAgent, whichever is stronger per cell); • BEAGLE. Connecting lines show the gap; translucent halos show ±1 std. All simulations are generated using Gemini 2.5 Flash. Across the four tasks, BEAGLE shrinks behavioral divergence by ≥ 3×, raises error recurrence to ≥ 79%, and improves all three perceptual …
Figure 10: Cross-model generalization on Particle Simulator (N = 50, T = 30). Bars show BEAGLE’s fidelity scores for each backbone across the same five metrics as …
Figure 11: EFI forces authentic knowledge gaps. Without EFI, the agent trivially uses math.radians(). With EFI, the agent improvises (incorrect) manual approximations. Full comparison in App…
Figure 12: Distribution of temporal gaps between consecutive actions. The 30-second threshold (dashed) …
Figure 13: Cognitive transition dynamics within MONITORING. LOW performers exhibit “debugging loops” (87% self-loop), while HIGH performers pivot effectively (60% escape rate).
Figure 14: Actual CV of segment durations compared to geometric prediction (gray band, …
Figure 15: Interrupt probability distributions. ASSISTANCE peaks mid-session (µ = 0.5) while OFF-TOPIC peaks late (µ = 0.73). High performers (solid) request assistance more; low performers (dashed) disengage more. A.3 Interrupt Modeling: Beyond the core metacognitive behaviors (Planning, Monitoring, Reflecting, Enacting), student behavior includes two interrupt states that temporally punctuate the learning flow: 1. …
Figure 16: Extended behavioral DNA comparison across all baselines. Each row shows cognitive behavior …
Figure 17: Navigating unknown unknowns via EFI. Without the flaw, the agent uses standard library functions.
Figure 18: BKT-behavior alignment. Test pass rates correlate with knowledge probability …
Figure 19: Distribution of ENACTING→ENACTING self-loop rates by ρ_behav. Despite significant group-level differences (p < .001), distributions overlap; the realized transition pattern, not the assigned profile, determines performance. … while acknowledging that findings should be validated with human learners before drawing strong conclusions about tutor design. H Human Turing Test: To rigorously assess perceptual fide…
Figure 20: Confusion matrix showing response counts and percentages. The high False Alarm rate (28.3%) …
Original abstract

Simulating student learning behaviors in open-ended problem-solving environments holds potential for education research, from training adaptive tutoring systems to stress-testing pedagogical interventions. However, collecting authentic data is challenging due to privacy concerns and the high cost of longitudinal studies. While Large Language Models (LLMs) offer a promising path to student simulation, they suffer from competency bias, optimizing for efficient correctness rather than the erratic, iterative struggle characteristic of novice learners. We present BEAGLE, a neuro-symbolic framework that addresses this bias by incorporating Self-Regulated Learning (SRL) theory into a novel architecture. BEAGLE integrates three key technical innovations: (1) a semi-Markov model that governs the timing and transitions of cognitive behaviors and metacognitive behaviors; (2) Bayesian Knowledge Tracing with explicit flaw injection to enforce realistic knowledge gaps and "unknown unknowns"; and (3) a decoupled agent design that separates high-level strategy use from code generation actions to prevent the model from silently correcting its own intentional errors. In evaluations on Python programming tasks, BEAGLE significantly outperforms state-of-the-art baselines in reproducing authentic trajectories. In a human Turing test, participants could not reliably tell BEAGLE traces apart from real student data: classification accuracy was statistically equivalent to chance (52.8%, d' = 0.15, N = 71).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces BEAGLE, a neuro-symbolic agent framework for emulating novice student behavior in open-ended Python programming tasks. It combines a semi-Markov model for timing and transitions of cognitive and metacognitive behaviors, Bayesian Knowledge Tracing with explicit flaw injection to model knowledge gaps and unknown unknowns, and a decoupled architecture separating high-level strategy from low-level code actions. The central claim is that these components mitigate LLM competency bias, enabling BEAGLE to generate trajectories that outperform baselines and are statistically indistinguishable from real student data in a human Turing test (52.8% classification accuracy, d'=0.15, N=71).

Significance. If the indistinguishability result holds under more rigorous quantitative validation, the work offers a practical method for generating privacy-preserving synthetic student data. This could support training of adaptive tutors, stress-testing of pedagogical interventions, and controlled experiments in education research. The explicit grounding in SRL theory and the three technical mechanisms for enforcing realistic struggle represent a clear advance over purely prompt-based LLM simulators.

major comments (1)
  1. [Evaluation] Evaluation section: The headline indistinguishability claim rests entirely on the human Turing test (52.8% accuracy, d'=0.15, N=71). No quantitative comparisons are reported for the load-bearing observables that the three technical components are designed to control, such as semi-Markov dwell-time distributions, metacognitive transition matrices, or knowledge-gap lifetime histograms between BEAGLE traces and real student logs. Without these matches, the Turing-test outcome alone cannot confirm that the semi-Markov timing model, flaw-injected BKT, and decoupled design removed systematic LLM biases rather than merely producing traces that non-expert judges could not distinguish in short, decontextualized presentations.
minor comments (2)
  1. [Abstract] The abstract states that BEAGLE 'significantly outperforms state-of-the-art baselines' but supplies no table or figure reference for the specific metrics, effect sizes, or statistical tests used in that comparison.
  2. [Method] Notation for the semi-Markov transition and duration parameters is introduced without an explicit equation or parameter table, making it difficult to assess how many free parameters are involved or how they were fit.
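The quantitative comparison requested in the major comment could take many forms. As one hedged illustration, the Python sketch below estimates first-order transition matrices from labeled behavior sequences and compares them row by row with the Jensen-Shannon divergence. The state labels C, D, and A come from the Figure 1 caption; the example sequences are invented, and the choice of Jensen-Shannon divergence is an assumption, since the metric used in the paper is not stated here.

```python
from collections import Counter
from math import log2

STATES = ["C", "D", "A"]  # Constructing, Debugging, Assessing (Figure 1 labels)


def transition_matrix(sequences, states=STATES):
    """Estimate row-normalized transition probabilities from label sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive (source, target) pairs
    matrix = {}
    for src in states:
        row_total = sum(counts[(src, dst)] for dst in states)
        # A state that never occurs as a source gets an all-zero row.
        matrix[src] = {dst: (counts[(src, dst)] / row_total if row_total else 0.0)
                       for dst in states}
    return matrix


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions with the same keys."""
    def kl(a, b):
        return sum(a[k] * log2(a[k] / b[k]) for k in a if a[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented sequences: the "real" ones loop in Debugging, the "simulated" ones do not.
real_sequences = ["CDDDACDDDA", "CCDDADDDCA"]
simulated_sequences = ["CCCDACCA", "CCCCADCA"]

real_tm = transition_matrix(real_sequences)
sim_tm = transition_matrix(simulated_sequences)
row_js = {s: js_divergence(real_tm[s], sim_tm[s]) for s in STATES}

for state in STATES:
    print(state, f"D->D real={real_tm['D']['D']:.2f}" if state == "D" else "",
          f"JS={row_js[state]:.3f}")
```

The same pattern extends to dwell-time distributions or knowledge-gap lifetimes by replacing the transition rows with histograms over durations.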

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their careful reading and valuable feedback. We address the major comment on the evaluation section below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline indistinguishability claim rests entirely on the human Turing test (52.8% accuracy, d'=0.15, N=71). No quantitative comparisons are reported for the load-bearing observables that the three technical components are designed to control, such as semi-Markov dwell-time distributions, metacognitive transition matrices, or knowledge-gap lifetime histograms between BEAGLE traces and real student logs. Without these matches, the Turing-test outcome alone cannot confirm that the semi-Markov timing model, flaw-injected BKT, and decoupled design removed systematic LLM biases rather than merely producing traces that non-expert judges could not distinguish in short, decontextualized presentations.

    Authors: We thank the referee for highlighting this important aspect of validation. The manuscript does report that BEAGLE outperforms baselines in reproducing authentic trajectories, but we concede that explicit matches to the specific observables (dwell times, transition matrices, knowledge gap histograms) are not presented. We agree this would provide stronger support for the claim that our three components mitigate LLM competency bias. In the revised manuscript, we will add these quantitative comparisons, showing alignment between BEAGLE and real data on these metrics. This will complement the Turing test results and address the concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity: model architecture drawn from external SRL theory and evaluated via independent human test

full rationale

The paper constructs BEAGLE by embedding Self-Regulated Learning theory into a semi-Markov timing model, flaw-injected Bayesian Knowledge Tracing, and a decoupled strategy-action agent. These components are motivated by cited external educational psychology literature rather than by any equation or parameter that is later used to define the Turing-test success metric. The reported human classification result (52.8% accuracy, d' = 0.15) functions solely as an external validation step and is not referenced in the model's design equations or fitting procedure. No self-citation chain, self-definitional loop, or fitted-input-renamed-as-prediction appears in the derivation. The central claim therefore remains independent of its own evaluation outcome.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that SRL theory can be faithfully translated into the listed mechanisms to override LLM competency bias, plus the assumption that the resulting traces will match real student distributions.

free parameters (2)
  • semi-Markov transition and duration parameters
    Timing and switching probabilities between cognitive and metacognitive behaviors must be set or fitted to produce realistic trajectories.
  • flaw injection rates in Bayesian Knowledge Tracing
    Parameters controlling the probability and type of knowledge gaps and unknown unknowns are required to enforce realistic errors.
axioms (1)
  • domain assumption: Self-Regulated Learning theory provides an accurate and sufficient model of novice learner cognitive and metacognitive behaviors
    The entire framework is built on incorporating SRL to counteract competency bias.
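To show where the flaw-injection parameters in this ledger would enter, here is a minimal Python sketch of the standard Bayesian Knowledge Tracing update together with one possible way of injecting a flaw. The update equations are the textbook BKT formulas. The flaw mechanism (capping the initial mastery of selected skills at a low ceiling), the parameter values, the skill names, and the names `BKTParams`, `bkt_update`, and `init_mastery` are all assumptions made for illustration; the paper's actual mechanism may differ.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    p_init: float = 0.2    # P(L0): prior probability the skill is mastered
    p_learn: float = 0.15  # P(T): probability of learning at each opportunity
    p_guess: float = 0.2   # P(G): correct response without mastery
    p_slip: float = 0.1    # P(S): incorrect response despite mastery


def bkt_update(p_mastery, correct, params):
    """One BKT step: Bayesian posterior given the observation, then learning."""
    if correct:
        num = p_mastery * (1 - params.p_slip)
        den = num + (1 - p_mastery) * params.p_guess
    else:
        num = p_mastery * params.p_slip
        den = num + (1 - p_mastery) * (1 - params.p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * params.p_learn


def init_mastery(skills, flawed_skills, params, flaw_ceiling=0.05):
    """Hypothetical flaw injection: flawed skills start at or below flaw_ceiling.

    Note that the learning step can still raise a flawed skill over time.
    """
    return {
        skill: min(params.p_init, flaw_ceiling) if skill in flawed_skills else params.p_init
        for skill in skills
    }


params = BKTParams()
mastery = init_mastery(["loops", "degrees_to_radians"], {"degrees_to_radians"}, params)

observations = [("loops", True), ("degrees_to_radians", False), ("loops", True)]
for skill, correct in observations:
    mastery[skill] = bkt_update(mastery[skill], correct, params)

print(mastery)
```

In a design like this, the free parameters are the four BKT probabilities per skill plus whatever controls which skills are flawed and how low they are capped, which is why the ledger counts flaw injection rates separately.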

pith-pipeline@v0.9.0 · 5552 in / 1449 out tokens · 36471 ms · 2026-05-16T07:05:48.137621+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 4 internal anchors

  1. [1]

    Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In ICML, pages 337–371. PMLR, 2023

  2. [2]

    David Azcona and Alan Smeaton. +5 Million Python & Bash Programming Submissions for 5 Courses & Grades for Computer-Based Exams over 3 academic years., 7 2020. URL https://figshare.com/articles/dataset/_5_Million_Python_Bash_Programming_Submissions_for_5_Courses_Grades_for_Computer-Based_Exams_over_3_academic_years_/12610958

  3. [3]

    Joseph E Beck and Yue Gong. Wheel-spinning: Students who fail to master a skill. In AIED, pages 431–440. Springer, 2013

  4. [4]

    Antonin Berthon and Mihaela van der Schaar. Language bottleneck models: A framework for interpretable knowledge tracing and beyond. arXiv preprint arXiv:2506.16982, 2025

  5. [5]

    Jessica Blom-Hoffman, Stephen S Leff, Debra L Franko, Elana Weinstein, Kelly Beakley, and Thomas J Power. Consent procedures and participation rates in school-based intervention and prevention research: using a multi-component, partnership-based approach to recruit participants. School mental health, 1(1):3–15, 2009

  6. [6]

    Tyler Burleigh, Jenny Han, and Kristen DiCerbo. Beyond the hint: Using self-critique to constrain llm feedback in conversation-based assessment. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers, pages 79–85, 2025

  7. [7]

    Arijit Chakma, Peng He, Honglu Liu, Zeyuan Wang, Tingting Li, Tiffany D Do, and Feng Liu. Drawsim-pd: Simulating student science drawings to support ngss-aligned teacher diagnostic reasoning. arXiv preprint arXiv:2602.01578, 2026

  8. [8]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    Myra Cheng, Tiziano Piccardi, and Diyi Yang. Compost: Characterizing and evaluating caricature in llm simulations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10853–10875, 2023

  10. [10]

    Clayton Cohn, Siyuan Guo, Surya Rayala, Hanchen David Wang, Naveeduddin Mohammed, Umesh Timalsina, Shruti Jain, Angela Eeds, Menton Deweese, Pamela J Osborn Popp, et al. Evidence-decision-feedback: Theory-driven adaptive scaffolding for llm agents. arXiv preprint arXiv:2602.01415, 2026

  11. [11]

    Clayton Cohn, Surya Rayala, Namrata Srivastava, Joyce Horn Fonteles, Shruti Jain, Xinying Luo, Divya Mereddy, Naveeduddin Mohammed, and Gautam Biswas. A theory of adaptive scaffolding for llm-based pedagogical agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 1757–1765, 2026

  12. [12]

    Albert T Corbett and John R Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction, 4(4):253–278, 1994

  13. [13]

    Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer, 1992

  14. [14]

    Joyce Ehrlinger, Kerri Johnson, Matthew Banner, David Dunning, and Justin Kruger. Why the unskilled are unaware: Further explorations of (absent) self-insight among the incompetent. Organizational behavior and human decision processes, 105(1):98–121, 2008

  15. [15]

    Louis Faucon, Lukasz Kidzinski, and Pierre Dillenbourg. Semi-markov model for simulating mooc students. In EDM, pages 358–363, 2016

  16. [16]

    Weibo Gao, Qi Liu, Linan Yue, Fangzhou Yao, Rui Lv, Zheng Zhang, Hao Wang, and Zhenya Huang. Agent4edu: Generating learner response data by generative agents for intelligent education systems. In AAAI, volume 39, pages 23923–23932, 2025

  17. [17]

    Chase Geigle and ChengXiang Zhai. Modeling student behavior with two-layer hidden markov models. JEDM, 9(1):1–24, 2017

  18. [18]

    Wayne Holmes, Kaska Porayska-Pomsta, Ken Holstein, Emma Sutherland, Toby Baker, Simon Buckingham Shum, Olga C Santos, Mercedes T Rodrigo, Mutlu Cukurova, Ig Ibert Bittencourt, et al. Ethics of ai in education: Towards a community-wide framework. International Journal of Artificial Intelligence in Education, 32(3):504–526, 2022

  19. [19]

    Tiancheng Hu and Nigel Collier. Quantifying the persona effect in llm simulations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 10289–10307, 2024

  20. [20]

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023

  21. [21]

    Nicole M Hutchins, Gautam Biswas, Miklós Maróti, Ákos Lédeczi, Shuchi Grover, Rachel Wolf, Kristen Pilner Blair, Doris Chin, Luke Conlin, Satabdi Basu, et al. C2stem: A system for synergistic learning of physics and computational thinking. Journal of Science Education and Technology, 29(1):83–100, 2020

  22. [22]

    Tanja Käser and Giora Alexandron. Simulated learners in educational technology: A systematic literature review and a turing-like test. International Journal of Artificial Intelligence in Education, 34(2):545–585, 2024

  23. [23]

    John S Kinnebrew, Gautam Biswas, Brian Sulcer, and Roger S Taylor. Investigating self-regulated learning in teachable agent environments. In International handbook of metacognition and learning technologies, pages 451–470. Springer, 2013

  24. [24]

    Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452, 2023

  25. [25]

    Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, and Arto Hellas. Llm-itation is the sincerest form of data: Generating synthetic buggy code submissions for computing education. In Proceedings of the 27th Australasian Computing Education Conference, pages 56–63, 2025

  26. [26]

    Fan Li, Tiancheng Zhang, Yifang Yin, Minghe Yu, Mengxiang Wang, and Ge Yu. Priority guided explanation for knowledge tracing with dual ranking and similarity consistency. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 430–438, 2025

  27. [27]

    Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, and Tianyi Zhou. Can llms estimate student struggles? human-ai difficulty alignment with proficiency simulation for item difficulty prediction. arXiv preprint arXiv:2512.18880, 2025

  28. [28]

    Tongguang Li, Debarshi Nath, Yixin Cheng, Yizhou Fan, Xinyu Li, Mladen Raković, Hassan Khosravi, Zachari Swiecki, Yi-Shan Tsai, and Dragan Gašević. Turning real-time analytics into adaptive scaffolds for self-regulated learning using generative artificial intelligence. In Proceedings of the 15th International Learning Analytics and Knowledge Conference,...

  29. [29]

    Naiming Liu, Shashank Sonkar, and Richard Baraniuk. Do llms make mistakes like students? exploring natural alignments between language models and human error patterns. In AIED, pages 364–377. Springer, 2025

  30. [30]

    Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Hui Xiong, Yu Su, and Guoping Hu. Ekt: Exercise-aware knowledge tracing for student performance prediction. In IEEE Transactions on Knowledge and Data Engineering, volume 33, pages 100–115, 2021

  31. [31]

    Xinyi Lu and Xu Wang. Generative students: Using llm-simulated student profiles to support question item evaluation. In Proceedings of the Eleventh ACM Conference on Learning@ Scale, pages 16–27, 2024

  32. [32]

    Amogh Mannekote, Adam Davies, Jina Kang, and Kristy Elizabeth Boyer. Can llms reliably simulate human learner actions? a simulation authoring framework for open-ended learning environments. In AAAI, volume 39, pages 29044–29052, 2025

  33. [33]

    Noboru Matsuda, William W Cohen, Jonathan Sewall, Gustavo Lacerda, and Kenneth R Koedinger. Predicting students’ performance with simstudent: Learning cognitive skills from observation. Frontiers in Artificial Intelligence and Applications, 158:467, 2007

  34. [34]

    Mary L McHugh. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282, 2012

  35. [35]

    Manh Hung Nguyen, Sebastian Tschiatschek, and Adish Singla. Large language models for in-context student modeling: Synthesizing student’s behavior in visual programming. arXiv preprint arXiv:2310.10690, 2023

  36. [36]

    Benjamin Paaßen. Python programming dataset, 2019. URL https://doi.org/10.4119/unibi/2941052. Dataset

  37. [37]

    Benjamin Paaßen, Jessica McBroom, Bryn Jeffries, Irena Koprinska, and Kalina Yacef. Mapping python programs to vectors using recursive neural encodings. Journal of Educational Data Mining, 13(3):1–35, 2021. doi: 10.5281/zenodo.5634224

  38. [38]

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), pages 1–22, 2023

  39. [39]

    Romain Puech, Jakub Macina, Julia Chatain, Mrinmaya Sachan, and Manu Kapur. Towards the pedagogical steering of large language models for tutoring: A case study with modeling productive failure. In Findings of the Association for Computational Linguistics: ACL 2025, pages 26291–26311, 2025

  40. [40]

    Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. NIPS, 37:55249–55285, 2024

  41. [41]

    Pharmasimtext: A text-based educational playground filled with rl-llm agents that work together even in disagreement.JEDM, 17(1): 1–40, 2025

    Bahar Radmehr, Tanja Kaser, and Adish Singla. Pharmasimtext: A text-based educational playground filled with rl-llm agents that work together even in disagreement.JEDM, 17(1): 1–40, 2025

  42. [42]

    Character-llm: A trainable agent for role-playing

    Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-llm: A trainable agent for role-playing. In EMNLP, 2023

  43. [43]

    Analyzing students collaborative problem-solving behaviors in synergistic stem+c learning

    Caitlin Snyder, Nicole M Hutchins, Clayton Cohn, Joyce Horn Fonteles, and Gautam Biswas. Analyzing students collaborative problem-solving behaviors in synergistic stem+c learning. In Proceedings of the 14th Learning Analytics and Knowledge Conference, pages 540–550, 2024

  44. [44]

    Systematic biases in llm simulations of debates

    Amir Taubenfeld, Yaniv Dover, Roi Reichart, and Ariel Goldstein. Systematic biases in llm simulations of debates. In EMNLP, 2024

  45. [45]

    Privacy-preserving synthetic educational data generation

    Jill-Jênn Vie, Tomas Rigaux, and Sein Minn. Privacy-preserving synthetic educational data generation. In European Conference on Technology Enhanced Learning, pages 393–406. Springer, 2022

  46. [46]

    Translating a math word problem to an expression tree

    Lei Wang, Yan Wang, Deng Cai, Dongxiang Zhang, and Xiaojiang Liu. Translating a math word problem to an expression tree. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1064–1069, 2017

  47. [47]

    Development and illustration of a framework for computational thinking practices in introductory physics

    Daniel P Weller, Theodore E Bott, Marcos D Caballero, and Paul W Irving. Development and illustration of a framework for computational thinking practices in introductory physics. Physical Review Physics Education Research, 18(2):020106, 2022

  48. [48]

    Studying as self-regulated learning

    Philip H. Winne and Allyson F. Hadwin. Studying as self-regulated learning. Metacognition in Educational Theory and Practice, 93:277–304, 1998.

  49. [49]

    Leveraging generative artificial intelligence to simulate student learning behavior. arXiv preprint arXiv:2310.19206, 2023

    Songlin Xu and Xinyu Zhang. Leveraging generative artificial intelligence to simulate student learning behavior. arXiv preprint arXiv:2310.19206, 2023

  50. [50]

    Eduagent: Generative student agents in learning

    Songlin Xu, Xinyu Zhang, and Lianhui Qin. Eduagent: Generative student agents in learning. arXiv preprint arXiv:2404.07963, 2024

  51. [51]

    Classroom simulacra: Building contextual student generative agents in online education for learning behavioral simulation

    Songlin Xu, Hao-Ning Wen, Hongyi Pan, Dallas Dominguez, Dongyin Hu, and Xinyu Zhang. Classroom simulacra: Building contextual student generative agents in online education for learning behavioral simulation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26, 2025

  52. [52]

    Huiyan Ye, Biyao Liang, Oi-Lam Ng, and Ching Sing Chai. Integration of computational thinking in k-12 mathematics education: A systematic review on ct-based mathematics instruction and student learning. International Journal of STEM Education, 10(1):3, 2023

  53. [53]

    Towards valid student simulation with large language models. preprint, 2026

    Zhihao Yuan, Yunze Xiao, Ming Li, Weihao Xuan, Richard Tong, Mona Diab, and Tom Mitchell. Towards valid student simulation with large language models. preprint, 2026. URL https://arxiv.org/abs/2601.05473

  54. [54]

    Coderagent: Simulating student behavior for personalized programming learning with large language models

    Yi Zhan, Qi Liu, Weibo Gao, Zheng Zhang, Tianfu Wang, Shuanghong Shen, Junyu Lu, and Zhenya Huang. Coderagent: Simulating student behavior for personalized programming learning with large language models. In James Kwok, editor, IJCAI-25, pages 293–301. IJCAI, 8 2025. doi: 10.24963/ijcai.2025/34. URL https://doi.org/10.24963/ijcai.2025/34. Main Track

  55. [55]

    Self-regulated learning and academic achievement: An overview

    Barry J Zimmerman. Self-regulated learning and academic achievement: An overview. Educational psychologist, 25(1):3–17, 1990.

    Table 3: Complete notation reference.

    | Symbol | Domain | Description |
    |---|---|---|
    | Semantic Domains | | |
    | S_text | Universe | Set of all possible natural language strings |
    | S_code | Universe | Set of all possible executable source codes |
    | A | S_code × S_text | Action spa... |

    1. Check termination conditions (success, max steps)
    2. Check just_received_help flag → skip interrupts
    3. Sample P(Off-Topic) → enter Off-Topic if triggered
    4. Sample P(Assistance) → enter Assistance if triggered
    5. Normal semi-Markov transition

    Off-Topic is checked first because it represents complete disengagement: a student who is truly distracted will not think to ask for help. Assistance represents partial engagement, where the student is stuck but still cognitively active.

    Assistance Flow (Two-Turn Protocol). The Assistance state spans two simulation turns to cap...
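    The check order listed above (termination, help flag, Off-Topic, Assistance, then the normal transition) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `next_state`, the `"DONE"`, `"OFF_TOPIC"`, and `"ASSISTANCE"` labels, and the `semi_markov_step` callback are all hypothetical, and the sketch assumes that a set `just_received_help` flag suppresses both interrupt samples for that turn.

```python
import random
from typing import Callable, Optional


def next_state(
    current: str,
    solved: bool,
    step: int,
    max_steps: int,
    just_received_help: bool,
    p_off_topic: float,
    p_assistance: float,
    semi_markov_step: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the next behavioral state using the priority order from the text.

    All names here are illustrative assumptions, not identifiers from the paper.
    """
    rng = rng or random.Random()

    # 1. Termination: task solved or step budget exhausted.
    if solved or step >= max_steps:
        return "DONE"

    # 2. Assumed reading of the help flag: right after receiving help,
    #    no interrupt is sampled on this turn.
    if not just_received_help:
        # 3. Off-Topic is sampled first (complete disengagement).
        if rng.random() < p_off_topic:
            return "OFF_TOPIC"
        # 4. Assistance is sampled second (stuck but still engaged).
        if rng.random() < p_assistance:
            return "ASSISTANCE"

    # 5. Otherwise fall through to the normal semi-Markov transition.
    return semi_markov_step(current)
```

    Because Off-Topic is sampled before Assistance, a turn on which both draws would fire resolves to Off-Topic, which matches the stated rationale that a fully distracted student does not ask for help.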

    Imperfect Code Style: Cramped spacing (x=y), single-letter variables, and inline comments (“# hope this works”)

    Emotional Authenticity: Expressions of frustration (“Ugh”), relief (“Finally!”), or uncertainty

    Non-Linearity: A messy workflow (Constructing → Debugging → Constructing) rather than a clean linear path.

    AI SIMULATION TELLS (FLAG AS FAKE)

    1. Psychic Debugging: Identifying runtime errors (e.g., “I need to fix the TypeError”) before running the code.
    2. Perfect Code Style: PEP-8 compliance, descriptive variable names, or proper docstrings.
    3. Robotic Explanations: Overly precise language (e.g., “The function signature requires...”).
    4. Amnesia: Repeating the exact same mistake 5+ times without v...

    1. Two 5-point Likert ratings: Behavioral Realism (coding patterns, mistakes, progression) and Code Realism (logic, structure, style)
    2. A binary classification: Real Student or AI Generated
    3. Optional free-text reasoning.

    After completing all samples, raters reported difficulty level (1–5), their CS background, and qualitative feedback on differentiation cues. No timestamps or monologue text were shown, matching the information available in the original Bielefeld dataset.

    H.2 Statistical Analysis

    We performed a comprehensive statistical anal...
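    One common way to compare binary Real Student / AI Generated judgments against chance is an exact binomial test. The sketch below illustrates that general technique only: the text of the analysis is truncated here, so nothing in it is claimed to be the procedure the authors used, and the function name `exact_binomial_p` is hypothetical.

```python
from math import comb


def exact_binomial_p(k: int, n: int, p0: float = 0.5) -> float:
    """Two-sided exact binomial p-value for k correct judgments out of n.

    Sums the probability of every outcome that is no more likely under the
    null accuracy p0 than the observed count k.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so ties with the observed outcome are included.
    cutoff = pmf[k] * (1 + 1e-9)
    return min(1.0, sum(p for p in pmf if p <= cutoff))


# Purely hypothetical counts for illustration, not data from the study.
p_value = exact_binomial_p(k=11, n=20)
print(f"two-sided p-value vs. chance: {p_value:.3f}")
```

    A large p-value here only means that chance-level accuracy cannot be rejected. Showing that accuracy is equivalent to chance would need an equivalence procedure, such as two one-sided tests against a pre-specified margin.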