Recognition: no theorem link
Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Pith reviewed 2026-05-14 21:18 UTC · model grok-4.3
The pith
Serializing student logs into conversations trains open models to simulate realistic programming learner behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing each student's problem-solving process as a dialogue between the learner and the automated assessment system, in which code submissions and feedback such as test outcomes form alternating turns, models can be trained to replicate authentic student debugging behavior more faithfully than prior methods.
What carries the argument
Conversational serialization of temporal student log traces into alternating turns of code submissions and environment feedback.
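As a rough illustration of this serialization step (not the authors' implementation; the event schema with `timestamp`, `code`, `grade`, and `error_trace` keys is assumed for the sketch), a log trace could be rendered into chat-format turns like so:

```python
# Sketch: turning a student's timestamped submission log into alternating
# chat turns. The student's code becomes 'assistant' turns (the role the
# simulated learner will generate); autograder feedback becomes 'user' turns.
# Field names are hypothetical, not taken from the paper.

def serialize_trace(events):
    """Convert (submission, feedback) log events into chat messages,
    ordered by timestamp to preserve the debugging sequence."""
    messages = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        messages.append({"role": "assistant", "content": event["code"]})
        feedback = (
            f"grade: {event['grade']}\n"
            f"tests: {event['tests_passed']}/{event['tests_total']}\n"
            f"errors: {event['error_trace'] or 'none'}"
        )
        messages.append({"role": "user", "content": feedback})
    return messages
```

A trace of N submissions thus yields 2N turns, and standard chat-template fine-tuning machinery can consume it directly.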
If this is right
- Trained models can replicate iterative debugging processes more accurately when environment feedback is included.
- Open-weight models become viable alternatives to prompted proprietary LLMs for student simulation.
- Scalable evaluation of tutoring strategies becomes possible using realistic artificial students.
- The training pipeline of supervised fine-tuning combined with preference optimization aligns models to real learner patterns.
- Privacy and cost concerns are reduced by avoiding large proprietary models and enabling local training.
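If the preference-optimization stage follows a DPO-style objective (an assumption; the abstract says only "preference optimization", though the paper's reference graph cites DPO), the per-pair loss on a preferred vs. dispreferred continuation could be sketched as:

```python
import math

# Sketch of a DPO-style preference loss for one pair. The 'chosen'
# continuation would be the authentic student next submission and the
# 'rejected' one a less learner-like alternative; log-probabilities are
# summed over the trace under the policy and the frozen SFT reference.
# This is an illustrative stand-in, not the paper's exact objective.

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid of the beta-scaled implicit reward margin."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log(1.0 + math.exp(-margin))  # == -log sigmoid(margin)
```

When the policy matches the reference, the margin is zero and the loss is log 2; it falls toward zero as the policy learns to prefer the authentic continuation.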
Where Pith is reading between the lines
- Similar serialization techniques might apply to simulating learners in other domains like math or writing if process logs are available.
- These models could be used to generate synthetic datasets for training better tutoring systems.
- Testing on different programming languages or assignment types could reveal how general the approach is.
- If the simulations are accurate, they might help identify common student misconceptions automatically.
Load-bearing premise
Converting student logs into conversational turns captures enough of the original context and intent so the model learns real behavior instead of just surface patterns.
What would settle it
Training the models both with and without environment feedback and finding no significant improvement in functional alignment or code similarity on a test set of real student traces would falsify the benefit of including feedback.
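Running that falsification test requires a concrete code-similarity score between a model-generated submission and the real student's next one. The paper's reference graph cites CodeBLEU; as a purely illustrative stand-in, a sequence-matching ratio gives the flavor:

```python
import difflib

# Crude stand-in for a code-similarity metric (the paper likely uses
# something like CodeBLEU; difflib's ratio is illustrative only).

def code_similarity(generated: str, reference: str) -> float:
    """Similarity in [0, 1] between a model-generated submission and
    the real student's next submission."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()
```

The with-feedback and without-feedback models would each be scored this way over the same held-out traces, and the per-trace deltas tested for significance.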
Original abstract
Artificial students -- models that simulate how learners act and respond within educational systems -- are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, most existing approaches rely on prompting large, proprietary language models, limiting adaptability to specific courses and raising concerns around privacy, cost, and dependence. In this work, we propose a framework for training open-weight artificial programming learners directly from authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student's problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens models' ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language models baselines in functional alignment and code similarity. We release our code to support reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that serializing real student programming log traces into alternating conversational turns between learner and automated assessment system (incorporating code submissions and environment feedback such as test outcomes and error traces), followed by supervised fine-tuning plus preference optimization on open-weight Qwen 4B/8B models, produces artificial students that more faithfully replicate authentic debugging behavior. This yields measurable gains in functional alignment and code similarity over code-only baselines and prompted-LLM baselines on held-out student data.
Significance. If the central claim holds, the work supplies a reproducible, open-weight route to scalable artificial-student simulators that avoids proprietary-model dependence, improves privacy and cost, and directly leverages authentic process data. The public code release is a concrete strength that enables community verification and extension for tutoring-strategy evaluation.
major comments (2)
- [§5] §5 (Evaluation): the reported gains in functional alignment and code similarity are attributed to conversational serialization, yet no ablation isolates the alternating-turn format from the mere presence of feedback tokens. Without this comparison the attribution to the serialization step remains untested and the weakest assumption (preservation of debugging intent) is not directly addressed.
- [§4] §4 (Method): the serialization procedure collapses multi-turn edits, pauses, and exploratory dead-ends into single turns, but the manuscript provides neither a quantitative measure of information loss nor an analysis showing that the resulting dialogues retain decision order and implicit intent rather than surface co-occurrences.
minor comments (3)
- [§5.1] Exact prompt templates, model versions, and decoding parameters for the prompted-LLM baselines are not specified, hindering direct replication.
- [§5.2] Statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the functional-alignment and code-similarity deltas are absent.
- [Data section] Precise descriptions of the train/validation/test splits, number of unique students, and assignment distribution are missing from the data section.
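The paired bootstrap the report asks for could be sketched as follows (hypothetical helper; the per-trace metric deltas between the two systems are assumed precomputed):

```python
import random
import statistics

# Sketch: paired bootstrap confidence interval for a per-trace metric
# delta (e.g., functional alignment with vs. without feedback). If the
# interval excludes 0, the gain is unlikely to be resampling noise.
# Illustrative only; not code from the paper under review.

def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """(lo, hi) percentile bounds on the mean per-trace improvement."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Resampling whole traces (rather than individual submissions) keeps the within-student dependence intact, which is the point of pairing.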
Simulated Author's Rebuttal
We thank the referee for the insightful comments on our manuscript. We address each major comment below and have updated the paper accordingly to improve clarity and rigor.
Point-by-point responses
- Referee: [§5] §5 (Evaluation): the reported gains in functional alignment and code similarity are attributed to conversational serialization, yet no ablation isolates the alternating-turn format from the mere presence of feedback tokens. Without this comparison the attribution to the serialization step remains untested and the weakest assumption (preservation of debugging intent) is not directly addressed.
  Authors: We agree that an explicit ablation separating the alternating-turn structure from the inclusion of feedback tokens would strengthen the attribution. Our current code-only baseline removes feedback entirely, and the conversational models include both elements. In the revised version, we will include an additional baseline where feedback is appended without the conversational turn format to isolate the effect of serialization. This addresses the concern about testing the preservation of debugging intent through the format. revision: yes
- Referee: [§4] §4 (Method): the serialization procedure collapses multi-turn edits, pauses, and exploratory dead-ends into single turns, but the manuscript provides neither a quantitative measure of information loss nor an analysis showing that the resulting dialogues retain decision order and implicit intent rather than surface co-occurrences.
  Authors: We acknowledge that the serialization process involves some aggregation of student actions into turns, which could potentially lose fine-grained temporal information. However, our evaluation demonstrates that models trained on these serialized dialogues better replicate real student debugging sequences compared to baselines, indicating that critical decision points are retained. We will add a new subsection in the revised manuscript providing a qualitative analysis of sample serialized dialogues alongside original logs to illustrate preservation of intent, and discuss the trade-offs of this approach. revision: partial
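The ablation the first response proposes could be sketched as two renderings of the same trace that differ only in format (hypothetical helpers, not the authors' code): feedback as alternating turns versus feedback appended flat.

```python
# Sketch of the proposed ablation: same content, two formats. Comparing
# models trained on each would isolate the effect of the turn structure
# from the mere presence of feedback tokens. Illustrative only.

def as_turns(trace):
    """Feedback as alternating learner/environment chat turns."""
    msgs = []
    for code, feedback in trace:
        msgs.append({"role": "assistant", "content": code})
        msgs.append({"role": "user", "content": feedback})
    return msgs

def as_flat(trace):
    """Feedback present but collapsed into one untyped text block."""
    return "\n".join(f"{code}\n# feedback: {feedback}"
                     for code, feedback in trace)
```

Both variants expose identical tokens of information; only the conversational structure differs, which is exactly the variable the referee wants isolated.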
Circularity Check
No circularity; derivation uses held-out data and external baselines
full rationale
The paper serializes real student logs into conversational turns, applies supervised fine-tuning plus preference optimization, and evaluates functional alignment and code similarity on held-out student submissions against independent baselines (code-only models and prompted LLMs). No equations, parameters, or claims reduce by construction to the inputs; the reported gains are measured via external comparison rather than self-referential fitting or self-citation chains. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: student debugging behavior can be adequately captured by alternating code-submission and environment-feedback turns in a conversational format.
Reference graph
Works this paper leans on
- [1] INTRODUCTION: To better support learners at scale, computing education research has long relied on models of students built from the rich log data collected as they solve programming assignments [16, 41]. Much of this work focuses on capturing what students know, such as estimating mastery over concepts through knowledge tracing [6, 17, 40], or identif...
- [2] RELATED WORK: A complementary line of work focuses on knowledge tracing (KT), whose primary goal is to estimate a student's mastery of knowledge components across exercises. While early approaches focused exclusively on predicting success on future assignments, more recent methods leverage language-model-based architectures to predict students' first sub...
- [3] FROM LOGS TO DIALOGS: In this section, we introduce our approach for transforming student log data into suitable data for student simulation. Our work assumes that students interact with an automated assessment system returning summative feedback [5]. 3.1 Assumptions: We assume access to a deterministic grading function which, for a submitted program...
- [4] TRAINING ARTIFICIAL LEARNERS: In this section, we present our training pipeline for training a language model π_θ to simulate how programming students solve assignments. Our core pipeline combines supervised fine-tuning with offline preference optimization on a serialized dataset D. We additionally explore online preference optimization as an alternative t...
- [5] EXPERIMENTS: In this section, we detail the components of our experiments aimed at evaluating the utility of our framework. 5.1 Dataset: We evaluate our framework using FalconCode [8], a large-scale CS1 Python programming dataset from the United States Air Force Academy. The dataset includes student submissions, the associated grades, and the course auto...
- [6] RESULTS: Table 1 summarizes our training and test split statistics. Table 2 reports coverage and generation quality metrics averaged across all rollout steps. Figure 1 details performance at each rollout step, and Figure 2 illustrates how model-generated grades evolve compared to students' ground-truth grades. We highlight several key findings below. Pro...
- [7] CONCLUDING DISCUSSION: Our experiments show that conversational serialization of student–environment interactions, combined with preference optimization, produces artificial students that more closely track real learners' debugging behavior than prompted baselines and models trained without feedback. Performance improvements compared to baselines are con...
- [8] S. N. Akter, S. Prabhumoye, J. Kamalu, S. Satheesh, E. Nyberg, M. Patwary, M. Shoeybi, and B. Catanzaro. MIND: Math informed synthetic dialogues for pretraining LLMs. In The Thirteenth International Conference on Learning Representations, 2025.
- [9] N. Ashok Kumar and A. Lan. Improving socratic question generation using data augmentation and preference optimization. In E. Kochmar, M. Bexte, J. Burstein, A. Horbach, R. Laarmann-Quante, A. Tack, V. Yaneva, and Z. Yuan, editors, Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 108–118, Mexi...
- [10] D. Azcona and A. F. Smeaton. Targeting at-risk students using engagement and effort predictors in an introductory computer programming course. In European Conference on Technology Enhanced Learning, pages 361–366. Springer, 2017.
- [11] A. H. Brown. Simulated classrooms and artificial students: The potential effects of new technologies on teacher education. Journal of Research on Computing in Education, 32(2):307–318, 1999.
- [12] D. L. Butler and P. H. Winne. Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65:245–281, 1995.
- [13] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
- [14] D. Han, M. Han, and the Unsloth team. Unsloth, 2023.
- [15] A. de Freitas, J. Coffman, M. de Freitas, J. Wilson, and T. Weingart. FalconCode: A multiyear dataset of Python code samples from an introductory computer science course. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pages 938–944, New York, NY, USA, 2023. Association for Computing Machinery.
- [16] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.
- [17] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, page 441, Red Hook, NY, USA, 2023. Curran Associates Inc.
- [18] N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023.
- [19] D. Dinucu-Jianu, J. Macina, N. Daheim, I. Hakimi, I. Gurevych, and M. Sachan. From problem-solving to teaching problem-solving: Aligning LLMs with pedagogy using reinforcement learning. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 272–2...
- [20] Z. Duan, N. Fernandez, A. Hicks, and A. Lan. Test case-informed knowledge tracing for open-ended coding tasks. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, LAK '25, pages 238–248, New York, NY, USA, 2025. Association for Computing Machinery.
- [21] Z. Duan, N. Fernandez, and A. Lan. KASER: Knowledge-aligned student error simulator for open-ended coding tasks, 2026.
- [22] Z. Duan, N. Fernandez, A. B. L. Narayanan, M. Hassany, R. S. de Alencar, P. Brusilovsky, B. Akram, and A. Lan. Automated knowledge component generation for interpretable knowledge tracing in coding problems, 2025.
- [23] M. C. Jadud. Methods and tools for exploring novice compilation behaviour. In Proceedings of the Second International Workshop on Computing Education Research, pages 73–84, 2006.
- [24] J. Kasurinen and U. Nikula. Estimating programming knowledge with Bayesian knowledge tracing. ACM SIGCSE Bulletin, 41(3):313–317, 2009.
- [25] C. Koutcheme, N. Dainese, and A. Hellas. Direct repair optimization: Training small language models for educational program repair improves feedback. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 564–581, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [26] H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 21314–21328. Curran Associates, Inc., 2022.
- [27] J. Leinonen, P. Denny, O. Kiljunen, S. MacNeil, S. Sarsa, and A. Hellas. LLM-itation is the sincerest form of data: Generating synthetic buggy code submissions for computing education. In Proceedings of the 27th Australasian Computing Education Conference, ACE '25, pages 56–63, New York, NY, USA, 2025. Association for Computing Machinery.
- [28] N. Liu, Z. Wang, R. Baraniuk, and A. Lan. Open-ended knowledge tracing for computer science education. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3849–3862, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.
- [29] S. MacNeil, M. Rogalska, J. Leinonen, P. Denny, A. Hellas, and X. Crosland. Synthetic students: A comparative study of bug distribution between large language models and computing students. In Proceedings of the 2024 on ACM Virtual Global Computing Education Conference V. 1, SIGCSE Virtual 2024, pages 137–143, New York, NY, USA, 2024. Association for Computing Machinery.
- [31] N. Matsuda, W. W. Cohen, J. Sewall, G. Lacerda, and K. R. Koedinger. Evaluating a simulated student using real students data for training and testing. In International Conference on User Modeling, pages 107–116. Springer, 2007.
- [32] M. Miroyan, R. Niousha, J. E. Gonzalez, G. Ranade, and N. Norouzi. ParaStudent: Generating and evaluating realistic student code by teaching LLMs to struggle, 2025.
- [33] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.
- [34] M. H. Nguyen, V.-A. Pădurean, A. Gotovos, S. Tschiatschek, and A. Singla. Synthesizing high-quality programming tasks with LLM-based expert and student agents. In A. I. Cristea, E. Walker, Y. Lu, O. C. Santos, and S. Isotani, editors, Artificial Intelligence in Education, pages 77–91, Cham, 2025. Springer Nature Switzerland.
- [36] T. Phung, V.-A. Pădurean, A. Singh, C. Brooks, J. Cambronero, S. Gulwani, A. Singla, and G. Soares. Automating human tutor-style programming feedback: Leveraging GPT-4 tutor model for hint generation and GPT-3.5 student model for hint validation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, LAK '24, pages 12–23, New York, NY, U...
- [37] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- [38] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma. CodeBLEU: A method for automatic evaluation of code synthesis, 2020.
- [39] A. Ross, M. Srivastava, J. Blanchard, and J. Andreas. Modeling student learning with 3.8 million program traces. arXiv preprint arXiv:2510.05056, 2025.
- [40] A. Scarlatos, R. S. Baker, and A. Lan. Exploring knowledge tracing in tutor-student dialogues using LLMs. In Proceedings of the 15th Learning Analytics and Knowledge Conference, LAK 2025, Dublin, Ireland, March 3–7, 2025. ACM, 2025.
- [41] A. Scarlatos, N. Liu, J. Lee, R. Baraniuk, and A. Lan. Training LLM-based tutors to improve student learning outcomes in dialogues. In A. I. Cristea, E. Walker, Y. Lu, O. C. Santos, and S. Isotani, editors, Artificial Intelligence in Education, pages 251–266, Cham, 2025. Springer Nature Switzerland.
- [42] A. Scarlatos, D. Smith, S. Woodhead, and A. Lan. Improving the validity of automatically generated feedback via reinforcement learning. In International Conference on Artificial Intelligence in Education, pages 280–294. Springer, 2024.
- [43] J. Schulman and Thinking Machines Lab. LoRA without regret. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/lora/
- [45] T. Sirkiä and J. Sorva. Exploring programming misconceptions: An analysis of student mistakes in visual program simulation exercises. In Proceedings of the 12th Koli Calling International Conference on Computing Education Research, pages 19–28, 2012.
- [46] Unsloth AI. LoRA hyperparameters guide, 2024. Accessed: 2025-12-23.
- [47] K. VanLehn, S. Ohlsson, and R. Nason. Applications of simulated students: An exploration. Journal of Artificial Intelligence in Education, 5:135–135, 1994.
- [48] L. Wang, A. Sy, L. Liu, and C. Piech. Deep knowledge tracing on programming exercises. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, pages 201–204, 2017.
- [50] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
- [51] J. Woodrow, C. Piech, and S. Koyejo. Improving generative AI student feedback: Direct preference optimization with teachers in the loop. In Proceedings of the 18th International Conference on Educational Data Mining, pages 442–449. International Educational Data Mining Society, July 2025.
- [52] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- [54] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net, 2023.
- [55] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025.
- [56] T. Zheng, G. Zhang, T. Shen, X. Liu, B. Y. Lin, J. Fu, W. Chen, and X. Yue. OpenCodeInterpreter: Integrating code generation with execution and refinement. In L. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11–16, 2024, pages 12834–12859. Associa...
discussion (0)