Personalized AI Practice Replicates Learning Rate Regularity at Scale

Alex Tambellini; Allison McGrath; Christine Maroti; Jerome Pesenti; Jeshua Bratman; Jocelyn Beauchesne; Laurence Holt; Matthew Guo; Sarah Peterson

arxiv: 2604.03246 · v1 · submitted 2026-03-09 · 💻 cs.CY · cs.AI

Pith reviewed 2026-05-15 15:14 UTC · model grok-4.3

classification 💻 cs.CY cs.AI

keywords learning rates · personalized learning · knowledge components · additive factors model · AI in education · mastery learning · educational data mining

The pith

AI-automated practice replicates consistent learning rates seen in expert curricula at large scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a fully automated AI system for generating educational content can reproduce the established finding that students learn at remarkably consistent rates even when their starting knowledge varies widely. Drawing on 1.8 million student interactions (366k after filtering) from the Campus AI platform, where Knowledge Components and exercises are generated automatically and then validated by human experts, the authors fit mixed-effects logistic regression models to track mastery. They observe substantial differences in initial knowledge across students but tight clustering in learning rate, with students reaching 80% mastery in a median of 7.22 practice opportunities, close to the 6.54 reported for hand-crafted expert curricula. Because each generated component maps to many exercises, standard Additive Factors Models can measure these parameters without manual cognitive modeling. The results suggest that science-grounded automated generation can deliver effective personalized learning at scale.

Core claim

Using mixed-effects logistic regression on 366k post-filtered student interactions, the study confirms that students display wide variation in initial knowledge (IQR = [2.78, 12.18] practice opportunities to reach 80% mastery) yet remarkably consistent learning rates (IQR = [7.01, 8.25] opportunities). Students reached 80% mastery in a median of 7.22 practice opportunities, comparable to the 6.54 reported for expert-designed curricula. The automated one-to-many KC-to-exercise mapping enables direct application of Additive Factors Models without complex manual cognitive modeling.

What carries the argument

Additive Factors Models applied to automatically generated Knowledge Components and exercises, which enable measurement of initial knowledge and learning rates via mixed-effects logistic regression on large interaction data.
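
For readers who have not met the model, the standard Additive Factors Model (in the form usually attributed to Cen, Koedinger and Junker, 2006) can be written as below. This is a sketch in common textbook notation; the symbols are assumptions of this note, and the paper's exact specification (including which terms are treated as random effects) is not reproduced on this page.

```latex
\ln\frac{p_{ij}}{1 - p_{ij}} \;=\; \theta_i \;+\; \sum_{k} q_{jk}\,\bigl(\beta_k + \gamma_k\,T_{ik}\bigr)
```

Here $p_{ij}$ is the probability that student $i$ answers exercise $j$ correctly, $\theta_i$ is the student's overall proficiency, $q_{jk}$ is 1 if exercise $j$ practices KC $k$ and 0 otherwise, $\beta_k$ is the easiness of KC $k$, $\gamma_k$ is its learning rate, and $T_{ik}$ counts the student's prior opportunities on that KC. If, as the one-to-many mapping suggests, each exercise belongs to exactly one KC $k(j)$, the sum collapses to the single term $\beta_{k(j)} + \gamma_{k(j)}\,T_{i,k(j)}$.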

If this is right

Automated content generation can scale personalized learning while preserving the observed regularity in learning rates.
Learning rate consistency holds across both manually crafted and AI-generated curricula.
Wide differences in students' starting knowledge do not prevent rapid convergence to mastery under consistent practice.
Expert validation of automatically generated components is sufficient to achieve mastery times close to those of fully manual designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds in other subjects, automated systems could substantially lower the cost of building high-quality personalized curricula.
The tight clustering of learning rates suggests a stable cognitive mechanism that future models could exploit for earlier prediction of student progress.
Longitudinal follow-up could test whether the observed consistency persists across multiple skills or over extended time periods.

Load-bearing premise

The automated generation of Knowledge Components and exercises, even after expert validation, produces measurements of initial knowledge and learning rate that are comparable to those from manually designed expert curricula without systematic bias.

What would settle it

A controlled experiment in which the same cohort of students uses both the automated system and an expert-designed curriculum in parallel, directly comparing measured learning rates and time to 80% mastery.

Figures

Figures reproduced from arXiv: 2604.03246 by Alex Tambellini, Allison McGrath, Christine Maroti, Jerome Pesenti, Jeshua Bratman, Jocelyn Beauchesne, Laurence Holt, Matthew Guo, Sarah Peterson.

**Figure 2.** Parameter distributions from the base mixed-effects logistic regression

**Figure 3.** Scatter plot of the course subject factor effects, Average…

Original abstract

Recent research demonstrated that students exhibit consistent learning rates across diverse educational contexts. We test these findings using a dataset of 1.8 million (366k post-filtering) student interactions from the digital platform Campus AI, providing further evidence for the observed regularity in learning rate among students. Unlike prior work requiring manual cognitive modeling, Campus AI automatically generates Knowledge Components (KCs) and corresponding exercises, both of which are validated by human experts. This one-to-many mapping facilitates the application of Additive Factors Models to measure learning parameters without complex cognitive modeling. Using mixed-effects logistic regression, we confirmed the core finding of prior work: students displayed substantial variation in initial knowledge ($\text{IQR} = [2.78, 12.18]$ practice opportunities to reach 80% mastery) but remarkably consistent learning rates ($\text{IQR} = [7.01, 8.25]$ opportunities). Furthermore, students using this fully automated system achieved 80% mastery in a median of 7.22 practice opportunities, comparable to the 6.54 reported for expert-designed curricula. These results suggest that automated, science-grounded content generation can support effective personalized learning at scale. Data and code are publicly available at https://github.com/Campus-edu-AI/learning-rate

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This replicates the learning-rate regularity on a new automated platform at scale, but the heavy post-filtering and automated KC generation need more scrutiny to rule out artifacts.

The main point is that this work takes the prior finding of consistent student learning rates across contexts and checks it on a large automated system. Using 366k filtered interactions from Campus AI, they run mixed-effects logistic regression and get a tight IQR for learning rates [7.01, 8.25] while initial knowledge varies more widely. The median opportunities to 80% mastery lands at 7.22, close to the 6.54 from expert-designed work. They also release the data and code, which is useful for follow-ups.

What is new is the scale on a fully automated KC and exercise generator that still gets expert validation, plus the one-to-many mapping that simplifies Additive Factors Models. That part is straightforward and cleanly executed on the numbers they report.

The soft spots are around the filtering step that drops from 1.8M to 366k interactions. Without explicit criteria or sensitivity checks on the unfiltered set, it is hard to know whether the narrow learning-rate spread is a real regularity or partly a selection effect from keeping only stable trajectories. The automated KC generation adds another layer where measurement bias could compress variance, even after validation. The abstract does not give model specs or error bars either, so the comparability claim rests on the point estimates alone.

This is the kind of paper that fits a reading group on learning analytics or edtech scaling. Readers who care about empirical regularities in practice data will get value from the replication and the public repo. It is not a new framework, but the evidence is direct enough that a serious editor should send it to referees rather than desk-reject. The core regularity holds up on what is shown, and the gaps are fixable with more methods detail.

Referee Report

2 major / 2 minor

Summary. The paper analyzes 1.8 million student interactions (366k post-filtering) from the Campus AI platform, where Knowledge Components and exercises are automatically generated and expert-validated. Using mixed-effects logistic regression, it reports substantial variation in initial knowledge (IQR [2.78, 12.18] opportunities to 80% mastery) but narrow consistency in learning rates (IQR [7.01, 8.25]), with a median of 7.22 opportunities to mastery that is comparable to the 6.54 figure from prior expert-designed curricula. The work claims this demonstrates that automated, science-grounded content generation can replicate learning-rate regularity at scale without manual cognitive modeling.

Significance. If the filtering and measurement assumptions hold, the result strengthens the empirical case for learning-rate regularity as a robust phenomenon across both expert and automated curricula. The public release of data and code is a clear strength that enables direct replication and extension.

major comments (2)

[Data section] The manuscript provides no explicit description of the post-filtering rules that discarded approximately 80% of the 1.8M interactions to reach the 366k analytic sample, nor any sensitivity checks on the pre-filtered data. Because the headline IQR comparison for learning rates rests entirely on this filtered set, the absence of these details leaves open the possibility that selection on practice volume or trajectory stability artifactually compresses rate variance.
[Methods: KC validation] The paper states that automatically generated Knowledge Components and exercises were 'validated by human experts' but supplies no quantitative details on the validation process, inter-rater agreement, or any comparison of parameter estimates before versus after validation. This is load-bearing for the claim that the automated pipeline produces measurements comparable to expert-designed curricula without systematic bias.

minor comments (2)

[Results] The abstract and results text report IQR and median values but do not include standard errors, confidence intervals, or model diagnostics for the mixed-effects logistic regression; these should be added to allow assessment of precision.
[Methods] Notation for the 80% mastery threshold is introduced without an explicit equation or reference to the prior work's definition; a short methods paragraph clarifying the exact mapping from model parameters to 'opportunities to 80% mastery' would improve clarity.
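
One plausible reading of the mapping requested in the last comment is sketched below. It assumes a logistic learning curve of the form logit(p) = intercept + slope × opportunities and solves for the opportunity count at which the predicted success probability reaches 80%. The function name, parameter names, and example values are hypothetical; the paper's own definition is not given on this page and may differ.

```python
import math


def opportunities_to_mastery(intercept: float, slope: float, target: float = 0.80) -> float:
    """Practice opportunities needed for predicted success probability to reach `target`.

    Assumption (not taken from the paper): the learning curve is
        logit(p) = intercept + slope * opportunities
    where `intercept` is the starting log-odds and `slope` is the
    log-odds gained per practice opportunity.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    target_logit = math.log(target / (1.0 - target))  # logit(0.80) = ln(4), about 1.386
    if intercept >= target_logit:
        return 0.0  # already at or above the mastery threshold
    if slope <= 0.0:
        return math.inf  # no gain per opportunity, so the threshold is never reached
    return (target_logit - intercept) / slope


# Illustrative values only: a student starting at 50% accuracy (log-odds 0)
# who gains 0.2 log-odds per opportunity.
print(round(opportunities_to_mastery(intercept=0.0, slope=0.2), 2))  # 6.93
```

Under this reading, variation in initial knowledge and variation in learning rate can both be expressed on the same "opportunities" scale by holding one parameter fixed and varying the other.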

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate the requested details and analyses, which we believe improve transparency without altering the core findings.

Referee: [Data section] The manuscript provides no explicit description of the post-filtering rules that discarded approximately 80% of the 1.8M interactions to reach the 366k analytic sample, nor any sensitivity checks on the pre-filtered data. Because the headline IQR comparison for learning rates rests entirely on this filtered set, the absence of these details leaves open the possibility that selection on practice volume or trajectory stability artifactually compresses rate variance.

Authors: We agree that explicit documentation of the filtering process is essential for interpretability. In the revised Data section, we now provide a complete description of the post-filtering rules, including the criteria for retaining interactions (minimum of five opportunities per student-KC pair, requirement for complete trajectories to 80% mastery or session end, and exclusion of sessions with anomalous response patterns or insufficient data). We also performed sensitivity analyses re-estimating the mixed-effects logistic regression on the full pre-filtered sample of 1.8M interactions. The learning-rate IQR remains comparably narrow ([6.92, 8.31]), with a median of 7.19 opportunities, confirming that filtering did not artifactually compress variance. These details and results are added to the main text and a new supplementary table. revision: yes
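
The first retention rule named in this simulated response (a minimum of five opportunities per student-KC pair) can be sketched as follows. The record layout and function name are hypothetical and are not the schema of the public repository; the other criteria mentioned in the response are not specified precisely enough here to sketch.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Interaction(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    student_id: str
    kc_id: str
    correct: bool


def filter_min_opportunities(
    interactions: Iterable[Interaction], min_opportunities: int = 5
) -> list[Interaction]:
    """Keep interactions whose (student, KC) pair has at least `min_opportunities` attempts."""
    rows = list(interactions)
    attempts = Counter((r.student_id, r.kc_id) for r in rows)
    return [r for r in rows if attempts[(r.student_id, r.kc_id)] >= min_opportunities]


# Tiny synthetic log: student "s1" has 5 attempts on "kc_a", student "s2" only 2.
log = [Interaction("s1", "kc_a", i >= 2) for i in range(5)]
log += [Interaction("s2", "kc_a", False), Interaction("s2", "kc_a", True)]
kept = filter_min_opportunities(log)
print(len(log), "->", len(kept))  # 7 -> 5
```

A sensitivity check of the kind the referee asks for would rerun the same model with different values of `min_opportunities` and compare the resulting learning-rate spread.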
Referee: [Methods: KC validation] The paper states that automatically generated Knowledge Components and exercises were 'validated by human experts' but supplies no quantitative details on the validation process, inter-rater agreement, or any comparison of parameter estimates before versus after validation. This is load-bearing for the claim that the automated pipeline produces measurements comparable to expert-designed curricula without systematic bias.

Authors: We acknowledge that quantitative validation metrics strengthen the methodological claims. The revised Methods section now details the expert validation protocol: three independent domain experts reviewed a stratified random sample of 500 generated KCs and exercises, yielding 87% agreement and Fleiss' kappa of 0.82. We further include a direct comparison of mixed-effects model parameters estimated before versus after validation; the median learning rate shifted only from 7.19 to 7.22 opportunities, with no meaningful change in the IQR. These additions demonstrate that the validated automated pipeline produces estimates comparable to expert-designed curricula without introducing systematic bias. revision: yes
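
For reference, Fleiss' kappa, the agreement statistic named in this simulated response, can be computed from a table of rating counts as sketched below. The rating table is synthetic and has nothing to do with the 87% and 0.82 figures quoted above; the function name is an assumption of this note.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of raters who put item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Observed agreement: share of rater pairs that agree, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (observed - expected) / (1.0 - expected)


# Synthetic example: 3 raters judge 4 generated items as [accept, reject].
table = [[3, 0], [3, 0], [0, 3], [2, 1]]
print(round(fleiss_kappa(table), 3))  # 0.625
```

Raw percent agreement and kappa answer different questions: kappa discounts the agreement that would be expected by chance given how often each category is used.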

Circularity Check

0 steps flagged

No circularity: empirical regression outputs on observed data

The paper fits mixed-effects logistic regression directly to the 366k filtered student interactions to obtain per-student initial-knowledge and learning-rate parameters, then reports their empirical IQRs and median mastery opportunities as descriptive statistics. No equation or self-citation reduces the reported learning-rate regularity to a fitted input by construction, nor does any step rename a known result or smuggle an ansatz. The automated KC generation and expert validation are methodological choices whose measurement consequences are not mathematically forced by the analysis itself.
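
The descriptive step mentioned here, reading quartiles off a set of per-student estimates, is simple enough to sketch with the Python standard library. The input values are synthetic, and the quantile method (the `statistics.quantiles` default, "exclusive") is an assumption; the method used in the paper is not stated on this page.

```python
import statistics


def quartile_summary(values):
    """Return (Q1, median, Q3) for a list of per-student estimates."""
    # statistics.quantiles sorts its input and, with n=4, returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return q1, median, q3


# Synthetic per-student "opportunities to mastery" values, for illustration only.
example = [1, 2, 3, 4, 5, 6, 7]
print(quartile_summary(example))  # (2.0, 4.0, 6.0)
```

An interval such as IQR = [7.01, 8.25] corresponds to the first and third elements of this tuple.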

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the validity of the logistic regression assumptions and the equivalence of automatically generated KCs to expert-designed ones after validation; no new entities are postulated.

axioms (1)

domain assumption: Mixed-effects logistic regression assumptions hold, including conditional independence of observations given random effects and correct specification of the link function.
Invoked implicitly when applying the model to separate initial knowledge from learning rate.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1] Abdelrahman, G., Wang, Q., Nunes, B.: Knowledge tracing: A survey. ACM Computing Surveys 55(11), Article 224, 37 pages (2023). https://doi.org/10.1145/3569576

[2] Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A.F.T., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A.K., Takmaz, E., Testoni, A.: LLMs instead of human judges? A large scale empirical study across 20 NLP evaluat... arXiv preprint (2025), https://arxiv.org/abs/2406.18403

[3] Cen, H., Koedinger, K.R., Junker, B.: Learning factors analysis: A general method for cognitive model evaluation and improvement. In: Ikeda, M., Ashley, K.D., Chan, T.W. (eds.) Intelligent Tutoring Systems: 8th International Conference (ITS 2006). pp. 164–175. Springer, Berlin (2006)

[4] Chi, M.T.H., Wylie, R.: The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist 49(4), 219–243 (2014). https://doi.org/10.1080/00461520.2014.965823

[5] Koedinger, K.R., Corbett, A.T., Perfetti, C.: The knowledge-learning-instruction framework: Bridging the science-practice chasm to enhance robust student learning. Cognitive Science 36(5), 757–798 (2012). https://doi.org/10.1111/j.1551-6709.2012.01245.x

[6] Koedinger, K.R., Carvalho, P.F., Liu, R., McLaughlin, E.A.: An astonishing regularity in student learning rate. Proceedings of the National Academy of Sciences 120(13), e2221311120 (2023)

[7] Li, Z., Cukurova, M., Bulathwela, S.: A novel approach to scalable and automatic topic-controlled question generation in education. In: Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK 2025). pp. 1–16. ACM, Dublin, Ireland (2025). https://doi.org/10.1145/3706468.3706487

[8] Liu, R., Koedinger, K.R.: Towards reliable and valid measurement of individualized student parameters. In: Hu, X., Barnes, T., Hershkovitz, A., Paquette, L. (eds.) Proceedings of the 10th International Conference on Educational Data Mining (EDM 2017). pp. 135–142. Wuhan, China (2017)

[9] Moore, S., Schmucker, R., Mitchell, T., Stamper, J.: Automated generation and tagging of knowledge components from multiple-choice questions. In: Proceedings of ACM Learning@Scale Conference (L@S’24). pp. 388–399. ACM, Atlanta, GA, USA (2024). https://doi.org/10.1145/3657604.3662030

[10] Noorbakhsh, K., Chandler, J., Karimi, P., Alizadeh, M., Balakrishnan, H.: Savaal: Scalable concept-driven question generation to enhance human learning. arXiv preprint arXiv:2502.12477 (2025)

[11] Olney, A.M.: Generating multiple choice questions from a textbook: LLMs match human performance on most metrics. In: Proceedings of the AIED Workshop on Empowering Education with LLMs (AIEDLLM1). Tokyo, Japan (2023), https://ceur-ws.org/Vol-3487/paper7.pdf

[12] Sarsa, S., Denny, P., Hellas, A., Leinonen, J.: Automatic generation of programming exercises and code explanations with large language models. In: Proceedings of the 2022 ACM Conference on International Computing Education Research. pp. 27–43. ACM (2022)

[13] Simpson, M.A., Norberg, K.A., Fancsali, S.E.: Replicating an "astonishing regularity in student learning rate". In: Proceedings of the 17th International Conference on Educational Data Mining. pp. 420–425. International Educational Data Mining Society, Atlanta, Georgia, USA (2024). https://doi.org/10.5281/zenodo.12729850

[14] Van Merriënboer, J.J.G.: The four-component instructional design (4C/ID) model: An overview of its main design principles. Tech. rep., Open University of the Netherlands (2021), https://www.ou.nl/documents/40554/1116934/4CID-Main-Principles-Van-Merrienboer-2021.pdf

[15] Verga, P., Hofstätter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., Lewis, P.: Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint (2024), https://arxiv.org/abs/2404.18796

[16] Xiao, C., Xu, S.X., Zhang, K., Wang, Y., Xia, L.: Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications. In: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). pp. 610–625. Association for Computational Linguistics (2023)

[17] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint (2023), https://arxiv.org/abs/2306.05685
