pith. machine review for the scientific record.

arxiv: 2604.06385 · v1 · submitted 2026-04-07 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords pedagogical LLMs · reinforcement learning · supervised fine-tuning · open-source models · educational AI · domain specialization · Qwen3-32B · CDPK benchmark

The pith

A multi-stage RL and SFT pipeline applied to a 32B open-source LLM creates pedagogical models that set new SOTA results and surpass larger proprietary systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a three-stage optimization process: first, reinforcement learning with progressive difficulty training, a focus on challenging examples, and extended reasoning rollouts; then supervised fine-tuning that uses the RL model to generate high-quality training data via difficulty-weighted sampling; and optionally a second RL round. This pipeline is run on the Qwen3-32B backbone to produce the EduQwen 32B family of models. The resulting models reach high accuracy on the Cross-Domain Pedagogical Knowledge Benchmark, establishing new state-of-the-art scores on the Pedagogy Benchmark Leaderboard and exceeding the performance of much larger general-purpose systems such as Gemini-3 Pro. The central demonstration is that targeted, application-driven optimization can convert mid-sized open-source models into specialized pedagogical experts while retaining the transparency and efficiency benefits of open models. A sympathetic reader would care because this approach offers a concrete route to high-performing educational AI that does not require access to the largest closed models.
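
To make the stage ordering concrete, here is a minimal, runnable sketch of the pipeline as described. Every function is a hypothetical stand-in (the paper publishes no code), and the abstract does not state whether stage-2 SFT starts from the base model or from the RL1 checkpoint; the sketch assumes the base model.

```python
def rl_optimize(model, tasks, second_round=False):
    """Stand-in for RL with progressive difficulty, hard-example focus,
    and extended reasoning rollouts."""
    return f"{model}+RL{2 if second_round else 1}"

def synthesize_sft_data(rl_model, tasks):
    """Stand-in for difficulty-weighted data synthesis from the RL model."""
    return [f"{rl_model} -> {t}" for t in tasks]

def sft_train(model, data):
    """Stand-in for supervised fine-tuning on the synthesized data."""
    return f"{model}+SFT({len(data)} examples)"

def train_eduqwen(base="Qwen3-32B", tasks=("easy-item", "hard-item")):
    rl1 = rl_optimize(base, tasks)                            # stage 1 -> EduQwen 32B-RL1
    sft = sft_train(base, synthesize_sft_data(rl1, tasks))    # stage 2 -> EduQwen 32B-SFT
    sft_rl2 = rl_optimize(sft, tasks, second_round=True)      # stage 3 (optional) -> EduQwen 32B-SFT-RL2
    return rl1, sft, sft_rl2

print(train_eduqwen())
```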

Core claim

The authors establish that their multi-stage optimization strategy of RL with progressive difficulty training and extended rollouts, followed by SFT using difficulty-weighted sampling from the RL model, and an optional second RL stage, transforms the dense Qwen3-32B backbone into EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2. These models achieve sufficient accuracy on the CDPK Benchmark to set new SOTA results across the Pedagogy Benchmark Leaderboard and exceed the performance of significantly larger proprietary models such as Gemini-3 Pro.

What carries the argument

The multi-stage RL and SFT optimization pipeline that implements progressive difficulty training, extended reasoning rollouts, and difficulty-weighted data synthesis from the RL-trained model.
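
Of these components, difficulty-weighted data synthesis is the least standard, so a hedged sketch may help. The paper does not define its difficulty measure; the sketch assumes difficulty is the RL model's empirical failure rate per item (ids and values hypothetical) and samples synthesis prompts in proportion to it.

```python
import random

# Assumed difficulty scores: per-item failure rates of the RL model (hypothetical values).
failure_rates = {"q1": 0.10, "q2": 0.45, "q3": 0.90}

def difficulty_weighted_sample(rates, k, seed=0):
    """Sample item ids with probability proportional to their difficulty score."""
    rng = random.Random(seed)
    ids = list(rates)
    return rng.choices(ids, weights=[rates[i] for i in ids], k=k)

# Harder items ("q3") dominate the synthesized SFT set.
print(difficulty_weighted_sample(failure_rates, k=6))
```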

If this is right

  • Domain-specialized optimization can enable mid-sized open-source LLMs to outperform significantly larger general-purpose systems within the target domain.
  • The resulting models preserve transparency, customizability, and cost-efficiency suitable for responsible educational deployment.
  • Application-driven training can convert general backbones into true domain experts rather than relying on scale alone.
  • The same pipeline offers a repeatable method for building other specialized open models without proprietary resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar staged RL and SFT strategies could be tested in other narrow domains such as medical reasoning or legal analysis to produce competitive open experts.
  • If benchmark gains translate to classroom settings, these models could support lower-cost personalized tutoring tools that schools can inspect and modify.
  • The emphasis on difficulty-weighted sampling suggests that future work might measure how much the progressive curriculum itself, rather than the base model size, drives the gains; one concrete reading of that curriculum is sketched below.
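
Since the paper does not spell out its schedule, the following is only a plausible, entirely assumed operationalization of "progressive difficulty training": widen the training pool from easy to hard as the run progresses.

```python
import random

def curriculum_batches(difficulty, total_steps, batch_size=4, seed=0):
    """Yield batches drawn from a difficulty-capped prefix that grows over training."""
    rng = random.Random(seed)
    ordered = sorted(difficulty, key=difficulty.get)     # easiest first
    for step in range(total_steps):
        frac = (step + 1) / total_steps                  # fraction of the run completed
        pool = ordered[: max(1, round(frac * len(ordered)))]
        yield step, rng.choices(pool, k=batch_size)

# Hypothetical task difficulties; early steps see only the easy tasks.
tasks = {"t1": 0.2, "t2": 0.5, "t3": 0.8, "t4": 0.95}
for step, batch in curriculum_batches(tasks, total_steps=4):
    print(step, batch)
```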

Load-bearing premise

The CDPK Benchmark and Pedagogy Benchmark Leaderboard provide a valid and generalizable measure of pedagogical capability beyond the test distribution, and the reported gains are caused by the described RL and SFT pipeline.

What would settle it

A new, held-out evaluation set of pedagogical tasks drawn from fresh domains or real classroom scenarios; if the EduQwen models lose their reported advantage over Gemini-3 Pro or other baselines there, the central claim fails.
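
Whatever held-out set is used, settling the comparison also needs uncertainty estimates, which the referee notes are absent. A paired bootstrap over per-item correctness is one standard way to get them; the 0/1 vectors below are placeholders, not reported results.

```python
import random

def paired_bootstrap_gap(model_a, model_b, n_boot=10_000, seed=0):
    """95% bootstrap CI on the accuracy gap between two models scored on the same items."""
    rng = random.Random(seed)
    n = len(model_a)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]       # resample items with replacement
        gaps.append(sum(model_a[i] - model_b[i] for i in idx) / n)
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]

eduqwen  = [1, 1, 0, 1, 1, 0, 1, 1]   # per-item correctness (placeholder data)
baseline = [1, 0, 0, 1, 1, 0, 1, 0]
print(paired_bootstrap_gap(eduqwen, baseline))
```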

Figures

Figures reproduced from arXiv: 2604.06385 by Anurag Garikipati, Madalina Ciobanu, Navan Preet Singh, Qingqing Mao, Ritankar Das, Xiaokun Wang.

Figure 1. Training dynamics of the RL optimization phase. The reward signal exhibits high variance during early training as the model explores the pedagogical reasoning space, then shows rapid improvement, with reward climbing from approximately 0.5 to near 1.0 within the first 100 steps. Convergence is achieved by approximately step 400, with variance decreasing substantially as the model stabilizes…
Figure 2. The corresponding SFT training loss curve, which decreases from approximately 0.5 to near zero, with convergence occurring by approximately step 150. The rapid convergence of both training phases suggests that the Qwen3-32B base model is highly amenable to pedagogical specialization, and that our difficulty-weighted curriculum provides an efficient learning signal. Notably, all stages of our optimization…
read the original abstract

We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a three-stage optimization pipeline for pedagogical LLMs: (1) reinforcement learning on the Qwen3-32B backbone with progressive difficulty training, focus on hard examples, and extended reasoning rollouts; (2) supervised fine-tuning on high-quality data synthesized by the RL model using difficulty-weighted sampling; and (3) an optional second RL round. The resulting EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 models are claimed to achieve new state-of-the-art accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark and the interactive Pedagogy Benchmark Leaderboard, outperforming significantly larger proprietary models including the prior leader Gemini-3 Pro.

Significance. If the experimental results and causal attribution hold after proper validation, the work would demonstrate that targeted, application-driven RL+SFT pipelines can convert mid-sized open-source dense models into domain experts that surpass much larger general-purpose systems in a specialized area. This would support broader adoption of transparent, customizable, and cost-efficient open-source models for educational AI.

major comments (3)
  1. [Abstract] The central SOTA claim and the outperformance of Gemini-3 Pro are asserted without any numerical benchmark scores, baseline comparisons, statistical significance tests, error bars, or references to result tables. This absence prevents verification of the magnitude and reliability of the reported gains.
  2. [Experimental Setup / Results] No description is given of how the CDPK Benchmark test set was constructed, including selection criteria or checks for overlap with the RL-synthesized training data used in stages 1-3. This leaves open the possibility of data leakage or contamination as an alternative explanation for the SOTA scores (a minimal overlap check is sketched after this list).
  3. [Results] The manuscript provides no ablation studies or controlled comparisons isolating the contribution of progressive-difficulty RL, difficulty-weighted SFT data synthesis, and the optional second RL stage. Without these, it is impossible to attribute performance improvements causally to the described pipeline rather than to the Qwen3-32B backbone or unstated factors.
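
For the overlap check in comment 2, a minimal version would flag any test item that shares a long word n-gram with the training corpus. The n-gram length and the flagging rule below are illustrative assumptions, not the paper's (unreported) protocol.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(test_items, train_items, n=8):
    """Flag test items sharing any word n-gram with the training corpus."""
    train_grams = set().union(*(ngrams(t, n) for t in train_items))
    return [item for item in test_items if ngrams(item, n) & train_grams]

train = ["a teacher scaffolds new material by linking it to prior knowledge in class"]
test = ["a teacher scaffolds new material by linking it to prior knowledge in class today",
        "what is the capital of france"]
print(flag_contaminated(test, train))   # flags only the near-duplicate
```
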
minor comments (1)
  1. [Introduction] The model variant naming (EduQwen 32B-RL1, EduQwen 32B-SFT, EduQwen 32B-SFT-RL2) and the exact differences between stages would benefit from a summary table early in the paper for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate the revisions we will incorporate to improve clarity, verifiability, and rigor.

read point-by-point responses
  1. Referee: [Abstract] The central SOTA claim and the outperformance of Gemini-3 Pro are asserted without any numerical benchmark scores, baseline comparisons, statistical significance tests, error bars, or references to result tables. This absence prevents verification of the magnitude and reliability of the reported gains.

    Authors: We agree that the abstract would be strengthened by explicit numerical support for the SOTA claims. In the revised manuscript we will update the abstract to report the primary CDPK Benchmark accuracies for the EduQwen 32B models, the direct comparison to Gemini-3 Pro, and explicit pointers to the results tables that contain the full baseline comparisons and any statistical details. revision: yes

  2. Referee: [Experimental Setup / Results] No description is given of how the CDPK Benchmark test set was constructed, including selection criteria or checks for overlap with the RL-synthesized training data used in stages 1-3. This leaves open the possibility of data leakage or contamination as an alternative explanation for the SOTA scores.

    Authors: We will expand the Experimental Setup section with a full description of the CDPK test-set construction process, the selection criteria applied, and the explicit checks performed to confirm no overlap with the RL- and SFT-synthesized training data. These additions will directly address the data-leakage concern. revision: yes

  3. Referee: [Results] The manuscript provides no ablation studies or controlled comparisons isolating the contribution of progressive-difficulty RL, difficulty-weighted SFT data synthesis, and the optional second RL stage. Without these, it is impossible to attribute performance improvements causally to the described pipeline rather than to the Qwen3-32B backbone or unstated factors.

    Authors: The staged models (RL1, SFT, SFT-RL2) already allow cumulative comparison, yet we acknowledge that targeted ablations would strengthen causal attribution. In the revision we will add controlled comparisons that isolate the effect of progressive-difficulty training and difficulty-weighted sampling, while noting that a full second-round RL ablation is computationally intensive and will be included where resources permit. revision: partial
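
The controlled comparisons promised in response 3 amount to an ablation grid over the three named components. A minimal enumeration of that grid, with the training and evaluation step left as a placeholder:

```python
from itertools import product

# Component names follow the pipeline description; the evaluation itself is a placeholder.
COMPONENTS = ["progressive_difficulty", "difficulty_weighted_sft", "second_rl_round"]

def ablation_grid():
    """Yield every on/off combination of the three pipeline components."""
    for flags in product([False, True], repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, flags))

for config in ablation_grid():
    print(config)   # train + evaluate each variant on the same held-out CDPK split
```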

Circularity Check

0 steps flagged

No circularity; performance claims rest on external benchmarks rather than internal definitions

full rationale

The paper presents a multi-stage RL+SFT optimization pipeline (progressive-difficulty RL with extended rollouts, followed by difficulty-weighted SFT data synthesis from the RL model, optional second RL) applied to the Qwen3-32B backbone. The central result is empirical SOTA accuracy on the independent Cross-Domain Pedagogical Knowledge (CDPK) Benchmark and Pedagogy Benchmark Leaderboard, surpassing Gemini-3 Pro. No equations, fitted parameters, or self-citations are invoked in a way that would reduce the benchmark scores to the training process by construction. The benchmark is treated as an external evaluation set, and the method description contains no self-definitional loops, renamed known results, or load-bearing uniqueness theorems. The derivation chain is therefore self-contained as a standard empirical fine-tuning workflow evaluated on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard RL reward design and sampling hyperparameters whose values are not reported.

pith-pipeline@v0.9.0 · 5598 in / 1496 out tokens · 71638 ms · 2026-05-10T19:49:46.047402+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
