arxiv: 2606.31048 · v1 · pith:HZOIL32Ynew · submitted 2026-06-30 · 💻 cs.LG · cs.AI

Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O Bryan Mathematics Competition

Pith reviewed 2026-07-01 06:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords knowledge distillation · chain of thought · mathematical reasoning · LoRA fine-tuning · math competition problems · model compression · response length

The pith

Fine-tuning a 7B model on chain-of-thought traces from a large reasoning model raises accuracy on math competition problems from 64.67% to 69.43%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether chain-of-thought reasoning generated by DeepSeek-R1 can be distilled into Qwen2.5-7B to improve performance on historical problems from the John O'Bryan Mathematics Competition. A dual-agent process creates the training corpus, after which LoRA fine-tuning runs for 200 iterations to limit overfitting. Five independent runs produce a mean accuracy of 69.43% on the competition set, a gain of 4.76 points over the base model, and 73.1% on the separate MATH-500 benchmark. Accuracy falls steadily as mean response length shrinks from roughly 220 words at the longest level to 31 words at the shortest, with the largest drop in the two-person speed section.

Core claim

The central claim is that five independent 200-iteration LoRA runs on the CoT corpus yield a mean 69.43% accuracy (std 0.17%) on the competition problems versus the base model's 64.67%, with generalization to 73.1% (std 0.18%) on MATH-500, while accuracy declines monotonically across six response-length tiers from 69.43% at the longest tier to 41.9% at the shortest.
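The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted on this page (base 64.67%, fine-tuned student 69.43%, teacher 91.40% from the abstract). The "teacher gap closed" ratio is an editorial convenience for reading the result, not a metric the paper reports.

```python
# Accuracy figures (percent) on the competition set, as quoted in the abstract.
BASE = 64.67
STUDENT = 69.43
TEACHER = 91.40

def summarize(base, student, teacher):
    """Return absolute gain (pp), relative gain, and share of the teacher gap closed."""
    gain = student - base
    relative = gain / base
    gap_closed = gain / (teacher - base)  # editorial metric, not from the paper
    return gain, relative, gap_closed

gain, relative, gap_closed = summarize(BASE, STUDENT, TEACHER)
print(f"gain: {gain:.2f} pp")                   # gain: 4.76 pp
print(f"relative gain: {relative:.1%}")         # relative gain: 7.4%
print(f"teacher gap closed: {gap_closed:.1%}")  # teacher gap closed: 17.8%
```

Read this way, the student recovers a bit under a fifth of the distance between the base model and the teacher on this problem set.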

What carries the argument

The dual-agent framework that produces chain-of-thought training examples from the teacher model on competition problems, followed by LoRA adaptation limited to 200 iterations chosen from the validation-loss minimum observed in a preliminary 1,000-iteration run.
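The stopping rule described above amounts to taking the arg-min of a logged validation-loss curve. A minimal sketch follows, assuming the losses are available as a mapping from iteration to loss; the function name `best_checkpoint` is hypothetical, and only the two validation-loss values quoted on this page (0.374 at iteration 200, 0.826 at iteration 1,000) are used as input.

```python
def best_checkpoint(val_losses):
    """Return (iteration, loss) for the lowest validation loss.

    val_losses maps iteration number -> validation loss.
    Ties resolve to the earliest iteration.
    """
    if not val_losses:
        raise ValueError("no validation losses recorded")
    iteration = min(sorted(val_losses), key=lambda it: val_losses[it])
    return iteration, val_losses[iteration]

# Only the two values reported on this page; intermediate points of the
# curve are not given here and are deliberately not invented.
reported = {200: 0.374, 1000: 0.826}
stop_at, loss = best_checkpoint(reported)
print(stop_at, loss)  # 200 0.374
```

Because the minimum is read off the same curve that motivates the rule, the chosen iteration count is itself a tuned quantity, which is the concern raised under "Load-bearing premise".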

If this is right

  • The reported accuracy gain remains consistent across five random seeds with standard deviation below 0.2 points.
  • Performance gains transfer to the MATH-500 benchmark outside the original competition distribution.
  • Accuracy drops in a graded way as response length is reduced, with the speed section most sensitive.
  • Training beyond the 200-iteration point produces rising validation loss and overfitting on this corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Response length may act as a separate control on reasoning reliability after distillation, regardless of the specific competition problems.
  • The same 200-iteration stopping heuristic could be tested on other teacher-student pairs or non-math reasoning tasks to check whether early stopping remains optimal.
  • If shorter outputs systematically lose accuracy, future distillation pipelines might add explicit length or step-count objectives rather than relying on raw CoT traces alone.

Load-bearing premise

That stopping training at the iteration where validation loss reaches its minimum on a preliminary run yields stable gains rather than an artifact of that particular stopping rule on this dataset.

What would settle it

A new set of runs that continue past 200 iterations until validation loss stabilizes or rises further, then check whether the accuracy improvement over the base model shrinks below statistical significance or reverses.

Figures

Figures reproduced from arXiv: 2606.31048 by Aaditya Khanal, Gaurab Baral, Junxiu Zhou, Yangyang Tao.

Figure 1
Average problem difficulty by year (DeepSeek-R1 ratings, scale 1–10). Darker bars are at or above the mean (5.19).

5 Evaluation Metrics

Answer accuracy is the share of problems where the model's answer matches the ground truth:

\[ \text{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Problems Judged}} \tag{2} \]

Because mathematical answers can appear in multiple equivalent forms, correctness is assessed by a separate DeepSeek-R…
Figure 2
Figure 2 shows the training and validation loss curves from the diagnostic 1,000-iteration run. Training loss fell from 0.476 at iteration 10 to near zero by iteration 900. Validation loss followed a U-shaped trajectory: it reached a minimum of 0.374 at iteration 200, then rose steadily to 0.826 at iteration 1,000.
Figure 3
Figure 3 shows per-year accuracy across reasoning levels. Year 2017 stands out as consistently strong, ranging from 51.2% (R6) to 74.4% (R2), consistent with its lowest average difficulty score of 4.51 (…

Original abstract

This paper investigates knowledge distillation from a large reasoning model (DeepSeek-R1) to a compact student model (Qwen2.5-7B). Using historical problems from the John O'Bryan Mathematics Competition at Northern Kentucky University (2011-2025), we build a Chain-of-Thought (CoT) training corpus through a dual-agent framework. The dataset is used to fine-tune the student model with Low-Rank Adaptation (LoRA) on Apple Silicon hardware using the MLX framework. The base Qwen2.5-7B model achieves 64.67% accuracy on competition problems, while the DeepSeek-R1 teacher achieves 91.40%. An initial 1,000-iteration training run revealed severe overfitting, with validation loss reaching a minimum at iteration 200 before rising steadily. Based on this finding, we ran five independent training runs each limited to 200 iterations with varied random seeds to assess result stability. Across these five runs, the fine-tuned student model achieves a mean accuracy of 69.43% (std dev 0.17%) on the competition dataset, a 4.76 percentage-point improvement over the base model, and generalizes to 73.1% (std dev 0.18%) on the MATH-500 benchmark. We further study how response length affects answer quality across six reasoning levels (R1-R6): accuracy declines consistently from 69.43% at R1 (mean 220 words) to 41.9% at R6 (mean 31.2 words), with the two-person speed section most sensitive to token reduction. These results demonstrate that CoT distillation improves compact student models and that response length is a critical factor in mathematical reasoning quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates knowledge distillation of chain-of-thought reasoning from DeepSeek-R1 to Qwen2.5-7B via LoRA on John O'Bryan Mathematics Competition problems (2011-2025). After a preliminary 1,000-iteration run identified a validation-loss minimum at iteration 200, five independent 200-iteration runs yield a mean accuracy of 69.43% (std 0.17%) on the competition set (4.76 pp above the 64.67% base model), 73.1% (std 0.18%) on MATH-500, and a monotonic accuracy drop with shorter responses across six reasoning levels.

Significance. If the reported gains survive a properly isolated stopping rule, the work supplies concrete evidence that compact models can acquire improved mathematical reasoning via distillation from large reasoning models, with measurable generalization to MATH-500 and an explicit demonstration that response length modulates answer quality. The five-run protocol with reported standard deviations and the preliminary overfitting diagnosis are positive methodological features.

major comments (3)
  1. [Training Setup] The 200-iteration stopping point is selected post-hoc from the validation-loss minimum observed in a preliminary 1,000-iteration run performed on the identical problem distribution; the five seed-varied runs quantify variance only conditional on this fixed rule, so the 4.76 pp gain may partly reflect selection on the observed loss trajectory rather than a robust distillation effect.
  2. [Results] The 4.76 percentage-point improvement is presented without a statistical significance test (paired t-test across seeds, bootstrap CI on the mean, or similar), leaving open whether the delta reliably exceeds sampling variation given the reported std dev of 0.17%.
  3. [Dataset Construction] The manuscript provides no details on the train/validation/test split sizes, how the held-out evaluation set was chosen, or whether the validation set used for the preliminary 1,000-iteration run overlaps with the final test problems.
minor comments (2)
  1. [Title / Abstract] The competition name is spelled without the apostrophe in the title and abstract; consistent use of 'O'Bryan' would improve readability.
  2. [Results] The response-length analysis (R1-R6) would be clearer if accompanied by a table listing exact mean word counts and accuracies per level rather than only narrative description.
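The significance question in major comment 2 can be roughed out from the summary statistics alone. The sketch below is a one-sample t-test on the per-seed gain over the base model. It assumes the reported 0.17% is the sample standard deviation of per-seed accuracy, and that the base model's 64.67% is a fixed constant, so the standard deviation of the gain equals that of the accuracy. The critical value 2.776 is the standard two-sided 95% Student's t quantile for 4 degrees of freedom. The function name is hypothetical.

```python
import math

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776

def t_from_summary(mean_gain, std, n):
    """One-sample t statistic for H0: mean gain = 0, from summary statistics."""
    se = std / math.sqrt(n)  # standard error of the mean
    return mean_gain / se, se

mean_gain = 69.43 - 64.67  # percentage points, from the abstract
t_stat, se = t_from_summary(mean_gain, std=0.17, n=5)
half_width = T_CRIT_95_DF4 * se
ci = (mean_gain - half_width, mean_gain + half_width)

print(f"t = {t_stat:.1f} (df = 4)")              # t = 62.6 (df = 4)
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}] pp")  # 95% CI: [4.55, 4.97] pp
```

Under these assumptions the gain is far outside seed-to-seed noise. The test captures only variation across random seeds; it says nothing about sampling error from the finite problem set or about the post-hoc choice of 200 iterations, which is a separate issue raised in major comment 1.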

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive note on the methodological features such as the five-run protocol and overfitting diagnosis. We address each major comment below and will revise the manuscript to strengthen the presentation.

  1. Referee: [Training Setup] The 200-iteration stopping point is selected post-hoc from the validation-loss minimum observed in a preliminary 1,000-iteration run performed on the identical problem distribution; the five seed-varied runs quantify variance only conditional on this fixed rule, so the 4.76 pp gain may partly reflect selection on the observed loss trajectory rather than a robust distillation effect.

    Authors: We agree this is a valid methodological concern. The preliminary run served to identify the onset of overfitting on the given distribution, after which the stopping rule was fixed at 200 iterations for the five independent runs. This does condition the reported variance on a data-dependent choice. In the revision we will add an explicit limitations paragraph discussing this post-hoc selection and note that more rigorous protocols such as nested validation could be used in follow-up work. revision: yes

  2. Referee: [Results] The 4.76 percentage-point improvement is presented without a statistical significance test (paired t-test across seeds, bootstrap CI on the mean, or similar), leaving open whether the delta reliably exceeds sampling variation given the reported std dev of 0.17%.

    Authors: We concur that a formal test is warranted. The observed mean gain of 4.76 pp with a standard deviation of only 0.17% across five seeds already suggests low variability, but we will add either a one-sample t-test on the per-seed improvements or a bootstrap confidence interval for the mean accuracy in the revised results section to quantify statistical reliability. revision: yes

  3. Referee: [Dataset Construction] The manuscript provides no details on the train/validation/test split sizes, how the held-out evaluation set was chosen, or whether the validation set used for the preliminary 1,000-iteration run overlaps with the final test problems.

    Authors: We will supply the missing details in the revision. The full paper will state the exact sizes of the train, validation, and test partitions (allocated by competition year to preserve temporal separation), describe the random or year-based selection procedure for the held-out set, and explicitly confirm that the validation problems used in the preliminary 1,000-iteration run have no overlap with the final test problems. revision: yes
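A year-based partition with an explicit overlap check, of the kind promised in response 3, could look roughly like the following. This is a sketch under stated assumptions: the record fields `id` and `year`, the choice of 2024 for validation and 2025 for test, and the three placeholder problems per year are all hypothetical and do not reflect the paper's actual split or problem counts.

```python
def split_by_year(problems, val_years, test_years):
    """Partition problems by competition year; remaining years go to train."""
    val_years, test_years = set(val_years), set(test_years)
    if val_years & test_years:
        raise ValueError("validation and test years must not overlap")
    parts = {"train": [], "val": [], "test": []}
    for problem in problems:
        if problem["year"] in test_years:
            parts["test"].append(problem)
        elif problem["year"] in val_years:
            parts["val"].append(problem)
        else:
            parts["train"].append(problem)
    return parts

def overlapping_ids(a, b):
    """Return the set of problem ids present in both partitions."""
    return {p["id"] for p in a} & {p["id"] for p in b}

# Placeholder records covering 2011-2025; NOT the real competition data.
problems = [{"id": f"{year}-{n}", "year": year}
            for year in range(2011, 2026) for n in range(1, 4)]

parts = split_by_year(problems, val_years=[2024], test_years=[2025])
assert not overlapping_ids(parts["val"], parts["test"])
assert not overlapping_ids(parts["train"], parts["test"])
print({name: len(items) for name, items in parts.items()})
# {'train': 39, 'val': 3, 'test': 3}
```

Splitting on year rather than at random keeps every problem from a given competition in a single partition, which makes the no-overlap claim straightforward to verify.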

Circularity Check

0 steps flagged

No circularity; results are independent empirical measurements

The paper's central claims consist of measured accuracies on held-out competition problems and MATH-500 after LoRA fine-tuning. The 200-iteration stopping point is selected from a preliminary run on the same distribution, but this does not algebraically force the reported accuracy values or reduce any prediction to its inputs by construction. No equations, self-citations, or ansatzes are invoked in a load-bearing manner for the accuracy figures; the evaluation remains external to the training procedure itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the dual-agent CoT corpus is high-quality training data and that the post-hoc 200-iteration cutoff yields generalizable rather than overfit performance; no new physical or mathematical entities are introduced.

free parameters (2)
  • training iterations = 200
    200 iterations selected after inspecting the validation-loss curve from an initial longer run on the same data.
  • LoRA rank and learning-rate schedule
    Standard LoRA hyperparameters whose exact values are not reported but implicitly tuned for the observed results.
axioms (1)
  • domain assumption: The dual-agent framework produces sufficiently accurate and pedagogically useful Chain-of-Thought traces for the student model to learn from.
    Invoked when constructing the training corpus from competition problems.

pith-pipeline@v0.9.1-grok · 5872 in / 1613 out tokens · 42718 ms · 2026-07-01T06:10:22.593232+00:00 · methodology

