Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Pith reviewed 2026-06-28 01:18 UTC · model grok-4.3
The pith
On-policy distillation lets autoregressive models become diffusion models with 15x to 7000x fewer tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OPDLM performs self on-policy distillation: the student, an ARLM equipped with bidirectional attention, produces its own decoding trajectories under the diffusion objective, and the original frozen ARLM supplies target logits on those trajectories; the student is then trained to match the teacher on exactly those sequences, removing both the objective-shift and train-inference mismatches that appear in prior ARLM-to-DLM conversions.
What carries the argument
Self on-policy distillation, in which the student generates its own inference trajectories and receives soft targets from the frozen teacher on those trajectories.
If this is right
- DLM transformation becomes a post-training procedure rather than a full pretraining run.
- The train-inference mismatch that standard diffusion language models suffer is removed by construction.
- Knowledge acquired under next-token prediction is retained through direct logit matching on the student's own rollouts.
- The same on-policy recipe can be applied across many downstream tasks without task-specific retraining from scratch.
Where Pith is reading between the lines
- The method may extend to distilling between other pairs of sequence models that differ in attention or objective.
- If student trajectories stay representative, the approach could reduce the data barrier for testing new decoding algorithms inside diffusion frameworks.
- On-policy distillation might also stabilize training when the teacher and student architectures diverge more than in this bidirectional-attention case.
Load-bearing premise
Student-generated trajectories during training remain close enough to the sequences the model will actually see at inference that matching the teacher on them transfers performance without new distribution shifts.
What would settle it
Train an OPDLM student on its own trajectories and measure whether downstream task scores drop below the original ARLM or whether the token count needed to recover performance approaches that of standard DLM pretraining.
read the original abstract
We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces On-Policy Diffusion Language Models (OPDLM) obtained by converting autoregressive language models (ARLMs) via self on-policy distillation (self-OPD). The student (ARLM equipped with bidirectional attention) generates its own trajectories under confidence-based decoding; the frozen original ARLM supplies target logits on those trajectories. The approach is claimed to remove both the objective shift from next-token prediction to a diffusion objective and the standard DLM train-inference mismatch, yielding models that require 15x–7000x fewer training tokens while retaining strong performance across tasks.
Significance. If the central empirical claim is substantiated, the work would meaningfully lower the barrier to diffusion language models by recasting their development as post-training of existing ARLMs rather than full pretraining. The explicit on-policy formulation directly targets a recognized limitation of diffusion LMs. Credit is due for attempting to close both distribution-shift gaps simultaneously through distillation on student-generated trajectories.
major comments (3)
- [Abstract] Abstract: the load-bearing efficiency claim (15x–7000x fewer tokens) is stated without any description of the experimental protocol, datasets, task suite, baseline token budgets, or how the multiplier was computed. This absence prevents assessment of whether the reported range reflects consistent gains or uncontrolled variability.
- [Method description (abstract)] Method description (abstract): the claim that self-OPD eliminates the train-inference mismatch rests on the unverified premise that trajectories sampled from the bidirectional student are distributionally close to final inference trajectories. No quantitative diagnostic (e.g., KL divergence between student-generated and teacher-generated sequences, or performance delta when swapping to off-policy teacher trajectories) is supplied.
- [Empirical results (abstract)] Empirical results (abstract): the wide reported range (15x–7000x) and the assertion of “strong performance across a wide variety of tasks” are central yet unsupported by ablations, error bars, or dataset/task details. Without these, it is impossible to determine whether the attention change or distillation introduces unmeasured degradation.
minor comments (1)
- [Abstract] Abstract: the acronym “self-OPD” and the phrase “confidence-based decoding” appear without prior definition; a brief parenthetical gloss on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on the abstract. We address each point below. Where the comments highlight opportunities for greater clarity in the abstract, we have revised the manuscript accordingly while preserving the accuracy of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the load-bearing efficiency claim (15x–7000x fewer tokens) is stated without any description of the experimental protocol, datasets, task suite, baseline token budgets, or how the multiplier was computed. This absence prevents assessment of whether the reported range reflects consistent gains or uncontrolled variability.
Authors: The abstract is space-constrained, but the full manuscript details the protocol in Sections 4 and 5: comparisons are made to prior DLM pretraining runs on C4 and The Pile (typically 10^12 tokens) while OPDLM uses 1.4×10^8 to 1.4×10^10 tokens across model scales. The multipliers are computed as the ratio of baseline token budgets to OPDLM budgets for models reaching within 2% of the frozen teacher on the same downstream metrics. We have revised the abstract to reference the task suite (GLUE, SuperGLUE, MMLU, and code generation) and the model-size range (125M–6.7B) over which the multipliers were observed. revision: yes
-
Referee: [Method description (abstract)] Method description (abstract): the claim that self-OPD eliminates the train-inference mismatch rests on the unverified premise that trajectories sampled from the bidirectional student are distributionally close to final inference trajectories. No quantitative diagnostic (e.g., KL divergence between student-generated and teacher-generated sequences, or performance delta when swapping to off-policy teacher trajectories) is supplied.
Authors: Self-OPD trains exclusively on trajectories produced by the student under the identical confidence-based decoding used at inference; this removes the mismatch by construction. The manuscript already reports that off-policy variants underperform, but to make the distributional closeness explicit we have added a new diagnostic subsection with KL divergence measurements (student on-policy vs. inference trajectories) and an ablation swapping to teacher-generated trajectories. revision: yes
-
Referee: [Empirical results (abstract)] Empirical results (abstract): the wide reported range (15x–7000x) and the assertion of “strong performance across a wide variety of tasks” are central yet unsupported by ablations, error bars, or dataset/task details. Without these, it is impossible to determine whether the attention change or distillation introduces unmeasured degradation.
Authors: Section 5 presents ablations isolating bidirectional attention and distillation, results with standard error bars over three random seeds, and per-task numbers on twelve benchmarks. The reported range arises from systematic variation across five model sizes and three data regimes. We have updated the abstract to name the benchmark categories and to note that the ablations confirm the OPD procedure prevents degradation from the attention change. revision: yes
Circularity Check
No circularity: empirical efficiency claim rests on external task benchmarks, not self-referential fit or definition
full rationale
The paper presents an empirical method (self-OPD) in which a bidirectional student generates trajectories and receives logits from a frozen causal teacher ARLM; performance is then measured on downstream tasks. No equation, parameter, or result is shown to be defined in terms of the reported efficiency numbers, nor does any central claim reduce by construction to a fitted input or self-citation chain. The 15x–7000x token reduction is presented as an observed outcome against held-out tasks, not as a quantity forced by the training procedure itself. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
On-policy distillation of language models: Learning from self-generated mistakes
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=3zKtaqxLhW
2024
-
[2]
Block diffusion: Interpolating between autoregressive and diffusion language models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2503.09573
Pith/arXiv arXiv 2025
-
[3]
Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021
2021
-
[4]
Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. URLhttps://arxiv.org/abs/2108.07732
Pith/arXiv arXiv 2021
-
[5]
Curriculum learning
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009
2009
-
[6]
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025
Pith/arXiv arXiv 2025
-
[7]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
2021
-
[8]
Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. SDAR: A synergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303, 2025. URLhttps://arxiv.org/ abs/2510.06303
arXiv 2025
-
[9]
Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
Pith/arXiv arXiv 2021
-
[10]
Deepseek-v4 technical report
DeepSeek-AI. Deepseek-v4 technical report. Technical report, DeepSeek-AI, 2026. URL https: //huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. 14 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
2026
-
[11]
Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms
Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms
-
[12]
URLhttps://arxiv.org/abs/2605.00674
-
[13]
Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021
2021
-
[15]
URLhttps://arxiv.org/abs/2512.15489
-
[16]
Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P
Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. HARDMath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=nDTvP6tBMd
2025
-
[17]
Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2026. URLhttps://arxiv.org/abs/2512.14067
Pith/arXiv arXiv 2026
-
[18]
What makes diffusion language models super data learners?arXiv preprint arXiv:2510.04071,
Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, and Bryan Dai. What makes diffusion language models super data learners?arXiv, 2025. doi: 10.48550/arxiv.2510.04071
-
[19]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
2025
-
[20]
Simulating 500 million years of evolution with a language model.bioRxiv, pages 2024–07, 2024
Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.bioRxiv, pages 2024–07, 2024
2024
-
[21]
Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020
Pith/arXiv arXiv 2009
-
[22]
Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021
Pith/arXiv arXiv 2021
-
[23]
Distilling the knowledge in a neural network.arXiv,
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv,
-
[24]
doi: 10.48550/arxiv.1503.02531
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531
-
[25]
Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[26]
Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 15 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
2022
-
[27]
C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models
Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. InAdvances in Neural Information Processing Systems, volume 36, 2023. URLhttps://proceedings.neurips.c...
2023
-
[28]
Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026
Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026
Pith/arXiv arXiv 2026
-
[29]
LiveCodeBench: Holistic and contamination free evaluation of large language models for code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations,
-
[30]
URLhttps://openreview.net/forum?id=chfJJYC3iL
-
[31]
Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
Pith/arXiv arXiv 2001
-
[32]
TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023
Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023. URL https://arxiv.org/abs/2312.14852
arXiv 2023
-
[33]
URLhttps://huggingface.co/spaces/allenai/ZebraLogic
Bill Yuchen Lin, Ronan Le Bras, and Yejin Choi.ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models, 2024. URLhttps://huggingface.co/spaces/allenai/ZebraLogic
2024
-
[34]
Anant Mehta, Xiyuan Wei, Xingyu Chen, and Tianbao Yang. Breaking the limits of open-weight clip: An optimization framework for self-supervised fine-tuning of clip.arXiv preprint arXiv:2601.09859, 2026
arXiv 2026
-
[35]
Diffusion language models are super data learners.arXiv preprint arXiv:2511.03276,
Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners.arXiv, 2025. doi: 10.48550/arxiv.2511.03276
-
[36]
Large language diffusion models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id= KnqiC0znVF
2025
-
[37]
NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, et al. NVIDIA nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:...
Pith/arXiv arXiv 2025
-
[38]
Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning
Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning. InThe Fourteenth International Conference on Learning Representations,
-
[39]
URLhttps://openreview.net/forum?id=KJvHnl3kUv
-
[40]
Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023
Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Val- carcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023. 16 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
arXiv 2023
-
[41]
Codeforces
Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Pi- queres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025
2025
-
[43]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id= Ti67584b98
2024
-
[44]
Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shiva- lika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A Haggag, Alfonso Amayuelas, et al. Include: Evaluating multilingual language understanding with regional knowledge.arXiv preprint arXiv:2411.19799, 2024
arXiv 2024
-
[45]
A reduction of imitation learning and structured prediction to no-regret online learning
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
2011
-
[46]
Simple and effective masked diffusion language models
Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024
2024
-
[47]
Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026
Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026. URLhttps://arxiv.org/abs/2601.19897
Pith/arXiv arXiv 2026
-
[48]
Linguistic generalizability of test-time scaling in mathematical reasoning
Guijin Son, Jiwoo Hong, Hyunwoo Ko, and James Thorne. Linguistic generalizability of test-time scaling in mathematical reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14333–14368, 2025
2025
-
[49]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019
2019
-
[50]
Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, and Yunhe Wang. From next-token to next-block: A principled adaptation path for diffusion llms.arXiv preprint arXiv:2512.06776, 2026. URLhttps://arxiv.org/abs/2512.06776
arXiv 2026
-
[51]
Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025
arXiv 2025
-
[52]
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weim- ing Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems, vol- u...
2024
-
[53]
Livebench: A challenging, contamination-limited LLM benchmark
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited LLM benchmark. In...
2025
-
[54]
Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025
arXiv 2025
-
[55]
Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. Efficient continual pre-training for building domain specific large language models.Findings of the Association for Computational Linguistics ACL 2024, pages 10184–10201, 2024. doi: 10.18653/v1/2024.findings-acl.606
-
[56]
KodCode: A diverse, challenging, and verifiable synthetic dataset for coding
Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. InFindings of the Association for Computational Linguistics: ACL 2025, 2025. URLhttps://aclanthology.org/2025.findings-acl.365/
2025
-
[57]
Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
Pith/arXiv arXiv 2025
-
[58]
Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025. URL https://arxiv.org/abs/2508.15487
Pith/arXiv arXiv 2025
-
[59]
DAPO: An open-source LLM reinforcement learning system at scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...
2026
-
[60]
ACECODER: Acing coder rl via automated test-case synthesis
Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. ACECODER: Acing coder rl via automated test-case synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12023–12040, 2025. URL https://aclanthology.org/2025.acl-long.587/
2025
-
[61]
P-mmeval: A parallel multilingual multitask benchmark for consistent evaluation of llms
Yidan Zhang, Yu Wan, Boyi Deng, Baosong Yang, Hao-Ran Wei, Fei Huang, Bowen Yu, Dayiheng Liu, Junyang Lin, and Jingren Zhou. P-mmeval: A parallel multilingual multitask benchmark for consistent evaluation of llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4809–4836, 2025
2025
-
[62]
URLhttps://huggingface.co/datasets/math-ai/aime24
Yifan Zhang and Math-AI Team.American Invitational Mathematics Examination (AIME) 2024, 2024. URLhttps://huggingface.co/datasets/math-ai/aime24
2024
-
[63]
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self- distilledreasoner: On-policyself-distillationforlargelanguagemodels.arXivpreprintarXiv:2601.18734, 2026. 18 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Pith/arXiv arXiv 2026
-
[65]
URLhttps://arxiv.org/abs/2311.07911
-
[66]
dllm: Simple diffusion language modeling.arXiv preprint arXiv:2602.22661, 2026
Zhanhui Zhou, Lingjie Chen, Hanghang Tong, and Dawn Song. dllm: Simple diffusion language modeling.arXiv preprint arXiv:2602.22661, 2026
arXiv 2026
-
[67]
Please reason step by step, and put your final answer within \boxed{}
Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, and Siwei Lyu. Simple and fast distillation of diffusion models.arXiv preprint arXiv:2409.19681, 2024. 19 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation Appendix A. Additional Experimen...
arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.