Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Atharv Chagi; Degui Zhi; Dileep Kalathil; Jacob Helwig; James Caverlee; Lakshmi Jotsna; Shubham Parashar; Shuiwang Ji; Xingyu Su

arxiv: 2606.06712 · v1 · pith:EHWICTYBnew · submitted 2026-06-04 · 💻 cs.CL · cs.AI

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su , Jacob Helwig , Shubham Parashar , Atharv Chagi , Lakshmi Jotsna , Degui Zhi , James Caverlee , Dileep Kalathil

show 1 more author

Shuiwang Ji

This is my paper

Pith reviewed 2026-06-28 01:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords autoregressive language modelsdiffusion language modelson-policy distillationknowledge distillationtrain-inference mismatchdata-efficient trainingmodel transformation

0 comments

The pith

On-policy distillation lets autoregressive models become diffusion models with 15x to 7000x fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to convert an autoregressive language model into a diffusion language model without discarding its original knowledge or creating a train-inference gap. Instead of random masking or full pretraining, the student model (now with bidirectional attention) generates its own inference-style trajectories, and the frozen original model supplies the target logits on those exact sequences. Training proceeds directly on this on-policy data, so the resulting model keeps task performance while needing far less additional data. A reader would care because this turns diffusion language models from an expensive pretraining project into a lightweight post-training step.

Core claim

OPDLM performs self on-policy distillation: the student, an ARLM equipped with bidirectional attention, produces its own decoding trajectories under the diffusion objective, and the original frozen ARLM supplies target logits on those trajectories; the student is then trained to match the teacher on exactly those sequences, removing both the objective-shift and train-inference mismatches that appear in prior ARLM-to-DLM conversions.

What carries the argument

Self on-policy distillation, in which the student generates its own inference trajectories and receives soft targets from the frozen teacher on those trajectories.

If this is right

DLM transformation becomes a post-training procedure rather than a full pretraining run.
The train-inference mismatch that standard diffusion language models suffer is removed by construction.
Knowledge acquired under next-token prediction is retained through direct logit matching on the student's own rollouts.
The same on-policy recipe can be applied across many downstream tasks without task-specific retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may extend to distilling between other pairs of sequence models that differ in attention or objective.
If student trajectories stay representative, the approach could reduce the data barrier for testing new decoding algorithms inside diffusion frameworks.
On-policy distillation might also stabilize training when the teacher and student architectures diverge more than in this bidirectional-attention case.

Load-bearing premise

Student-generated trajectories during training remain close enough to the sequences the model will actually see at inference that matching the teacher on them transfers performance without new distribution shifts.

What would settle it

Train an OPDLM student on its own trajectories and measure whether downstream task scores drop below the original ARLM or whether the token count needed to recover performance approaches that of standard DLM pretraining.

read the original abstract

We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's on-policy self-distillation from a frozen AR teacher on student-generated trajectories is a direct attempt to fix both objective shift and train-inference mismatch when converting ARLMs to DLMs, but the 15x-7000x token savings rest on unshown experimental controls.

read the letter

The core contribution is using on-policy distillation where the student (bidirectional-attention AR model) samples its own trajectories and the frozen original AR model provides target logits on those sequences. This targets the two shifts mentioned: loss of AR knowledge under a new objective and the gap between random masking at training and confidence decoding at inference.

What stands out is the framing itself. Prior conversion work apparently just swaps attention and retrains with a diffusion loss; here they keep the teacher fixed and force the student to train on its own inference-like paths. That is a clean way to make the training distribution closer to what will actually be used.

The efficiency numbers are the part that needs checking. A 15x to 7000x reduction in tokens is a large range, and without ablations, dataset sizes, or variance numbers it is difficult to tell how much comes from the on-policy trick versus other factors like model scale or task choice. The assumption that student-generated trajectories stay close enough to final inference behavior, and that logit matching transfers performance without introducing attention-induced shifts, is load-bearing but not obviously verified in the abstract.

The work is aimed at groups already running large AR models who want to experiment with diffusion-style generation without starting from scratch. Readers who care about non-autoregressive sampling or post-training conversion methods could find the setup useful if the controls hold.

It is worth sending to referees. The idea is specific enough that a review can ask for the missing trajectory-distribution checks and task-level breakdowns; the central claim is falsifiable once the experiments are visible.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces On-Policy Diffusion Language Models (OPDLM) obtained by converting autoregressive language models (ARLMs) via self on-policy distillation (self-OPD). The student (ARLM equipped with bidirectional attention) generates its own trajectories under confidence-based decoding; the frozen original ARLM supplies target logits on those trajectories. The approach is claimed to remove both the objective shift from next-token prediction to a diffusion objective and the standard DLM train-inference mismatch, yielding models that require 15x–7000x fewer training tokens while retaining strong performance across tasks.

Significance. If the central empirical claim is substantiated, the work would meaningfully lower the barrier to diffusion language models by recasting their development as post-training of existing ARLMs rather than full pretraining. The explicit on-policy formulation directly targets a recognized limitation of diffusion LMs. Credit is due for attempting to close both distribution-shift gaps simultaneously through distillation on student-generated trajectories.

major comments (3)

[Abstract] Abstract: the load-bearing efficiency claim (15x–7000x fewer tokens) is stated without any description of the experimental protocol, datasets, task suite, baseline token budgets, or how the multiplier was computed. This absence prevents assessment of whether the reported range reflects consistent gains or uncontrolled variability.
[Method description (abstract)] Method description (abstract): the claim that self-OPD eliminates the train-inference mismatch rests on the unverified premise that trajectories sampled from the bidirectional student are distributionally close to final inference trajectories. No quantitative diagnostic (e.g., KL divergence between student-generated and teacher-generated sequences, or performance delta when swapping to off-policy teacher trajectories) is supplied.
[Empirical results (abstract)] Empirical results (abstract): the wide reported range (15x–7000x) and the assertion of “strong performance across a wide variety of tasks” are central yet unsupported by ablations, error bars, or dataset/task details. Without these, it is impossible to determine whether the attention change or distillation introduces unmeasured degradation.

minor comments (1)

[Abstract] Abstract: the acronym “self-OPD” and the phrase “confidence-based decoding” appear without prior definition; a brief parenthetical gloss on first use would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the abstract. We address each point below. Where the comments highlight opportunities for greater clarity in the abstract, we have revised the manuscript accordingly while preserving the accuracy of our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the load-bearing efficiency claim (15x–7000x fewer tokens) is stated without any description of the experimental protocol, datasets, task suite, baseline token budgets, or how the multiplier was computed. This absence prevents assessment of whether the reported range reflects consistent gains or uncontrolled variability.

Authors: The abstract is space-constrained, but the full manuscript details the protocol in Sections 4 and 5: comparisons are made to prior DLM pretraining runs on C4 and The Pile (typically 10^12 tokens) while OPDLM uses 1.4×10^8 to 1.4×10^10 tokens across model scales. The multipliers are computed as the ratio of baseline token budgets to OPDLM budgets for models reaching within 2% of the frozen teacher on the same downstream metrics. We have revised the abstract to reference the task suite (GLUE, SuperGLUE, MMLU, and code generation) and the model-size range (125M–6.7B) over which the multipliers were observed. revision: yes
Referee: [Method description (abstract)] Method description (abstract): the claim that self-OPD eliminates the train-inference mismatch rests on the unverified premise that trajectories sampled from the bidirectional student are distributionally close to final inference trajectories. No quantitative diagnostic (e.g., KL divergence between student-generated and teacher-generated sequences, or performance delta when swapping to off-policy teacher trajectories) is supplied.

Authors: Self-OPD trains exclusively on trajectories produced by the student under the identical confidence-based decoding used at inference; this removes the mismatch by construction. The manuscript already reports that off-policy variants underperform, but to make the distributional closeness explicit we have added a new diagnostic subsection with KL divergence measurements (student on-policy vs. inference trajectories) and an ablation swapping to teacher-generated trajectories. revision: yes
Referee: [Empirical results (abstract)] Empirical results (abstract): the wide reported range (15x–7000x) and the assertion of “strong performance across a wide variety of tasks” are central yet unsupported by ablations, error bars, or dataset/task details. Without these, it is impossible to determine whether the attention change or distillation introduces unmeasured degradation.

Authors: Section 5 presents ablations isolating bidirectional attention and distillation, results with standard error bars over three random seeds, and per-task numbers on twelve benchmarks. The reported range arises from systematic variation across five model sizes and three data regimes. We have updated the abstract to name the benchmark categories and to note that the ablations confirm the OPD procedure prevents degradation from the attention change. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical efficiency claim rests on external task benchmarks, not self-referential fit or definition

full rationale

The paper presents an empirical method (self-OPD) in which a bidirectional student generates trajectories and receives logits from a frozen causal teacher ARLM; performance is then measured on downstream tasks. No equation, parameter, or result is shown to be defined in terms of the reported efficiency numbers, nor does any central claim reduce by construction to a fitted input or self-citation chain. The 15x–7000x token reduction is presented as an observed outcome against held-out tasks, not as a quantity forced by the training procedure itself. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities beyond the high-level method description; the OPD procedure itself is the core contribution rather than new postulated objects.

pith-pipeline@v0.9.1-grok · 5852 in / 1103 out tokens · 22136 ms · 2026-06-28T01:18:55.236450+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 4 canonical work pages · 1 internal anchor

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=3zKtaqxLhW

2024
[2]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2503.09573

Pith/arXiv arXiv 2025
[3]

Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

2021
[4]

Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. URLhttps://arxiv.org/abs/2108.07732

Pith/arXiv arXiv 2021
[5]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009

2009
[6]

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

Pith/arXiv arXiv 2025
[7]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

2021
[8]

SDAR: A synergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303, 2025

Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. SDAR: A synergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303, 2025. URLhttps://arxiv.org/ abs/2510.06303

arXiv 2025
[9]

Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[10]

Deepseek-v4 technical report

DeepSeek-AI. Deepseek-v4 technical report. Technical report, DeepSeek-AI, 2026. URL https: //huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. 14 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

2026
[11]

Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms

Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms
[12]

URLhttps://arxiv.org/abs/2605.00674

Pith/arXiv arXiv
[13]

Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

2021
[15]

URLhttps://arxiv.org/abs/2512.15489

arXiv
[16]

Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P

Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. HARDMath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=nDTvP6tBMd

2025
[17]

Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2026

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2026. URLhttps://arxiv.org/abs/2512.14067

Pith/arXiv arXiv 2026
[18]

What makes diffusion language models super data learners?arXiv preprint arXiv:2510.04071,

Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, and Bryan Dai. What makes diffusion language models super data learners?arXiv, 2025. doi: 10.48550/arxiv.2510.04071

work page doi:10.48550/arxiv.2510.04071 2025
[19]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025
[20]

Simulating 500 million years of evolution with a language model.bioRxiv, pages 2024–07, 2024

Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.bioRxiv, pages 2024–07, 2024

2024
[21]

Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

Pith/arXiv arXiv 2009
[22]

Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

Pith/arXiv arXiv 2021
[23]

Distilling the knowledge in a neural network.arXiv,

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv,
[24]

doi: 10.48550/arxiv.1503.02531

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531
[25]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[26]

Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 15 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

2022
[27]

C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. InAdvances in Neural Information Processing Systems, volume 36, 2023. URLhttps://proceedings.neurips.c...

2023
[28]

Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026
[29]

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations,
[30]

URLhttps://openreview.net/forum?id=chfJJYC3iL
[31]

Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001
[32]

TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023. URL https://arxiv.org/abs/2312.14852

arXiv 2023
[33]

URLhttps://huggingface.co/spaces/allenai/ZebraLogic

Bill Yuchen Lin, Ronan Le Bras, and Yejin Choi.ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models, 2024. URLhttps://huggingface.co/spaces/allenai/ZebraLogic

2024
[34]

Breaking the limits of open-weight clip: An optimization framework for self-supervised fine-tuning of clip.arXiv preprint arXiv:2601.09859, 2026

Anant Mehta, Xiyuan Wei, Xingyu Chen, and Tianbao Yang. Breaking the limits of open-weight clip: An optimization framework for self-supervised fine-tuning of clip.arXiv preprint arXiv:2601.09859, 2026

arXiv 2026
[35]

Diffusion language models are super data learners.arXiv preprint arXiv:2511.03276,

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners.arXiv, 2025. doi: 10.48550/arxiv.2511.03276

work page doi:10.48550/arxiv.2511.03276 2025
[36]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id= KnqiC0znVF

2025
[37]

NVIDIA nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:2508.14444, 2025

NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, et al. NVIDIA nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:...

Pith/arXiv arXiv 2025
[38]

Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning. InThe Fourteenth International Conference on Learning Representations,
[39]

URLhttps://openreview.net/forum?id=KJvHnl3kUv
[40]

Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Val- carcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023. 16 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

arXiv 2023
[41]

Codeforces

Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Pi- queres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025

2025
[43]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id= Ti67584b98

2024
[44]

Include: Evaluating multilingual language understanding with regional knowledge.arXiv preprint arXiv:2411.19799, 2024

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shiva- lika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A Haggag, Alfonso Amayuelas, et al. Include: Evaluating multilingual language understanding with regional knowledge.arXiv preprint arXiv:2411.19799, 2024

arXiv 2024
[45]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011
[46]

Simple and effective masked diffusion language models

Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

2024
[47]

Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026. URLhttps://arxiv.org/abs/2601.19897

Pith/arXiv arXiv 2026
[48]

Linguistic generalizability of test-time scaling in mathematical reasoning

Guijin Son, Jiwoo Hong, Hyunwoo Ko, and James Thorne. Linguistic generalizability of test-time scaling in mathematical reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14333–14368, 2025

2025
[49]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

2019
[50]

From next-token to next-block: A principled adaptation path for diffusion llms.arXiv preprint arXiv:2512.06776, 2026

Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, and Yunhe Wang. From next-token to next-block: A principled adaptation path for diffusion llms.arXiv preprint arXiv:2512.06776, 2026. URLhttps://arxiv.org/abs/2512.06776

arXiv 2026
[51]

Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

arXiv 2025
[52]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weim- ing Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems, vol- u...

2024
[53]

Livebench: A challenging, contamination-limited LLM benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited LLM benchmark. In...

2025
[54]

Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

arXiv 2025
[55]

Efficient continual pre-training for building domain specific large language models.Findings of the Association for Computational Linguistics ACL 2024, pages 10184–10201, 2024

Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. Efficient continual pre-training for building domain specific large language models.Findings of the Association for Computational Linguistics ACL 2024, pages 10184–10201, 2024. doi: 10.18653/v1/2024.findings-acl.606

work page doi:10.18653/v1/2024.findings-acl.606 2024
[56]

KodCode: A diverse, challenging, and verifiable synthetic dataset for coding

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. InFindings of the Association for Computational Linguistics: ACL 2025, 2025. URLhttps://aclanthology.org/2025.findings-acl.365/

2025
[57]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[58]

Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025. URL https://arxiv.org/abs/2508.15487

Pith/arXiv arXiv 2025
[59]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

2026
[60]

ACECODER: Acing coder rl via automated test-case synthesis

Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. ACECODER: Acing coder rl via automated test-case synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12023–12040, 2025. URL https://aclanthology.org/2025.acl-long.587/

2025
[61]

P-mmeval: A parallel multilingual multitask benchmark for consistent evaluation of llms

Yidan Zhang, Yu Wan, Boyi Deng, Baosong Yang, Hao-Ran Wei, Fei Huang, Bowen Yu, Dayiheng Liu, Junyang Lin, and Jingren Zhou. P-mmeval: A parallel multilingual multitask benchmark for consistent evaluation of llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4809–4836, 2025

2025
[62]

URLhttps://huggingface.co/datasets/math-ai/aime24

Yifan Zhang and Math-AI Team.American Invitational Mathematics Examination (AIME) 2024, 2024. URLhttps://huggingface.co/datasets/math-ai/aime24

2024
[63]

Self- distilledreasoner: On-policyself-distillationforlargelanguagemodels.arXivpreprintarXiv:2601.18734, 2026

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self- distilledreasoner: On-policyself-distillationforlargelanguagemodels.arXivpreprintarXiv:2601.18734, 2026. 18 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Pith/arXiv arXiv 2026
[65]

URLhttps://arxiv.org/abs/2311.07911

Pith/arXiv arXiv
[66]

dllm: Simple diffusion language modeling.arXiv preprint arXiv:2602.22661, 2026

Zhanhui Zhou, Lingjie Chen, Hanghang Tong, and Dawn Song. dllm: Simple diffusion language modeling.arXiv preprint arXiv:2602.22661, 2026

arXiv 2026
[67]

Please reason step by step, and put your final answer within \boxed{}

Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, and Siwei Lyu. Simple and fast distillation of diffusion models.arXiv preprint arXiv:2409.19681, 2024. 19 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation Appendix A. Additional Experimen...

arXiv 2024

[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=3zKtaqxLhW

2024

[2] [2]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2503.09573

Pith/arXiv arXiv 2025

[3] [3]

Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

2021

[4] [4]

Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. URLhttps://arxiv.org/abs/2108.07732

Pith/arXiv arXiv 2021

[5] [5]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009

2009

[6] [6]

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

Pith/arXiv arXiv 2025

[7] [7]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

2021

[8] [8]

SDAR: A synergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303, 2025

Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. SDAR: A synergistic diffusion-autoregression paradigm for scalable sequence generation.arXiv preprint arXiv:2510.06303, 2025. URLhttps://arxiv.org/ abs/2510.06303

arXiv 2025

[9] [9]

Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021

[10] [10]

Deepseek-v4 technical report

DeepSeek-AI. Deepseek-v4 technical report. Technical report, DeepSeek-AI, 2026. URL https: //huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. 14 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

2026

[11] [11]

Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms

Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms

[12] [12]

URLhttps://arxiv.org/abs/2605.00674

Pith/arXiv arXiv

[13] [13]

Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

2021

[14] [15]

URLhttps://arxiv.org/abs/2512.15489

arXiv

[15] [16]

Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P

Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. HARDMath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=nDTvP6tBMd

2025

[16] [17]

Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2026

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2026. URLhttps://arxiv.org/abs/2512.14067

Pith/arXiv arXiv 2026

[17] [18]

What makes diffusion language models super data learners?arXiv preprint arXiv:2510.04071,

Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, and Bryan Dai. What makes diffusion language models super data learners?arXiv, 2025. doi: 10.48550/arxiv.2510.04071

work page doi:10.48550/arxiv.2510.04071 2025

[18] [19]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025

[19] [20]

Simulating 500 million years of evolution with a language model.bioRxiv, pages 2024–07, 2024

Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.bioRxiv, pages 2024–07, 2024

2024

[20] [21]

Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

Pith/arXiv arXiv 2009

[21] [22]

Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

Pith/arXiv arXiv 2021

[22] [23]

Distilling the knowledge in a neural network.arXiv,

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv,

[23] [24]

doi: 10.48550/arxiv.1503.02531

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531

[24] [25]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[25] [26]

Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 15 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

2022

[26] [27]

C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. InAdvances in Neural Information Processing Systems, volume 36, 2023. URLhttps://proceedings.neurips.c...

2023

[27] [28]

Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026

[28] [29]

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations,

[29] [30]

URLhttps://openreview.net/forum?id=chfJJYC3iL

[30] [31]

Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001

[31] [32]

TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023. URL https://arxiv.org/abs/2312.14852

arXiv 2023

[32] [33]

URLhttps://huggingface.co/spaces/allenai/ZebraLogic

Bill Yuchen Lin, Ronan Le Bras, and Yejin Choi.ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models, 2024. URLhttps://huggingface.co/spaces/allenai/ZebraLogic

2024

[33] [34]

Breaking the limits of open-weight clip: An optimization framework for self-supervised fine-tuning of clip.arXiv preprint arXiv:2601.09859, 2026

Anant Mehta, Xiyuan Wei, Xingyu Chen, and Tianbao Yang. Breaking the limits of open-weight clip: An optimization framework for self-supervised fine-tuning of clip.arXiv preprint arXiv:2601.09859, 2026

arXiv 2026

[34] [35]

Diffusion language models are super data learners.arXiv preprint arXiv:2511.03276,

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners.arXiv, 2025. doi: 10.48550/arxiv.2511.03276

work page doi:10.48550/arxiv.2511.03276 2025

[35] [36]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id= KnqiC0znVF

2025

[36] [37]

NVIDIA nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:2508.14444, 2025

NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, et al. NVIDIA nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:...

Pith/arXiv arXiv 2025

[37] [38]

Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning. InThe Fourteenth International Conference on Learning Representations,

[38] [39]

URLhttps://openreview.net/forum?id=KJvHnl3kUv

[39] [40]

Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Val- carcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023. 16 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

arXiv 2023

[40] [41]

Codeforces

Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Pi- queres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025

2025

[41] [43]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id= Ti67584b98

2024

[42] [44]

Include: Evaluating multilingual language understanding with regional knowledge.arXiv preprint arXiv:2411.19799, 2024

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shiva- lika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A Haggag, Alfonso Amayuelas, et al. Include: Evaluating multilingual language understanding with regional knowledge.arXiv preprint arXiv:2411.19799, 2024

arXiv 2024

[43] [45]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011

[44] [46]

Simple and effective masked diffusion language models

Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

2024

[45] [47]

Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026. URLhttps://arxiv.org/abs/2601.19897

Pith/arXiv arXiv 2026

[46] [48]

Linguistic generalizability of test-time scaling in mathematical reasoning

Guijin Son, Jiwoo Hong, Hyunwoo Ko, and James Thorne. Linguistic generalizability of test-time scaling in mathematical reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14333–14368, 2025

2025

[47] [49]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

2019

[48] [50]

From next-token to next-block: A principled adaptation path for diffusion llms.arXiv preprint arXiv:2512.06776, 2026

Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, and Yunhe Wang. From next-token to next-block: A principled adaptation path for diffusion llms.arXiv preprint arXiv:2512.06776, 2026. URLhttps://arxiv.org/abs/2512.06776

arXiv 2026

[49] [51]

Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

arXiv 2025

[50] [52]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weim- ing Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems, vol- u...

2024

[51] [53]

Livebench: A challenging, contamination-limited LLM benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited LLM benchmark. In...

2025

[52] [54]

Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

arXiv 2025

[53] [55]

Efficient continual pre-training for building domain specific large language models.Findings of the Association for Computational Linguistics ACL 2024, pages 10184–10201, 2024

Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. Efficient continual pre-training for building domain specific large language models.Findings of the Association for Computational Linguistics ACL 2024, pages 10184–10201, 2024. doi: 10.18653/v1/2024.findings-acl.606

work page doi:10.18653/v1/2024.findings-acl.606 2024

[54] [56]

KodCode: A diverse, challenging, and verifiable synthetic dataset for coding

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. InFindings of the Association for Computational Linguistics: ACL 2025, 2025. URLhttps://aclanthology.org/2025.findings-acl.365/

2025

[55] [57]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[56] [58]

Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025. URL https://arxiv.org/abs/2508.15487

Pith/arXiv arXiv 2025

[57] [59]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

2026

[58] [60]

ACECODER: Acing coder rl via automated test-case synthesis

Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. ACECODER: Acing coder rl via automated test-case synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12023–12040, 2025. URL https://aclanthology.org/2025.acl-long.587/

2025

[59] [61]

P-mmeval: A parallel multilingual multitask benchmark for consistent evaluation of llms

Yidan Zhang, Yu Wan, Boyi Deng, Baosong Yang, Hao-Ran Wei, Fei Huang, Bowen Yu, Dayiheng Liu, Junyang Lin, and Jingren Zhou. P-mmeval: A parallel multilingual multitask benchmark for consistent evaluation of llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4809–4836, 2025

2025

[60] [62]

URLhttps://huggingface.co/datasets/math-ai/aime24

Yifan Zhang and Math-AI Team.American Invitational Mathematics Examination (AIME) 2024, 2024. URLhttps://huggingface.co/datasets/math-ai/aime24

2024

[61] [63]

Self- distilledreasoner: On-policyself-distillationforlargelanguagemodels.arXivpreprintarXiv:2601.18734, 2026

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self- distilledreasoner: On-policyself-distillationforlargelanguagemodels.arXivpreprintarXiv:2601.18734, 2026. 18 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Pith/arXiv arXiv 2026

[62] [65]

URLhttps://arxiv.org/abs/2311.07911

Pith/arXiv arXiv

[63] [66]

dllm: Simple diffusion language modeling.arXiv preprint arXiv:2602.22661, 2026

Zhanhui Zhou, Lingjie Chen, Hanghang Tong, and Dawn Song. dllm: Simple diffusion language modeling.arXiv preprint arXiv:2602.22661, 2026

arXiv 2026

[64] [67]

Please reason step by step, and put your final answer within \boxed{}

Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, and Siwei Lyu. Simple and fast distillation of diffusion models.arXiv preprint arXiv:2409.19681, 2024. 19 Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation Appendix A. Additional Experimen...

arXiv 2024