Masked Language Flow Models

Iskander Azangulov; Kianoosh Ashouritaklimi; Leo Zhang; Patrick Rebeschini; Simon Vary

arxiv: 2606.27617 · v1 · pith:5VNZEUMTnew · submitted 2026-06-26 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-29 01:15 UTC · model grok-4.3

keywords: masked language flow models, flow language models, masked diffusion models, stochastic interpolant, reasoning tasks, instruction following, continuous flows, alternating sampler

The pith

Masked Language Flow Models combine masking with continuous flows so language models can perform multi-step reasoning without decoding every token upfront.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous flow language models learn a continuous transport from noise to clean sequences but must decode all tokens at once, which hinders tasks that need iterative reasoning. Masked Language Flow Models add a continuous stochastic interpolant that connects partially masked sequences to full ones, letting the model generate conditionally and unmask tokens selectively during sampling. A new procedure alternates continuous denoising steps with discrete unmasking of high-confidence tokens. On GSM8K and MT-Bench this produces the first reported success of flow-based models on reasoning and instruction-following benchmarks.

Core claim

MLFMs extend flow language models by inserting masking through a continuous stochastic interpolant that transports between partially masked and clean token sequences in Euclidean space. The resulting flow map supports conditional generation, admits lightweight conversion from pretrained masked diffusion models, and pairs with an alternating sampler that interleaves continuous denoising and discrete unmasking of confident tokens.

What carries the argument

The continuous stochastic interpolant that bridges partially masked sequences and clean sequences so the learned flow supports selective unmasking.
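
For orientation, one common linear form of a stochastic interpolant, together with a hypothetical masked variant, can be written as follows. This is a sketch under stated assumptions: the symbols $x_0$, $x_1$, $\alpha$, $\beta$, the mask $m$, and the per-position case split are illustrative choices and are not taken from the paper, whose exact construction is not reproduced on this page.

```latex
% Generic linear interpolant between a source x_0 (noise) and a target x_1 (clean embeddings)
x_t = \alpha(t)\, x_0 + \beta(t)\, x_1,
\qquad \alpha(0) = \beta(1) = 1, \quad \alpha(1) = \beta(0) = 0 .

% Hypothetical masked variant: m \in \{0,1\}^L marks positions that are still masked;
% positions with m_i = 0 are held at their clean embedding.
x_t^{(i)} =
\begin{cases}
\alpha(t)\, x_0^{(i)} + \beta(t)\, x_1^{(i)}, & m_i = 1, \\
x_1^{(i)}, & m_i = 0 .
\end{cases}

% Regression target for a velocity field on the masked positions
\dot{x}_t^{(i)} = \dot{\alpha}(t)\, x_0^{(i)} + \dot{\beta}(t)\, x_1^{(i)} \qquad (m_i = 1).
```

With $\alpha(t) = 1 - t$ and $\beta(t) = t$ the velocity target reduces to $x_1^{(i)} - x_0^{(i)}$, the straight-line case.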

If this is right

Pretrained masked diffusion models convert to MLFMs with only lightweight adaptation.
Continuous flows now support conditional generation without forcing full-token decoding at every step.
The alternating sampler of continuous denoising and discrete unmasking enables multi-step reasoning.
Flow-based models reach usable performance on GSM8K math reasoning and MT-Bench instruction following.
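
The alternating procedure described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names (`alternating_sampler`, `velocity_fn`, `logits_fn`), the confidence threshold, the time schedule, the fallback that commits one token per round, and the stand-in "model" that already knows the target sequence. It is not the paper's sampler, only a minimal instance of interleaving continuous Euler steps with discrete unmasking of confident positions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def alternating_sampler(velocity_fn, logits_fn, vocab_emb, seq_len,
                        n_rounds=4, euler_steps=8, threshold=0.9, seed=0):
    """Alternate Euler denoising with unmasking of confident tokens (illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq_len, vocab_emb.shape[1]))  # start from noise
    masked = np.ones(seq_len, dtype=bool)                   # True = not yet committed
    tokens = np.full(seq_len, -1, dtype=int)                # -1 = still masked
    dt = 1.0 / (n_rounds * euler_steps)
    t = 0.0
    for r in range(n_rounds):
        # Continuous phase: Euler steps that move only the still-masked positions.
        for _ in range(euler_steps):
            v = velocity_fn(x, t)
            x[masked] += dt * v[masked]
            t += dt
        # Discrete phase: commit tokens whose confidence clears the threshold.
        probs = softmax(logits_fn(x))
        conf, choice = probs.max(axis=-1), probs.argmax(axis=-1)
        pick = masked & (conf >= threshold)
        if r == n_rounds - 1:
            pick = masked.copy()                            # last round: decode the rest
        elif masked.any() and not pick.any():
            pick[np.argmax(np.where(masked, conf, -np.inf))] = True  # ensure progress
        tokens[pick] = choice[pick]
        x[pick] = vocab_emb[choice[pick]]                   # snap to the clean embedding
        masked &= ~pick
    return tokens

# Toy stand-ins (hypothetical): a 5-token vocabulary with well-separated embeddings
# and a "model" that already knows the target sequence.
vocab_emb = 30.0 * np.eye(5)
target = np.array([3, 1, 4, 0, 2, 1])

def velocity_fn(x, t):
    # Straight-line velocity toward the clean embeddings.
    return (vocab_emb[target] - x) / max(1.0 - t, 1e-3)

def logits_fn(x):
    # Negative squared distance to each embedding; 200.0 is an arbitrary temperature
    # that makes confidences rise gradually with t.
    d2 = ((x[:, None, :] - vocab_emb[None, :, :]) ** 2).sum(axis=-1)
    return -d2 / 200.0

tokens = alternating_sampler(velocity_fn, logits_fn, vocab_emb, seq_len=len(target))
print(tokens)  # recovers `target` in this toy setup
```

In this toy the positions are independent, so it shows only the control flow. In a real model the velocity and logits would be conditioned on the already committed tokens, which is where the multi-step behaviour would come from.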

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The hybrid continuous-discrete sampler may extend naturally to other sequence tasks that mix local certainty with global structure.
Because the conversion from masked diffusion models is lightweight, existing diffusion checkpoints become immediate starting points for flow-based reasoning systems.
The approach suggests that future language models could switch between fully continuous and partially discrete regimes depending on the reasoning depth required.

Load-bearing premise

The continuous stochastic interpolant creates a usable bridge between masked and clean sequences that preserves the advantages of flow while enabling conditional, multi-step generation.

What would settle it

If the resulting models show no improvement over prior flow language models on GSM8K accuracy or MT-Bench win rates, the claim that masking plus interpolation overcomes the token-decoding barrier would be refuted.
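
As a rough illustration of the GSM8K side of this test, the sketch below scores generations by exact match on the final number. GSM8K reference solutions end with a line of the form `#### <answer>`; taking the last number in the model output as its answer is a common heuristic and is an assumption here, not necessarily the paper's protocol. The helper names `final_number` and `gsm8k_accuracy` and the two example problems are made up for illustration.

```python
import re

def final_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    if not nums:
        return None
    return float(nums[-1].replace(",", "").rstrip("."))

def gsm8k_accuracy(predictions, references):
    """Exact-match accuracy on the final numeric answer."""
    hits = 0
    for pred, ref in zip(predictions, references):
        gold = final_number(ref.split("####")[-1])   # answer after the '####' marker
        guess = final_number(pred)                   # heuristic: last number generated
        hits += int(gold is not None and guess is not None and guess == gold)
    return hits / len(references)

# Made-up examples in the GSM8K reference format.
refs = [
    "She sells 9 eggs at $2 each. 9 * 2 = 18\n#### 18",
    "He runs 3 * 20 = 60 meters.\n#### 60",
]
preds = [
    "9 eggs times 2 dollars is 18.",
    "The total is 50 meters",
]
acc = gsm8k_accuracy(preds, refs)
print(acc)  # 0.5: the first prediction matches, the second does not
```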

Figures

Figures reproduced from arXiv: 2606.27617 by Iskander Azangulov, Kianoosh Ashouritaklimi, Leo Zhang, Patrick Rebeschini, Simon Vary.

**Figure 2.** Qualitative example of MLFM performing conditional generation on MT-Bench.

**Figure 3.** Qualitative example of MLFM performing conditional generation on GSM8K.

Original abstract

Masked Diffusion Models (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions -- an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. Flow Language Models (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The interpolant-plus-alternating-sampler construction is a real attempt to fix FLM reasoning limits, but the abstract supplies zero evidence it works.

The paper's main move is to embed masking into flow language models via a continuous stochastic interpolant that connects partially masked sequences to clean ones, plus a sampler that flips between continuous denoising and discrete unmasking of high-confidence tokens. This is meant to let FLMs do conditional generation and multi-step reasoning without having to decode every token up front, and it claims a lightweight way to turn pretrained MDMs into the new models.

What is actually new is the interpolant construction itself and the hybrid sampler; those pieces are not just a re-labeling of existing MDM or FLM work. The abstract also correctly flags the position-factorization problem in few-step MDM sampling and the full-decode requirement in FLMs.

The soft spot is the entire evaluation. The abstract offers no derivations, no training details, no ablation on the interpolant or the discrete step, no numbers on GSM8K or MT-Bench, and no check on whether the unmasking step preserves the flow properties needed for reasoning. The stress-test worry that the discrete unmasking could reintroduce token-position issues or force full decoding looks live precisely because nothing in the abstract rules it out.

This is for people already deep in diffusion and flow language models who want to see if the hybrid idea can be made concrete. A reader looking for a working method or reproducible result will not get much yet.

I would not send it to referees in its current form; the central claim needs at least the experimental section and some verification that the sampler does not break the flow before it is worth referee time.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Masked Language Flow Models (MLFMs) that extend Flow Language Models (FLMs) by incorporating masking via a continuous stochastic interpolant bridging partially masked and clean sequences. This enables conditional generation and a novel hybrid sampler that alternates continuous denoising steps with discrete unmasking of high-confidence tokens. The approach also allows lightweight adaptation of pretrained Masked Diffusion Models (MDMs) into MLFMs. Experiments on GSM8K and MT-Bench are presented to support the claim that this is the first demonstration of flow-based language models scaling to downstream reasoning and instruction-following tasks.

Significance. If the hybrid sampler preserves the continuous flow properties while enabling multi-step reasoning, the work would meaningfully connect the parallel-generation advantages of MDMs with the single-step distillation potential of FLMs. The lightweight adaptation mechanism from existing MDMs is a practical strength that could accelerate adoption. The evaluations on GSM8K and MT-Bench, if they include appropriate controls and ablations, would provide the first concrete evidence that flow-based models can handle tasks previously limited by full-token decoding requirements.

major comments (2)

[Section describing the novel sampler (likely §3)] The central claim that the hybrid sampler supports multi-step reasoning without reintroducing token-position factorization issues (as in MDMs) or full decoding requirements (as in FLMs) rests on the continuous stochastic interpolant successfully bridging the discrete unmasking steps. No derivation or invariance argument is provided showing that the flow map remains well-defined after each discrete intervention; this is load-bearing for the GSM8K results.
[Method section on adaptation procedure] The abstract states that pretrained MDMs can be converted into MLFMs 'through a simple, lightweight adaptation.' Without the precise loss formulation or the number of additional parameters updated during adaptation, it is impossible to assess whether this conversion preserves the flow properties or merely fine-tunes a discrete component.

minor comments (2)

[Abstract] The abstract claims 'for the first time' that flow-based models solve downstream tasks; this phrasing should be qualified with a precise citation to prior FLM work that attempted but failed on similar benchmarks.
[Preliminaries or method] Notation for the stochastic interpolant (e.g., how the masking schedule interacts with the continuous velocity field) should be introduced with an explicit equation rather than descriptive text only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where additional theoretical and methodological detail will strengthen the manuscript. We address each major comment below and will incorporate the requested clarifications in the revision.

Referee: [Section describing the novel sampler (likely §3)] The central claim that the hybrid sampler supports multi-step reasoning without reintroducing token-position factorization issues (as in MDMs) or full decoding requirements (as in FLMs) rests on the continuous stochastic interpolant successfully bridging the discrete unmasking steps. No derivation or invariance argument is provided showing that the flow map remains well-defined after each discrete intervention; this is load-bearing for the GSM8K results.

Authors: We acknowledge that the manuscript does not supply an explicit invariance argument for the flow map under discrete interventions. The hybrid sampler applies unmasking only to high-confidence tokens while the stochastic interpolant continues to govern the continuous trajectories on remaining positions; this design is intended to avoid reintroducing per-position factorization. To address the concern directly, we will add a short derivation in Section 3 showing that the flow map remains well-defined after each intervention, because unmasking fixes endpoint values without modifying the learned vector field on the still-masked coordinates. This addition will provide the requested grounding for the GSM8K results.

revision: yes

Referee: [Method section on adaptation procedure] The abstract states that pretrained MDMs can be converted into MLFMs 'through a simple, lightweight adaptation.' Without the precise loss formulation or the number of additional parameters updated during adaptation, it is impossible to assess whether this conversion preserves the flow properties or merely fine-tunes a discrete component.

Authors: The referee correctly notes that the current text does not specify the adaptation loss or the exact parameter count. In the revised manuscript we will expand the method section to state the precise loss (a flow-matching term on the continuous interpolant plus a masked-token cross-entropy term) and report that adaptation updates only the parameters of the newly introduced flow head (approximately 3% of total model parameters), leaving the pretrained MDM backbone frozen. This detail will clarify that the conversion preserves the continuous flow structure rather than merely fine-tuning discrete components.

revision: yes
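
The loss described in this simulated response (a flow-matching term plus a masked-token cross-entropy term) could look roughly like the sketch below. This is an assumption-laden illustration: the rebuttal is simulated, the paper's actual loss is not given on this page, and the function name `combined_loss`, the straight-line path, the weighting `ce_weight`, and the averaging over masked positions are all hypothetical choices.

```python
import numpy as np

def combined_loss(v_pred, x0, x1, logits, target_ids, masked, ce_weight=1.0):
    """Flow-matching MSE plus cross-entropy, both averaged over masked positions."""
    # Velocity of the straight-line path x_t = (1 - t) * x0 + t * x1 (assumed path).
    v_target = x1 - x0
    fm = ((v_pred - v_target) ** 2).sum(axis=-1)            # shape (L,)
    # Numerically stable log-softmax over the vocabulary.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -logp[np.arange(len(target_ids)), target_ids]      # shape (L,)
    n = max(masked.sum(), 1)                                 # avoid division by zero
    return (fm * masked).sum() / n + ce_weight * (ce * masked).sum() / n

# Sanity check with made-up shapes: a perfect velocity prediction and uniform logits
# should give a loss of log(V), the entropy of a uniform guess over V tokens.
rng = np.random.default_rng(0)
L, d, V = 3, 2, 4
x0 = rng.standard_normal((L, d))
x1 = rng.standard_normal((L, d))
masked = np.array([True, False, True])
loss = combined_loss(x1 - x0, x0, x1, np.zeros((L, V)), np.array([0, 1, 2]), masked)
print(loss, np.log(V))
```

Restricting both terms to masked positions mirrors the idea that committed tokens are fixed; whether the paper does this is not stated here.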

Circularity Check

0 steps flagged

No significant circularity; claims rest on novel construction and external benchmarks

The derivation introduces MLFMs by defining a continuous stochastic interpolant to bridge masked and clean sequences within FLMs, proposes a hybrid sampler alternating denoising and unmasking, and validates via evaluation on GSM8K/MT-Bench. None of these steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the architecture and results are presented as independent of the paper's own prior quantities. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields an incomplete ledger. The stochastic interpolant and the alternating sampler are the main new constructs; no explicit free parameters, axioms, or externally validated invented entities are stated.

invented entities (1)

Masked Language Flow Model (MLFM) · no independent evidence
purpose: Bridge masking and continuous flows for conditional language generation
New model class introduced to overcome stated limitations of MDMs and FLMs; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5757 in / 988 out tokens · 25277 ms · 2026-06-29T01:15:12.142057+00:00 · methodology


[1] [1]

Gpt-4 technical report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[2] [2]

Opencodeinstruct: A large-scale instruction tuning dataset for code llms

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. Opencodeinstruct: A large-scale instruction tuning dataset for code llms. arXiv preprint arXiv:2504.04030, 2025

arXiv 2025

[3] [3]

Stochastic interpolants: A unifying framework for flows and diffusions

Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26 0 (209): 0 1--80, 2025

2025

[4] [4]

Structured denoising diffusion models in discrete state-spaces

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34: 0 17981--17993, 2021

2021

[5] [5]

How to build a consistency model: Learning flow maps via self-distillation

Nicholas Boffi, Michael Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. Advances in Neural Information Processing Systems, 38: 0 33346--33382, 2026

2026

[6] [6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

Pith/arXiv arXiv 2020

[7] [7]

A continuous time framework for discrete denoising models

Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35: 0 28266--28279, 2022

2022

[8] [8]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315--11325, 2022

2022

[9] [9]

Analog bits: Generating discrete data using diffusion models with self-conditioning, 2023

Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning, 2023. URL https://arxiv.org/abs/2208.04202

arXiv 2023

[10] [10]

Langflow: Continuous diffusion rivals discrete in language modeling

Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling. arXiv preprint arXiv:2604.11748, 2026

Pith/arXiv arXiv 2026

[11] [11]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://lmsys.org/blog/2023-03-30-vicuna/

2023

[12]

Training verifiers to solve math word problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021

[13]

Scaling categorical flow maps, 2026

Oscar Davis, Anastasiia Filippova, Pierre Ablin, Victor Turrisi, Amitis Shidani, Marco Cuturi, and Louis Béthune. Scaling categorical flow maps, 2026. URL https://arxiv.org/abs/2605.07820

Pith/arXiv arXiv 2026

[14]

Stochastic processes: From applications to theory

Pierre Del Moral and Spiridon Penev. Stochastic processes: From applications to theory. Chapman and Hall/CRC, 2017

2017

[15]

Implicit chain of thought reasoning via knowledge distillation

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023

arXiv 2023

[16]

Beyond autoregression: Fast llms via self-distillation through time

Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast llms via self-distillation through time. arXiv preprint arXiv:2410.21035, 2024

arXiv 2024

[17]

Diffusion language models

Sander Dieleman. Diffusion language models. https://benanne.github.io/2023/01/09/diffusion-language.html, 2023. Accessed: 2026-01-25

2023

[18]

Hacking generative perplexity: Why unconditional text evaluation needs distributional metrics

Antonio Franca and Alexander Tong. Hacking generative perplexity: Why unconditional text evaluation needs distributional metrics. arXiv preprint arXiv:2606.08417, 2026

Pith/arXiv arXiv 2026

[19]

Mask-predict: Parallel decoding of conditional masked language models

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pr...

work page doi:10.18653/v1/d19-1633 2019

[20]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022

[21]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840--6851, 2020

2020

[22]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

2022

[23]

Elf: Embedded language flows, 2026

Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, and Kaiming He. Elf: Embedded language flows, 2026. URL https://arxiv.org/abs/2605.10938

Pith/arXiv arXiv 2026

[24]

Variational diffusion models

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696--21707, 2021

2021

[25]

Flow map language models: One-step language modeling via continuous denoising, 2026

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising, 2026. URL https://arxiv.org/abs/2602.16813

Pith/arXiv arXiv 2026

[26]

Numinamath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/num...

2024

[27]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

2023

[28]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017

[29]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2024

Pith/arXiv arXiv 2024

[30]

Scaling up masked diffusion models on text

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2025a

arXiv 2025

[31]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025b

Pith/arXiv arXiv 2025

[32]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

2021

[33]

Your absorbing discrete diffusion secretly models the conditional distributions of clean data

Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. In International Conference on Learning Representations, volume 2025, pages 64972--65009, 2025

2025

[34]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195--4205, 2023

2023

[35]

Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S. Albergo. Discrete flow maps, 2026. URL https://arxiv.org/abs/2604.09784

Pith/arXiv arXiv 2026

[36]

Candi: Hybrid discrete-continuous diffusion models

Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025

arXiv 2025

[37]

Categorical flow maps, 2026

Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps, 2026. URL https://arxiv.org/abs/2602.12233

arXiv 2026

[38]

Simple and effective masked diffusion language models

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136--130184, 2024

2024

[39]

Simplified and generalized masked diffusion for discrete data

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems, 37:103131--103167, 2024

2024

[40]

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B

2023

[41]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

2021

[42]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211--32252. PMLR, 2023

2023

[43]

Llama 2: Open foundation and fine-tuned chat models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

Pith/arXiv arXiv 2023

[44]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837, 2022

2022

[45]

Metamath: Bootstrap your own mathematical questions for large language models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In International Conference on Learning Representations, volume 2024, pages 45040--45061, 2024

2024

[46]

Continuously augmented discrete diffusion model for categorical generative modeling

Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. Continuously augmented discrete diffusion model for categorical generative modeling. arXiv preprint arXiv:2510.01329, 2025

arXiv 2025

[47]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024

arXiv 2024

[48]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595--46623, 2023

2023

[49]

Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner, 2026

Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, and Dinghuai Zhang. Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner, 2026. URL https://arxiv.org/abs/2510.03206

Pith/arXiv arXiv 2026