Masked Language Flow Models
Pith reviewed 2026-06-29 01:15 UTC · model grok-4.3
The pith
Masked Language Flow Models combine masking with continuous flows so language models can perform multi-step reasoning without decoding every token upfront.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLFMs extend flow language models by inserting masking through a continuous stochastic interpolant that transports between partially masked and clean token sequences in Euclidean space. The resulting flow map supports conditional generation, admits lightweight conversion from pretrained masked diffusion models, and pairs with an alternating sampler that interleaves continuous denoising and discrete unmasking of confident tokens.
What carries the argument
The continuous stochastic interpolant that bridges partially masked sequences and clean sequences so the learned flow supports selective unmasking.
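One way to write such an interpolant down is sketched here. It is illustrative only: the mask indicator $m_i$, the embedding map $e(\cdot)$, the Gaussian source, and the linear path are assumptions made for this sketch, not notation taken from the paper.

```latex
% Assumed notation (not from the paper): m_i = 1 marks a masked position,
% e(.) embeds token w^i into Euclidean space, \epsilon^i is Gaussian noise.
\begin{aligned}
x_1^i &= e(w^i), \qquad
x_0^i = (1 - m_i)\, x_1^i + m_i\, \epsilon^i, \quad \epsilon^i \sim \mathcal{N}(0, I), \\
x_t^i &= (1 - t)\, x_0^i + t\, x_1^i, \qquad t \in [0, 1], \\
b_t(x) &= \mathbb{E}\!\left[\, x_1 - x_0 \mid x_t = x \,\right], \qquad
\frac{\mathrm{d}X_t}{\mathrm{d}t} = b_t(X_t).
\end{aligned}
```

Under these assumptions an unmasked position ($m_i = 0$) satisfies $x_t^i = x_1^i$ for every $t$, so only the masked coordinates are transported. That is the sense in which the path runs between a partially masked sequence and a clean one.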
If this is right
- Pretrained masked diffusion models convert to MLFMs with only lightweight adaptation.
- Continuous flows now support conditional generation without forcing full-token decoding at every step.
- The alternating sampler of continuous denoising and discrete unmasking enables multi-step reasoning.
- Flow-based models reach usable performance on GSM8K math reasoning and MT-Bench instruction following.
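The alternating sampler named in the bullets above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's algorithm: the function name `alternating_sample`, the `velocity_fn` interface, the Euler integrator, the re-noising of masked positions in each round, the embedding-similarity confidence score, and the threshold value are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def alternating_sample(velocity_fn, embed, seq_len, n_euler=8,
                       threshold=0.9, seed=0):
    """Alternate continuous denoising with discrete unmasking (toy sketch).

    velocity_fn(x, t, tokens, masked) -> (seq_len, dim) array; hypothetical interface.
    embed: (vocab, dim) token embedding table.
    """
    rng = np.random.default_rng(seed)
    _, dim = embed.shape
    tokens = np.full(seq_len, -1)            # -1 marks a still-masked position
    masked = np.ones(seq_len, dtype=bool)
    while masked.any():
        # Continuous phase: Euler steps from t=0 (noise) toward t=1 (data).
        x = rng.standard_normal((seq_len, dim))
        x[~masked] = embed[tokens[~masked]]  # committed tokens stay clamped
        for k in range(n_euler):
            t = k / n_euler
            v = velocity_fn(x, t, tokens, masked)
            x[masked] += v[masked] / n_euler
        # Discrete phase: score denoised vectors against the embedding table.
        probs = softmax(x @ embed.T)
        conf = probs.max(axis=-1)
        pred = probs.argmax(axis=-1)
        commit = masked & (conf >= threshold)
        if not commit.any():
            # Guarantee progress: commit the single most confident masked position.
            idx = np.where(masked)[0]
            commit[idx[conf[idx].argmax()]] = True
        tokens[commit] = pred[commit]
        masked &= ~commit
    return tokens
```

Because every round commits at least one position, the loop ends after at most `seq_len` rounds. Positions committed in earlier rounds act as conditioning for later ones, which is the sense in which the sketch avoids decoding every token at once.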
Where Pith is reading between the lines
- The hybrid continuous-discrete sampler may extend naturally to other sequence tasks that mix local certainty with global structure.
- Because the conversion from masked diffusion models is lightweight, existing diffusion checkpoints become immediate starting points for flow-based reasoning systems.
- The approach suggests that future language models could switch between fully continuous and partially discrete regimes depending on the reasoning depth required.
Load-bearing premise
The continuous stochastic interpolant creates a usable bridge between masked and clean sequences that preserves the advantages of flow while enabling conditional, multi-step generation.
What would settle it
If the resulting models show no improvement over prior flow language models on GSM8K accuracy or MT-Bench win rates, the claim that masking plus interpolation overcomes the token-decoding barrier would be refuted.
Original abstract
Masked Diffusion Models (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions -- an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. Flow Language Models (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Masked Language Flow Models (MLFMs) that extend Flow Language Models (FLMs) by incorporating masking via a continuous stochastic interpolant bridging partially masked and clean sequences. This enables conditional generation and a novel hybrid sampler that alternates continuous denoising steps with discrete unmasking of high-confidence tokens. The approach also allows lightweight adaptation of pretrained Masked Diffusion Models (MDMs) into MLFMs. Experiments on GSM8K and MT-Bench are presented to support the claim that this is the first demonstration of flow-based language models scaling to downstream reasoning and instruction-following tasks.
Significance. If the hybrid sampler preserves the continuous flow properties while enabling multi-step reasoning, the work would meaningfully connect the parallel-generation advantages of MDMs with the single-step distillation potential of FLMs. The lightweight adaptation mechanism from existing MDMs is a practical strength that could accelerate adoption. The evaluations on GSM8K and MT-Bench, if they include appropriate controls and ablations, would provide the first concrete evidence that flow-based models can handle tasks previously limited by full-token decoding requirements.
major comments (2)
- [Section describing the novel sampler (likely §3)] The central claim that the hybrid sampler supports multi-step reasoning without reintroducing token-position factorization issues (as in MDMs) or full decoding requirements (as in FLMs) rests on the continuous stochastic interpolant successfully bridging the discrete unmasking steps. No derivation or invariance argument is provided showing that the flow map remains well-defined after each discrete intervention; this is load-bearing for the GSM8K results.
- [Method section on adaptation procedure] The abstract states that pretrained MDMs can be converted into MLFMs 'through a simple, lightweight adaptation.' Without the precise loss formulation or the number of additional parameters updated during adaptation, it is impossible to assess whether this conversion preserves the flow properties or merely fine-tunes a discrete component.
minor comments (2)
- [Abstract] The abstract claims 'for the first time' that flow-based models solve downstream tasks; this phrasing should be qualified with a precise citation to prior FLM work that attempted but failed on similar benchmarks.
- [Preliminaries or method] Notation for the stochastic interpolant (e.g., how the masking schedule interacts with the continuous velocity field) should be introduced with an explicit equation rather than descriptive text only.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which identify key areas where additional theoretical and methodological detail will strengthen the manuscript. We address each major comment below and will incorporate the requested clarifications in the revision.
- Referee: [Section describing the novel sampler (likely §3)] The central claim that the hybrid sampler supports multi-step reasoning without reintroducing token-position factorization issues (as in MDMs) or full decoding requirements (as in FLMs) rests on the continuous stochastic interpolant successfully bridging the discrete unmasking steps. No derivation or invariance argument is provided showing that the flow map remains well-defined after each discrete intervention; this is load-bearing for the GSM8K results.
Authors: We acknowledge that the manuscript does not supply an explicit invariance argument for the flow map under discrete interventions. The hybrid sampler applies unmasking only to high-confidence tokens while the stochastic interpolant continues to govern the continuous trajectories on remaining positions; this design is intended to avoid reintroducing per-position factorization. To address the concern directly, we will add a short derivation in Section 3 showing that the flow map remains well-defined after each intervention, because unmasking fixes endpoint values without modifying the learned vector field on the still-masked coordinates. This addition will provide the requested grounding for the GSM8K results. revision: yes
- Referee: [Method section on adaptation procedure] The abstract states that pretrained MDMs can be converted into MLFMs 'through a simple, lightweight adaptation.' Without the precise loss formulation or the number of additional parameters updated during adaptation, it is impossible to assess whether this conversion preserves the flow properties or merely fine-tunes a discrete component.
Authors: The referee correctly notes that the current text does not specify the adaptation loss or the exact parameter count. In the revised manuscript we will expand the method section to state the precise loss (a flow-matching term on the continuous interpolant plus a masked-token cross-entropy term) and report that adaptation updates only the parameters of the newly introduced flow head (approximately 3% of total model parameters), leaving the pretrained MDM backbone frozen. This detail will clarify that the conversion preserves the continuous flow structure rather than merely fine-tuning discrete components. revision: yes
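The two-term loss described in the simulated response could look roughly like the sketch below. It is a guess at the shape of such an objective, not the authors' formulation: the function name `adaptation_loss`, the linear-path velocity target `x1 - x0`, the restriction of both terms to masked positions, and the `ce_weight` factor are all assumptions.

```python
import numpy as np

def adaptation_loss(v_pred, x0, x1, logits, targets, masked, ce_weight=1.0):
    """Flow-matching MSE plus masked-token cross-entropy (illustrative only).

    v_pred:  (L, D) predicted velocity at the interpolated point
    x0, x1:  (L, D) source and clean embeddings; assumed linear-path target is x1 - x0
    logits:  (L, V) token logits; targets: (L,) int ids; masked: (L,) bool
    """
    if not masked.any():
        return 0.0
    # Flow-matching term: regress onto the constant velocity of a linear path.
    fm = np.mean((v_pred[masked] - (x1[masked] - x0[masked])) ** 2)
    # Cross-entropy over masked positions only, via a stable log-softmax.
    z = logits[masked]
    z = z - z.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -np.mean(logp[np.arange(z.shape[0]), targets[masked]])
    return float(fm + ce_weight * ce)
```

With a perfect velocity prediction and uniform logits over a vocabulary of size V, the flow-matching term vanishes and the loss reduces to log V, which makes a quick sanity check. In the setting the response describes, gradients from this loss would reach only the new flow head while the backbone stays frozen.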
Circularity Check
No significant circularity; claims rest on novel construction and external benchmarks
The derivation introduces MLFMs by defining a continuous stochastic interpolant to bridge masked and clean sequences within FLMs, proposes a hybrid sampler alternating denoising and unmasking, and validates via evaluation on GSM8K/MT-Bench. None of these steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the architecture and results are presented as independent of the paper's own prior quantities. This matches the default expectation for non-circular papers.
Axiom & Free-Parameter Ledger
invented entities (1)
- Masked Language Flow Model (MLFM): no independent evidence