pith. sign in

arxiv: 2606.27617 · v1 · pith:5VNZEUMTnew · submitted 2026-06-26 · 💻 cs.CL · cs.LG

Masked Language Flow Models

Pith reviewed 2026-06-29 01:15 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords masked language flow modelsflow language modelsmasked diffusion modelsstochastic interpolantreasoning tasksinstruction followingcontinuous flowsalternating sampler
0
0 comments X

The pith

Masked Language Flow Models combine masking with continuous flows so language models can perform multi-step reasoning without decoding every token upfront.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous flow language models learn a continuous transport from noise to clean sequences but must decode all tokens at once, which hinders tasks that need iterative reasoning. Masked Language Flow Models add a continuous stochastic interpolant that connects partially masked sequences to full ones, letting the model generate conditionally and unmask tokens selectively during sampling. A new procedure alternates continuous denoising steps with discrete unmasking of high-confidence tokens. On GSM8K and MT-Bench this produces the first reported success of flow-based models on reasoning and instruction-following benchmarks.

Core claim

MLFMs extend flow language models by inserting masking through a continuous stochastic interpolant that transports between partially masked and clean token sequences in Euclidean space. The resulting flow map supports conditional generation, admits lightweight conversion from pretrained masked diffusion models, and pairs with an alternating sampler that interleaves continuous denoising and discrete unmasking of confident tokens.

What carries the argument

The continuous stochastic interpolant that bridges partially masked sequences and clean sequences so the learned flow supports selective unmasking.

If this is right

  • Pretrained masked diffusion models convert to MLFMs with only lightweight adaptation.
  • Continuous flows now support conditional generation without forcing full-token decoding at every step.
  • The alternating sampler of continuous denoising and discrete unmasking enables multi-step reasoning.
  • Flow-based models reach usable performance on GSM8K math reasoning and MT-Bench instruction following.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hybrid continuous-discrete sampler may extend naturally to other sequence tasks that mix local certainty with global structure.
  • Because the conversion from masked diffusion models is lightweight, existing diffusion checkpoints become immediate starting points for flow-based reasoning systems.
  • The approach suggests that future language models could switch between fully continuous and partially discrete regimes depending on the reasoning depth required.

Load-bearing premise

The continuous stochastic interpolant creates a usable bridge between masked and clean sequences that preserves the advantages of flow while enabling conditional, multi-step generation.

What would settle it

If the resulting models show no improvement over prior flow language models on GSM8K accuracy or MT-Bench win rates, the claim that masking plus interpolation overcomes the token-decoding barrier would be refuted.

Figures

Figures reproduced from arXiv: 2606.27617 by Iskander Azangulov, Kianoosh Ashouritaklimi, Leo Zhang, Patrick Rebeschini, Simon Vary.

Figure 1
Figure 1. Figure 1: MT-Bench and GSM8K results across different guidance scales (left) and sampler steps (right). [PITH_FULL_IMAGE:figures/full_fig_p012_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative example of MLFM performing conditional generation on MT-Bench. [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative example of MLFM performing conditional generation on GSM8K. [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
read the original abstract

Masked Diffusion Models (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions -- an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. Flow Language Models (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Masked Language Flow Models (MLFMs) that extend Flow Language Models (FLMs) by incorporating masking via a continuous stochastic interpolant bridging partially masked and clean sequences. This enables conditional generation and a novel hybrid sampler that alternates continuous denoising steps with discrete unmasking of high-confidence tokens. The approach also allows lightweight adaptation of pretrained Masked Diffusion Models (MDMs) into MLFMs. Experiments on GSM8K and MT-Bench are presented to support the claim that this is the first demonstration of flow-based language models scaling to downstream reasoning and instruction-following tasks.

Significance. If the hybrid sampler preserves the continuous flow properties while enabling multi-step reasoning, the work would meaningfully connect the parallel-generation advantages of MDMs with the single-step distillation potential of FLMs. The lightweight adaptation mechanism from existing MDMs is a practical strength that could accelerate adoption. The evaluations on GSM8K and MT-Bench, if they include appropriate controls and ablations, would provide the first concrete evidence that flow-based models can handle tasks previously limited by full-token decoding requirements.

major comments (2)
  1. [Section describing the novel sampler (likely §3)] The central claim that the hybrid sampler supports multi-step reasoning without reintroducing token-position factorization issues (as in MDMs) or full decoding requirements (as in FLMs) rests on the continuous stochastic interpolant successfully bridging the discrete unmasking steps. No derivation or invariance argument is provided showing that the flow map remains well-defined after each discrete intervention; this is load-bearing for the GSM8K results.
  2. [Method section on adaptation procedure] The abstract states that pretrained MDMs can be converted into MLFMs 'through a simple, lightweight adaptation.' Without the precise loss formulation or the number of additional parameters updated during adaptation, it is impossible to assess whether this conversion preserves the flow properties or merely fine-tunes a discrete component.
minor comments (2)
  1. [Abstract] The abstract claims 'for the first time' that flow-based models solve downstream tasks; this phrasing should be qualified with a precise citation to prior FLM work that attempted but failed on similar benchmarks.
  2. [Preliminaries or method] Notation for the stochastic interpolant (e.g., how the masking schedule interacts with the continuous velocity field) should be introduced with an explicit equation rather than descriptive text only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where additional theoretical and methodological detail will strengthen the manuscript. We address each major comment below and will incorporate the requested clarifications in the revision.

read point-by-point responses
  1. Referee: [Section describing the novel sampler (likely §3)] The central claim that the hybrid sampler supports multi-step reasoning without reintroducing token-position factorization issues (as in MDMs) or full decoding requirements (as in FLMs) rests on the continuous stochastic interpolant successfully bridging the discrete unmasking steps. No derivation or invariance argument is provided showing that the flow map remains well-defined after each discrete intervention; this is load-bearing for the GSM8K results.

    Authors: We acknowledge that the manuscript does not supply an explicit invariance argument for the flow map under discrete interventions. The hybrid sampler applies unmasking only to high-confidence tokens while the stochastic interpolant continues to govern the continuous trajectories on remaining positions; this design is intended to avoid reintroducing per-position factorization. To address the concern directly, we will add a short derivation in Section 3 showing that the flow map remains well-defined after each intervention, because unmasking fixes endpoint values without modifying the learned vector field on the still-masked coordinates. This addition will provide the requested grounding for the GSM8K results. revision: yes

  2. Referee: [Method section on adaptation procedure] The abstract states that pretrained MDMs can be converted into MLFMs 'through a simple, lightweight adaptation.' Without the precise loss formulation or the number of additional parameters updated during adaptation, it is impossible to assess whether this conversion preserves the flow properties or merely fine-tunes a discrete component.

    Authors: The referee correctly notes that the current text does not specify the adaptation loss or the exact parameter count. In the revised manuscript we will expand the method section to state the precise loss (a flow-matching term on the continuous interpolant plus a masked-token cross-entropy term) and report that adaptation updates only the parameters of the newly introduced flow head (approximately 3 % of total model parameters), leaving the pretrained MDM backbone frozen. This detail will clarify that the conversion preserves the continuous flow structure rather than merely fine-tuning discrete components. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on novel construction and external benchmarks

full rationale

The derivation introduces MLFMs by defining a continuous stochastic interpolant to bridge masked and clean sequences within FLMs, proposes a hybrid sampler alternating denoising and unmasking, and validates via evaluation on GSM8K/MT-Bench. None of these steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the architecture and results are presented as independent of the paper's own prior quantities. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review yields an incomplete ledger. The stochastic interpolant and the alternating sampler are the main new constructs; no explicit free parameters, axioms, or externally validated invented entities are stated.

invented entities (1)
  • Masked Language Flow Model (MLFM) no independent evidence
    purpose: Bridge masking and continuous flows for conditional language generation
    New model class introduced to overcome stated limitations of MDMs and FLMs; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5757 in / 988 out tokens · 25277 ms · 2026-06-29T01:15:12.142057+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 1 canonical work pages

  1. [1]

    Gpt-4 technical report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Opencodeinstruct: A large-scale instruction tuning dataset for code llms

    Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. Opencodeinstruct: A large-scale instruction tuning dataset for code llms. arXiv preprint arXiv:2504.04030, 2025

  3. [3]

    Stochastic interpolants: A unifying framework for flows and diffusions

    Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26 0 (209): 0 1--80, 2025

  4. [4]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34: 0 17981--17993, 2021

  5. [5]

    How to build a consistency model: Learning flow maps via self-distillation

    Nicholas Boffi, Michael Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. Advances in Neural Information Processing Systems, 38: 0 33346--33382, 2026

  6. [6]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  7. [7]

    A continuous time framework for discrete denoising models

    Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35: 0 28266--28279, 2022

  8. [8]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315--11325, 2022

  9. [9]

    Analog bits: Generating discrete data using diffusion models with self-conditioning, 2023

    Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning, 2023. URL https://arxiv.org/abs/2208.04202

  10. [10]

    Langflow: Continuous diffusion rivals discrete in language modeling

    Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling. arXiv preprint arXiv:2604.11748, 2026

  11. [11]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://lmsys.org/blog/2023-03-30-vicuna/

  12. [12]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  13. [13]

    Scaling categorical flow maps, 2026

    Oscar Davis, Anastasiia Filippova, Pierre Ablin, Victor Turrisi, Amitis Shidani, Marco Cuturi, and Louis Béthune. Scaling categorical flow maps, 2026. URL https://arxiv.org/abs/2605.07820

  14. [14]

    Stochastic processes: From applications to theory

    Pierre Del Moral and Spiridon Penev. Stochastic processes: From applications to theory. Chapman and Hall/CRC, 2017

  15. [15]

    Implicit chain of thought reasoning via knowledge distillation

    Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023

  16. [16]

    Beyond autoregression: Fast llms via self-distillation through time

    Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast llms via self-distillation through time. arXiv preprint arXiv:2410.21035, 2024

  17. [17]

    Diffusion language models

    Sander Dieleman. Diffusion language models. https://benanne.github.io/2023/01/09/diffusion-language.html, 2023. Accessed: 2026-01-25

  18. [18]

    Hacking generative perplexity: Why unconditional text evaluation needs distributional metrics

    Antonio Franca and Alexander Tong. Hacking generative perplexity: Why unconditional text evaluation needs distributional metrics. arXiv preprint arXiv:2606.08417, 2026

  19. [19]

    Mask-predict: Parallel decoding of conditional masked language models

    Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pr...

  20. [20]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  21. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 0 6840--6851, 2020

  22. [22]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1 0 (2): 0 3, 2022

  23. [23]

    Elf: Embedded language flows, 2026

    Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, and Kaiming He. Elf: Embedded language flows, 2026. URL https://arxiv.org/abs/2605.10938

  24. [24]

    Variational diffusion models

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34: 0 21696--21707, 2021

  25. [25]

    Boffi, and Jinwoo Kim

    Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising, 2026. URL https://arxiv.org/abs/2602.16813

  26. [26]

    Numinamath

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/num...

  27. [27]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  29. [29]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Alex Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2024

  30. [30]

    Scaling up masked diffusion models on text

    Shengqi Nie, Fenglin Zhu, Chengpeng Du, Tianyu Pang, Qi Liu, Gang Zeng, Min Lin, and Chenguang Li. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2025 a

  31. [31]

    Large language diffusion models

    Shengqi Nie, Fenglin Zhu, Zhen You, Xin Zhang, Jing Ou, Jing Hu, Jun Zhou, Yichang Lin, Ji-Rong Wen, and Chenguang Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025 b

  32. [32]

    Show your work: Scratchpads for intermediate computation with language models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

  33. [33]

    Your absorbing discrete diffusion secretly models the conditional distributions of clean data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. In International Conference on Learning Representations, volume 2025, pages 64972--65009, 2025

  34. [34]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195--4205, 2023

  35. [35]

    Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S. Albergo. Discrete flow maps, 2026. URL https://arxiv.org/abs/2604.09784

  36. [36]

    Candi: Hybrid discrete-continuous diffusion models

    Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025

  37. [37]

    Categorical flow maps, 2026

    Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps, 2026. URL https://arxiv.org/abs/2602.12233

  38. [38]

    Simple and effective masked diffusion language models

    Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37: 0 130136--130184, 2024

  39. [39]

    Simplified and generalized masked diffusion for discrete data

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems, 37: 0 103131--103167, 2024

  40. [40]

    SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

    Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama . https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B

  41. [41]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

  42. [42]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211--32252. PMLR, 2023

  43. [43]

    Llama 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  44. [44]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

  45. [45]

    Metamath: Bootstrap your own mathematical questions for large language models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In International Conference on Learning Representations, volume 2024, pages 45040--45061, 2024

  46. [46]

    Continuously augmented discrete diffusion model for categorical generative modeling

    Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. Continuously augmented discrete diffusion model for categorical generative modeling. arXiv preprint arXiv:2510.01329, 2025

  47. [47]

    Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

    Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024

  48. [48]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023

  49. [49]

    Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner, 2026

    Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, and Dinghuai Zhang. Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner, 2026. URL https://arxiv.org/abs/2510.03206