pith. machine review for the scientific record. sign in

arxiv: 2605.07924 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Amin Karimi Monsefi, Dominic Culver, Irina Belousova, Manuel R. Ciosici, Nikhil Bhendawade, Yizhe Zhang

Pith reviewed 2026-05-11 03:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords discrete flow matchingtrajectory distillationenergy-guided navigationfew-step generationlanguage modelingtext generationdistillation
0
0 comments X

The pith

Guided energy selection during trajectory construction lets an 8-step discrete flow student outperform its 1024-step teacher while running 128 times faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that poor trajectory quality, not limited student capacity, is the main barrier to effective few-step discrete flow matching for text. Standard distillation builds trajectories through unguided stochastic jumps, so early mistakes propagate and the student must copy flawed paths. TS-DFM inserts a lightweight energy compass at each midpoint to pick the most coherent continuation from candidates, improving the training signal without changing inference cost. On 170M-parameter language modeling this produces an 8-step model with 32 percent lower perplexity than the original teacher and consistent gains across distributions and evaluators. The approach also beats other discrete baselines even when those use more training data or larger models.

Core claim

Discrete flow matching trajectories formed by blind jumps can be reshaped at training time by an energy compass that selects coherent midpoints, allowing the distilled few-step student to exceed the perplexity of the original multi-step teacher on language modeling tasks.

What carries the argument

Trajectory-Shaped Discrete Flow Matching (TS-DFM) with a lightweight energy compass that evaluates and selects high-coherence continuations at trajectory midpoints.

If this is right

  • An 8-step discrete generator can deliver lower perplexity than a 1024-step teacher on language modeling while delivering 128 times faster inference.
  • Trajectory quality improvements during training transfer to better performance across source distributions and evaluators of different scales.
  • TS-DFM sets a new state-of-the-art perplexity among compared discrete generation methods, including those using six times more data or five times larger models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same midpoint navigation idea could be tested in continuous flow or diffusion models where trajectory quality also limits few-step performance.
  • If the compass works because it rewards local coherence, it might generalize to other sequential generation tasks such as code or molecular sequences.
  • Training cost rises modestly from the extra compass evaluations, but the resulting student models could enable deployment on resource-constrained devices.

Load-bearing premise

The energy compass can identify genuinely coherent continuations at midpoints without introducing biases or overfitting that would fail to generalize to new data or different model sizes.

What would settle it

If an 8-step TS-DFM student records higher perplexity than the 1024-step teacher on held-out text from the same distribution, the claim that shaped trajectories produce superior few-step performance would be refuted.

read the original abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Trajectory-Shaped Discrete Flow Matching (TS-DFM) to improve few-step discrete flow matching for text generation. It argues that poor-quality training trajectories (built from blind stochastic jumps) are the main bottleneck in distillation, not student capacity. TS-DFM introduces a lightweight energy compass that evaluates and selects coherent candidate continuations at midpoints during trajectory construction; shaping occurs only at training time. On 170M-parameter language models, the resulting 8-step student achieves 32% lower perplexity than the 1024-step teacher (128x faster) and outperforms other discrete-generation baselines, with gains consistent across source distributions and three evaluators.

Significance. If the central empirical claim holds after clarifying the energy compass, the result would be significant for efficient discrete generative modeling: it shows that trajectory quality can be improved at training time to yield students that outperform their multi-step teachers, offering a path to accelerate sampling in flow matching without changing inference cost. The consistency across evaluators and source distributions, if robust, strengthens the case that guided navigation addresses a fundamental limitation rather than a capacity issue.

major comments (3)
  1. [Abstract and §5] Abstract and §5 (Experiments): the central claim that the 8-step TS-DFM student achieves 32% lower perplexity than the 1024-step teacher depends on the energy compass producing higher-quality trajectories without embedding training-specific biases. However, the manuscript provides no explicit definition, feature set, or pseudocode for the compass (only the phrase 'lightweight energy compass'), nor any ablation showing that its selections are independent of the teacher logits or training distribution. This leaves open the possibility that the student is simply imitating an easier or biased path rather than learning a genuinely superior few-step mapping.
  2. [§3.2] §3.2 (Energy-Navigated Distillation): the assertion that 'all shaping is training-only' and inference cost is unchanged is load-bearing for the efficiency claim, yet the paper does not specify how the compass is computed (e.g., whether it uses teacher logits, external heuristics, or learned parameters) or demonstrate that it introduces no additional parameters or data leakage that could affect generalization to new evaluators.
  3. [§5] §5, Table 2 and Figure 3: while perplexity gains are reported across three evaluators of increasing scale, there are no statistical significance tests, variance estimates across random seeds, or controls for post-hoc choices in compass design. The reported consistency does not rule out overfitting to the training distribution unless the energy function is shown to be parameter-free and uncorrelated with the evaluator models.
minor comments (2)
  1. [§3] Notation for the energy function and candidate scoring is introduced without a clear equation or algorithm box, making reproduction difficult.
  2. [Abstract] The abstract claims 'gains consistent across source distributions' but the corresponding table or figure does not list the exact distributions or sample sizes used for each.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity and analysis will strengthen the manuscript. We agree that the energy compass requires explicit documentation and that statistical robustness checks are warranted. We will revise the paper to incorporate these elements while preserving the core claims. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claim that the 8-step TS-DFM student achieves 32% lower perplexity than the 1024-step teacher depends on the energy compass producing higher-quality trajectories without embedding training-specific biases. However, the manuscript provides no explicit definition, feature set, or pseudocode for the compass (only the phrase 'lightweight energy compass'), nor any ablation showing that its selections are independent of the teacher logits or training distribution. This leaves open the possibility that the student is simply imitating an easier or biased path rather than learning a genuinely superior few-step mapping.

    Authors: We acknowledge that the current manuscript does not supply an explicit definition, feature set, or pseudocode for the energy compass. In the revised version we will add a complete specification of the compass (including the coherence and entropy features it evaluates), the selection algorithm in pseudocode, and a dedicated ablation that compares guided versus unguided trajectories while holding teacher logits fixed. This will directly address whether the student learns a superior mapping or merely an easier path. revision: yes

  2. Referee: [§3.2] §3.2 (Energy-Navigated Distillation): the assertion that 'all shaping is training-only' and inference cost is unchanged is load-bearing for the efficiency claim, yet the paper does not specify how the compass is computed (e.g., whether it uses teacher logits, external heuristics, or learned parameters) or demonstrate that it introduces no additional parameters or data leakage that could affect generalization to new evaluators.

    Authors: The compass is a fixed, parameter-free heuristic that scores candidate continuations using only intrinsic sequence statistics (local token energy and short-range coherence) and does not access teacher logits, learned parameters, or training data beyond the current trajectory. All evaluations occur exclusively during training-time trajectory construction. We will expand §3.2 with the exact computation procedure and a short verification that no parameters are added and no information leaks to inference or downstream evaluators. revision: yes

  3. Referee: [§5] §5, Table 2 and Figure 3: while perplexity gains are reported across three evaluators of increasing scale, there are no statistical significance tests, variance estimates across random seeds, or controls for post-hoc choices in compass design. The reported consistency does not rule out overfitting to the training distribution unless the energy function is shown to be parameter-free and uncorrelated with the evaluator models.

    Authors: We agree that statistical significance tests, seed-wise variance, and explicit controls for design choices are missing and will add them in the revision (including p-values and standard deviations over multiple runs). The energy function is parameter-free by construction, depending solely on general sequence properties rather than model-specific logits; we will include a short analysis confirming its lack of correlation with the three evaluator families used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons

full rationale

The paper's core contribution is an empirical distillation procedure (TS-DFM) that replaces blind stochastic jumps in discrete flow matching trajectories with guided selection via a lightweight energy compass at training time only. All reported performance numbers (32% perplexity reduction at 8 steps vs. 1024-step teacher, consistency across distributions and evaluators, superiority to baselines trained on more data or larger models) are obtained from direct experimental comparisons against independent baselines and multiple external evaluators. No equations, fitted parameters, or self-citations are invoked to derive the performance gains by construction; the energy compass is presented as an auxiliary training heuristic whose effect is measured rather than assumed. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The method introduces an energy compass whose precise form, parameters, and training are not detailed in the abstract, creating an implicit dependency on an unexamined coherence evaluator.

invented entities (1)
  • energy compass no independent evidence
    purpose: Lightweight evaluator that scores candidate continuations at trajectory midpoints to select coherent paths during training
    New component introduced to replace blind jumps; no independent evidence or external validation provided in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1274 out tokens · 48810 ms · 2026-05-11T03:05:40.799359+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 6 internal anchors

  1. [1]

    and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , booktitle =

    Austin, Jacob and Johnson, Daniel D. and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , booktitle =. Structured Denoising Diffusion Models in Discrete State-Spaces , url =

  2. [2]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    Universal Guidance for Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

  3. [3]

    Proceedings of the 41st International Conference on Machine Learning , pages =

    Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

  4. [4]

    International Conference on Learning Representations (ICLR) , year=

    Residual Energy-Based Models for Text Generation , author=. International Conference on Learning Representations (ICLR) , year=

  5. [5]

    Diffusion Models Beat GANs on Image Synthesis , url =

    Dhariwal, Prafulla and Nichol, Alexander , booktitle =. Diffusion Models Beat GANs on Image Synthesis , url =

  6. [6]

    and Dieleman, Sander and Fergus, Rob and Sohl-Dickstein, Jascha and Doucet, Arnaud and Grathwohl, Will Sussman , booktitle =

    Du, Yilun and Durkan, Conor and Strudel, Robin and Tenenbaum, Joshua B. and Dieleman, Sander and Fergus, Rob and Sohl-Dickstein, Jascha and Doucet, Arnaud and Grathwohl, Will Sussman , booktitle =. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and. 2023 , editor =

  7. [7]

    Compositional Visual Generation with Composable Diffusion Models

    Liu, Nan and Li, Shuang and Du, Yilun and Torralba, Antonio and Tenenbaum, Joshua B. Compositional Visual Generation with Composable Diffusion Models. Computer Vision -- ECCV 2022. 2022

  8. [8]

    The Thirteenth International Conference on Learning Representations , year=

    One Step Diffusion via Shortcut Models , author=. The Thirteenth International Conference on Learning Representations , year=

  9. [11]

    2015 , eprint=

    Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

  10. [12]

    2026 , url=

    Amin Karimi Monsefi and Nikhil Bhendawade and Manuel Rafael Ciosici and Dominic Culver and Yizhe Zhang and Irina Belousova , booktitle=. 2026 , url=

  11. [13]

    International Conference on Learning Representations , year=

    A Distributional Approach to Controlled Text Generation , author=. International Conference on Learning Representations , year=

  12. [14]

    Predicting Structured Data , publisher=

    A Tutorial on Energy-Based Learning , author=. Predicting Structured Data , publisher=

  13. [15]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    On Distillation of Guided Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

  14. [16]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Large Language Diffusion Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  15. [17]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

    Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

  16. [18]

    International Conference on Learning Representations , year=

    Progressive Distillation for Fast Sampling of Diffusion Models , author=. International Conference on Learning Representations , year=

  17. [19]

    2021 , eprint=

    How to Train Your Energy-Based Models , author=. 2021 , eprint=

  18. [20]

    Proceedings of the 40th International Conference on Machine Learning , pages =

    Consistency Models , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

  19. [21]

    2025 , eprint=

    Dream 7B: Diffusion Large Language Models , author=. 2025 , eprint=

  20. [22]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    One-Step Diffusion with Distribution Matching Distillation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

  21. [25]

    Beyond Autoregression: Fast

    Justin Deschenaux and Caglar Gulcehre , booktitle=. Beyond Autoregression: Fast. 2025 , url=

  22. [26]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Lou, Aaron and Meng, Chenlin and Ermon, Stefano , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  23. [27]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Remasking Discrete Diffusion Models with Inference-Time Scaling , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  24. [28]

    Proceedings of the 42nd International Conference on Machine Learning , pages =

    The Diffusion Duality , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

  25. [29]

    Second Conference on Language Modeling , year=

    Unifying Autoregressive and Diffusion-Based Sequence Generation , author=. Second Conference on Language Modeling , year=

  26. [30]

    2025 , eprint=

    Dream-Coder 7B: An Open Diffusion Language Model for Code , author=. 2025 , eprint=

  27. [31]

    2025 , eprint=

    Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference , author=. 2025 , eprint=

  28. [32]

    The Thirteenth International Conference on Learning Representations , year=

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

  29. [33]

    2026 , url=

    Ling Yang and Ye Tian and Bowen Li and Xinchen Zhang and Ke Shen and Yunhai Tong and Mengdi Wang , booktitle=. 2026 , url=

  30. [35]

    International Conference on Learning Representations , year=

    Pointer Sentinel Mixture Models , author=. International Conference on Learning Representations , year=

  31. [36]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  32. [37]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  33. [38]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  34. [39]

    Advances in

    Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid , year =. Advances in

  35. [40]

    2026 , eprint=

    T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization , author=. 2026 , eprint=

  36. [41]

    2024 , eprint=

    EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling , author=. 2024 , eprint=

  37. [42]

    Proceedings of the 42nd International Conference on Machine Learning , pages =

    Optimizing Temperature for Language Models with Multi-Sample Inference , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

  38. [45]

    Block diffusion: Interpolating between autoregressive and diffusion language models

    Marianne Arriola, Subham Sekhar Sahoo, Aaron Gokaslan, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Justin T Chiu, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x

  39. [46]

    Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg

    Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 17981--17993. Curran Associates, Inc., 2021. URL ht...

  40. [47]

    Universal guidance for diffusion models

    Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  41. [48]

    Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

    Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Internati...

  42. [49]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  43. [50]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  44. [51]

    Residual energy-based models for text generation

    Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc'Aurelio Ranzato. Residual energy-based models for text generation. In International Conference on Learning Representations (ICLR), 2020

  45. [52]

    Beyond autoregression: Fast LLM s via self-distillation through time

    Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLM s via self-distillation through time. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uZ5K4HeNwd

  46. [53]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780--8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1...

  47. [54]

    Optimizing temperature for language models with multi-sample inference

    Weihua Du, Yiming Yang, and Sean Welleck. Optimizing temperature for language models with multi-sample inference. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine ...

  48. [55]

    Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl

    Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC . In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, ...

  49. [56]

    One step diffusion via shortcut models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS

  50. [57]

    Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 133345--133385. Curran Associates, Inc., 2024. doi:10.52202...

  51. [58]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

  52. [59]

    D iffusion BERT : Improving generative masked language models with diffusion models

    Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. D iffusion BERT : Improving generative masked language models with diffusion models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45...

  53. [60]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531

  54. [61]

    A distributional approach to controlled text generation

    Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jWkw45-9AbL

  55. [62]

    A tutorial on energy-based learning

    Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006

  56. [63]

    Tenenbaum

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Shai Avidan, Gabriel Brostow, Moustapha Ciss \'e , Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision -- ECCV 2022, pages 423--439, Cham, 2022. Springer Nature Switzerland

  57. [64]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

  58. [65]

    On distillation of guided diffusion models

    Chenlin Meng et al. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  59. [66]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe

  60. [67]

    FS - DFM : Fast and accurate long text generation with few-step diffusion language models

    Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, and Irina Belousova. FS - DFM : Fast and accurate long text generation with few-step diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ue1zFeD275

  61. [68]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=KnqiC0znVF

  62. [69]

    Michael Polanyi.The Tacit Dimension

    Guilherme Penedo, Hynek Kydl\' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing System...

  63. [70]

    MAUVE : Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE : Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers . In Advances in Neural Information Processing Systems , volume 34, pages 4816--4828. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/pa...

  64. [71]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  65. [72]

    Simple and effective masked diffusion language models

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages...

  66. [73]

    The diffusion duality

    Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and Volodymyr Kuleshov. The diffusion duality. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proc...

  67. [74]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI

  68. [75]

    Yang Song and Diederik P. Kingma. How to train your energy-based models, 2021. URL https://arxiv.org/abs/2101.03288

  69. [76]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211--32252. PMLR, 23--29 J...

  70. [77]

    Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025. URL h...

  71. [78]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  72. [79]

    Remasking discrete diffusion models with inference-time scaling

    Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=IJryQAOy0p

  73. [80]

    18 An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al

    Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream-coder 7b: An open diffusion language model for code, 2025. URL https://arxiv.org/abs/2509.01142

  74. [81]

    MM a DA : Multimodal large diffusion language models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MM a DA : Multimodal large diffusion language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=wczmXLuLGd

  75. [82]

    Diffusion of thought: Chain-of-thought reasoning in diffusion language models

    Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, and Lingpeng Kong. Diffusion of thought: Chain-of-thought reasoning in diffusion language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, ...

  76. [83]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models, 2025. URL https://arxiv.org/abs/2508.15487

  77. [84]

    One-step diffusion with distribution matching distillation

    Tianwei Yin et al. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  78. [85]

    arXiv preprint arXiv:2403.14541

    Shimao Zhang, Yu Bao, and Shujian Huang. Edt: Improving large language models' generation by entropy-based dynamic temperature sampling, 2024. URL https://arxiv.org/abs/2403.14541

  79. [86]

    Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, and Dimitris N. Metaxas. T3d: Few-step diffusion language models via trajectory self-distillation with direct discriminative optimization, 2026. URL https://arxiv.org/abs/2602.12262

  80. [87]

    Texygen: A benchmarking platform for text generation models

    Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, page 1097–1100, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356572. doi:1...