arxiv: 2605.07924 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Amin Karimi Monsefi, Dominic Culver, Irina Belousova, Manuel R. Ciosici, Nikhil Bhendawade, Yizhe Zhang

Pith reviewed 2026-05-11 03:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords discrete flow matchingtrajectory distillationenergy-guided navigationfew-step generationlanguage modelingtext generationdistillation

0 comments

The pith

Guided energy selection during trajectory construction lets an 8-step discrete flow student outperform its 1024-step teacher while running 128 times faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that poor trajectory quality, not limited student capacity, is the main barrier to effective few-step discrete flow matching for text. Standard distillation builds trajectories through unguided stochastic jumps, so early mistakes propagate and the student must copy flawed paths. TS-DFM inserts a lightweight energy compass at each midpoint to pick the most coherent continuation from candidates, improving the training signal without changing inference cost. On 170M-parameter language modeling this produces an 8-step model with 32 percent lower perplexity than the original teacher and consistent gains across distributions and evaluators. The approach also beats other discrete baselines even when those use more training data or larger models.

Core claim

Discrete flow matching trajectories formed by blind jumps can be reshaped at training time by an energy compass that selects coherent midpoints, allowing the distilled few-step student to exceed the perplexity of the original multi-step teacher on language modeling tasks.

What carries the argument

Trajectory-Shaped Discrete Flow Matching (TS-DFM) with a lightweight energy compass that evaluates and selects high-coherence continuations at trajectory midpoints.

If this is right

An 8-step discrete generator can deliver lower perplexity than a 1024-step teacher on language modeling while delivering 128 times faster inference.
Trajectory quality improvements during training transfer to better performance across source distributions and evaluators of different scales.
TS-DFM sets a new state-of-the-art perplexity among compared discrete generation methods, including those using six times more data or five times larger models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same midpoint navigation idea could be tested in continuous flow or diffusion models where trajectory quality also limits few-step performance.
If the compass works because it rewards local coherence, it might generalize to other sequential generation tasks such as code or molecular sequences.
Training cost rises modestly from the extra compass evaluations, but the resulting student models could enable deployment on resource-constrained devices.

Load-bearing premise

The energy compass can identify genuinely coherent continuations at midpoints without introducing biases or overfitting that would fail to generalize to new data or different model sizes.

What would settle it

If an 8-step TS-DFM student records higher perplexity than the 1024-step teacher on held-out text from the same distribution, the claim that shaped trajectories produce superior few-step performance would be refuted.

read the original abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Shaping the distillation trajectories with a lightweight energy compass during training lets an 8-step student beat the 1024-step teacher by 32% perplexity on 170M language models while running 128x faster.

read the letter

The punchline here is that by replacing blind stochastic jumps with energy-guided choices when building the distillation trajectories, the authors get a student model that at 8 steps has substantially lower perplexity than the full 1024-step teacher on language modeling tasks, while running 128 times faster. What the paper does is take the standard distillation setup for discrete flow matching and add this navigation step at training time only. The idea is that early bad decisions in the trajectory hurt the student, so steering toward coherent sequences at midpoints helps the student learn a better few-step process. They report the shaped student outperforming the teacher and other baselines, with gains holding up across different source distributions and multiple evaluators of different sizes. This includes comparisons to methods that use more training data or larger models. The approach is straightforward and the empirical claims are concrete, with specific numbers on perplexity and speed. It engages with the idea that the trajectory quality matters more than just scaling the student capacity. The soft spots are around the energy compass itself. The abstract calls it lightweight, but I want to see the exact definition and whether it relies on features that could be correlated with the training set or the teacher model. If the compass introduces its own biases, the apparent superiority might not generalize as cleanly as claimed. The consistency across evaluators is reassuring, but without details on statistical tests or how candidates are scored, it's not fully locked down. The stress-test concern about embedding training-specific biases is worth checking against the full methods section. This paper is for folks working on speeding up iterative generative models, particularly in discrete settings like text generation. Anyone thinking about distillation or flow matching would find the trajectory-shaping angle useful to consider. I think it deserves to go to peer review. The results are strong enough on the surface to merit a closer look from referees who can dig into the energy function and reproducibility.

Referee Report

3 major / 2 minor

Summary. The paper proposes Trajectory-Shaped Discrete Flow Matching (TS-DFM) to improve few-step discrete flow matching for text generation. It argues that poor-quality training trajectories (built from blind stochastic jumps) are the main bottleneck in distillation, not student capacity. TS-DFM introduces a lightweight energy compass that evaluates and selects coherent candidate continuations at midpoints during trajectory construction; shaping occurs only at training time. On 170M-parameter language models, the resulting 8-step student achieves 32% lower perplexity than the 1024-step teacher (128x faster) and outperforms other discrete-generation baselines, with gains consistent across source distributions and three evaluators.

Significance. If the central empirical claim holds after clarifying the energy compass, the result would be significant for efficient discrete generative modeling: it shows that trajectory quality can be improved at training time to yield students that outperform their multi-step teachers, offering a path to accelerate sampling in flow matching without changing inference cost. The consistency across evaluators and source distributions, if robust, strengthens the case that guided navigation addresses a fundamental limitation rather than a capacity issue.

major comments (3)

[Abstract and §5] Abstract and §5 (Experiments): the central claim that the 8-step TS-DFM student achieves 32% lower perplexity than the 1024-step teacher depends on the energy compass producing higher-quality trajectories without embedding training-specific biases. However, the manuscript provides no explicit definition, feature set, or pseudocode for the compass (only the phrase 'lightweight energy compass'), nor any ablation showing that its selections are independent of the teacher logits or training distribution. This leaves open the possibility that the student is simply imitating an easier or biased path rather than learning a genuinely superior few-step mapping.
[§3.2] §3.2 (Energy-Navigated Distillation): the assertion that 'all shaping is training-only' and inference cost is unchanged is load-bearing for the efficiency claim, yet the paper does not specify how the compass is computed (e.g., whether it uses teacher logits, external heuristics, or learned parameters) or demonstrate that it introduces no additional parameters or data leakage that could affect generalization to new evaluators.
[§5] §5, Table 2 and Figure 3: while perplexity gains are reported across three evaluators of increasing scale, there are no statistical significance tests, variance estimates across random seeds, or controls for post-hoc choices in compass design. The reported consistency does not rule out overfitting to the training distribution unless the energy function is shown to be parameter-free and uncorrelated with the evaluator models.

minor comments (2)

[§3] Notation for the energy function and candidate scoring is introduced without a clear equation or algorithm box, making reproduction difficult.
[Abstract] The abstract claims 'gains consistent across source distributions' but the corresponding table or figure does not list the exact distributions or sample sizes used for each.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity and analysis will strengthen the manuscript. We agree that the energy compass requires explicit documentation and that statistical robustness checks are warranted. We will revise the paper to incorporate these elements while preserving the core claims. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claim that the 8-step TS-DFM student achieves 32% lower perplexity than the 1024-step teacher depends on the energy compass producing higher-quality trajectories without embedding training-specific biases. However, the manuscript provides no explicit definition, feature set, or pseudocode for the compass (only the phrase 'lightweight energy compass'), nor any ablation showing that its selections are independent of the teacher logits or training distribution. This leaves open the possibility that the student is simply imitating an easier or biased path rather than learning a genuinely superior few-step mapping.

Authors: We acknowledge that the current manuscript does not supply an explicit definition, feature set, or pseudocode for the energy compass. In the revised version we will add a complete specification of the compass (including the coherence and entropy features it evaluates), the selection algorithm in pseudocode, and a dedicated ablation that compares guided versus unguided trajectories while holding teacher logits fixed. This will directly address whether the student learns a superior mapping or merely an easier path. revision: yes
Referee: [§3.2] §3.2 (Energy-Navigated Distillation): the assertion that 'all shaping is training-only' and inference cost is unchanged is load-bearing for the efficiency claim, yet the paper does not specify how the compass is computed (e.g., whether it uses teacher logits, external heuristics, or learned parameters) or demonstrate that it introduces no additional parameters or data leakage that could affect generalization to new evaluators.

Authors: The compass is a fixed, parameter-free heuristic that scores candidate continuations using only intrinsic sequence statistics (local token energy and short-range coherence) and does not access teacher logits, learned parameters, or training data beyond the current trajectory. All evaluations occur exclusively during training-time trajectory construction. We will expand §3.2 with the exact computation procedure and a short verification that no parameters are added and no information leaks to inference or downstream evaluators. revision: yes
Referee: [§5] §5, Table 2 and Figure 3: while perplexity gains are reported across three evaluators of increasing scale, there are no statistical significance tests, variance estimates across random seeds, or controls for post-hoc choices in compass design. The reported consistency does not rule out overfitting to the training distribution unless the energy function is shown to be parameter-free and uncorrelated with the evaluator models.

Authors: We agree that statistical significance tests, seed-wise variance, and explicit controls for design choices are missing and will add them in the revision (including p-values and standard deviations over multiple runs). The energy function is parameter-free by construction, depending solely on general sequence properties rather than model-specific logits; we will include a short analysis confirming its lack of correlation with the three evaluator families used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons

full rationale

The paper's core contribution is an empirical distillation procedure (TS-DFM) that replaces blind stochastic jumps in discrete flow matching trajectories with guided selection via a lightweight energy compass at training time only. All reported performance numbers (32% perplexity reduction at 8 steps vs. 1024-step teacher, consistency across distributions and evaluators, superiority to baselines trained on more data or larger models) are obtained from direct experimental comparisons against independent baselines and multiple external evaluators. No equations, fitted parameters, or self-citations are invoked to derive the performance gains by construction; the energy compass is presented as an auxiliary training heuristic whose effect is measured rather than assumed. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The method introduces an energy compass whose precise form, parameters, and training are not detailed in the abstract, creating an implicit dependency on an unexamined coherence evaluator.

invented entities (1)

energy compass no independent evidence
purpose: Lightweight evaluator that scores candidate continuations at trajectory midpoints to select coherent paths during training
New component introduced to replace blind jumps; no independent evidence or external validation provided in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1274 out tokens · 48810 ms · 2026-05-11T03:05:40.799359+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
a lightweight energy compass—a scalar energy model trained on generation-aware negatives—to select the highest-quality one... Lenergy = L_NCE + λ_reg · L_reg + λ_order · L_order
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear
Navigation shaping selects among candidates via an energy compass, activating at t ≥ τ

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 6 internal anchors

[1]

and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , booktitle =

Austin, Jacob and Johnson, Daniel D. and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , booktitle =. Structured Denoising Diffusion Models in Discrete State-Spaces , url =

work page
[2]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Universal Guidance for Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[3]

Proceedings of the 41st International Conference on Machine Learning , pages =

Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

work page 2024
[4]

International Conference on Learning Representations (ICLR) , year=

Residual Energy-Based Models for Text Generation , author=. International Conference on Learning Representations (ICLR) , year=

work page
[5]

Diffusion Models Beat GANs on Image Synthesis , url =

Dhariwal, Prafulla and Nichol, Alexander , booktitle =. Diffusion Models Beat GANs on Image Synthesis , url =

work page
[6]

and Dieleman, Sander and Fergus, Rob and Sohl-Dickstein, Jascha and Doucet, Arnaud and Grathwohl, Will Sussman , booktitle =

Du, Yilun and Durkan, Conor and Strudel, Robin and Tenenbaum, Joshua B. and Dieleman, Sander and Fergus, Rob and Sohl-Dickstein, Jascha and Doucet, Arnaud and Grathwohl, Will Sussman , booktitle =. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and. 2023 , editor =

work page 2023
[7]

Compositional Visual Generation with Composable Diffusion Models

Liu, Nan and Li, Shuang and Du, Yilun and Torralba, Antonio and Tenenbaum, Joshua B. Compositional Visual Generation with Composable Diffusion Models. Computer Vision -- ECCV 2022. 2022

work page 2022
[8]

The Thirteenth International Conference on Learning Representations , year=

One Step Diffusion via Shortcut Models , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[11]

2015 , eprint=

Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

work page 2015
[12]

2026 , url=

Amin Karimi Monsefi and Nikhil Bhendawade and Manuel Rafael Ciosici and Dominic Culver and Yizhe Zhang and Irina Belousova , booktitle=. 2026 , url=

work page 2026
[13]

International Conference on Learning Representations , year=

A Distributional Approach to Controlled Text Generation , author=. International Conference on Learning Representations , year=

work page
[14]

Predicting Structured Data , publisher=

A Tutorial on Energy-Based Learning , author=. Predicting Structured Data , publisher=

work page
[15]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

On Distillation of Guided Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[16]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Large Language Diffusion Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[17]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

work page
[18]

International Conference on Learning Representations , year=

Progressive Distillation for Fast Sampling of Diffusion Models , author=. International Conference on Learning Representations , year=

work page
[19]

2021 , eprint=

How to Train Your Energy-Based Models , author=. 2021 , eprint=

work page 2021
[20]

Proceedings of the 40th International Conference on Machine Learning , pages =

Consistency Models , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

work page 2023
[21]

2025 , eprint=

Dream 7B: Diffusion Large Language Models , author=. 2025 , eprint=

work page 2025
[22]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

One-Step Diffusion with Distribution Matching Distillation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page
[25]

Beyond Autoregression: Fast

Justin Deschenaux and Caglar Gulcehre , booktitle=. Beyond Autoregression: Fast. 2025 , url=

work page 2025
[26]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Lou, Aaron and Meng, Chenlin and Ermon, Stefano , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024
[27]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Remasking Discrete Diffusion Models with Inference-Time Scaling , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[28]

Proceedings of the 42nd International Conference on Machine Learning , pages =

The Diffusion Duality , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

work page 2025
[29]

Second Conference on Language Modeling , year=

Unifying Autoregressive and Diffusion-Based Sequence Generation , author=. Second Conference on Language Modeling , year=

work page
[30]

2025 , eprint=

Dream-Coder 7B: An Open Diffusion Language Model for Code , author=. 2025 , eprint=

work page 2025
[31]

2025 , eprint=

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference , author=. 2025 , eprint=

work page 2025
[32]

The Thirteenth International Conference on Learning Representations , year=

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[33]

2026 , url=

Ling Yang and Ye Tian and Bowen Li and Xinchen Zhang and Ke Shen and Yunhai Tong and Mengdi Wang , booktitle=. 2026 , url=

work page 2026
[35]

International Conference on Learning Representations , year=

Pointer Sentinel Mixture Models , author=. International Conference on Learning Representations , year=

work page
[36]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[37]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

work page 2024
[38]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page
[39]

Advances in

Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid , year =. Advances in

work page
[40]

2026 , eprint=

T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization , author=. 2026 , eprint=

work page 2026
[41]

2024 , eprint=

EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling , author=. 2024 , eprint=

work page 2024
[42]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Optimizing Temperature for Language Models with Multi-Sample Inference , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

work page 2025
[45]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Subham Sekhar Sahoo, Aaron Gokaslan, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Justin T Chiu, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x

work page 2025
[46]

Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 17981--17993. Curran Associates, Inc., 2021. URL ht...

work page 2021
[47]

Universal guidance for diffusion models

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[48]

Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Internati...

work page 2024
[49]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[51]

Residual energy-based models for text generation

Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc'Aurelio Ranzato. Residual energy-based models for text generation. In International Conference on Learning Representations (ICLR), 2020

work page 2020
[52]

Beyond autoregression: Fast LLM s via self-distillation through time

Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLM s via self-distillation through time. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uZ5K4HeNwd

work page 2025
[53]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780--8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1...

work page 2021
[54]

Optimizing temperature for language models with multi-sample inference

Weihua Du, Yiming Yang, and Sean Welleck. Optimizing temperature for language models with multi-sample inference. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine ...

work page 2025
[55]

Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl

Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC . In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, ...

work page 2023
[56]

One step diffusion via shortcut models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS

work page 2025
[57]

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 133345--133385. Curran Associates, Inc., 2024. doi:10.52202...

work page doi:10.52202/079017-4239 2024
[58]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

D iffusion BERT : Improving generative masked language models with diffusion models

Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. D iffusion BERT : Improving generative masked language models with diffusion models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45...

work page doi:10.18653/v1/2023.acl-long.248 2023
[60]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015
[61]

A distributional approach to controlled text generation

Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jWkw45-9AbL

work page 2021
[62]

A tutorial on energy-based learning

Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006

work page 2006
[63]

Tenenbaum

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Shai Avidan, Gabriel Brostow, Moustapha Ciss \'e , Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision -- ECCV 2022, pages 423--439, Cham, 2022. Springer Nature Switzerland

work page 2022
[64]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

work page 2024
[65]

On distillation of guided diffusion models

Chenlin Meng et al. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[66]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe

work page 2017
[67]

FS - DFM : Fast and accurate long text generation with few-step diffusion language models

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, and Irina Belousova. FS - DFM : Fast and accurate long text generation with few-step diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ue1zFeD275

work page 2026
[68]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=KnqiC0znVF

work page 2026
[69]

Michael Polanyi.The Tacit Dimension

Guilherme Penedo, Hynek Kydl\' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing System...

work page doi:10.52202/079017-0970 2024
[70]

MAUVE : Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE : Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers . In Advances in Neural Information Processing Systems , volume 34, pages 4816--4828. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/pa...

work page 2021
[71]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019
[72]

Simple and effective masked diffusion language models

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages...

work page doi:10.52202/079017-4135 2024
[73]

The diffusion duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and Volodymyr Kuleshov. The diffusion duality. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proc...

work page 2025
[74]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI

work page 2022
[75]

Yang Song and Diederik P. Kingma. How to train your energy-based models, 2021. URL https://arxiv.org/abs/2101.03288

work page arXiv 2021
[76]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211--32252. PMLR, 23--29 J...

work page 2023
[77]

Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025

Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025. URL h...

work page arXiv 2025
[78]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[79]

Remasking discrete diffusion models with inference-time scaling

Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=IJryQAOy0p

work page 2026
[80]

18 An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al

Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream-coder 7b: An open diffusion language model for code, 2025. URL https://arxiv.org/abs/2509.01142

work page arXiv 2025
[81]

MM a DA : Multimodal large diffusion language models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MM a DA : Multimodal large diffusion language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=wczmXLuLGd

work page 2026
[82]

Diffusion of thought: Chain-of-thought reasoning in diffusion language models

Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, and Lingpeng Kong. Diffusion of thought: Chain-of-thought reasoning in diffusion language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, ...

work page doi:10.52202/079017-3343 2024
[83]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models, 2025. URL https://arxiv.org/abs/2508.15487

work page internal anchor Pith review arXiv 2025
[84]

One-step diffusion with distribution matching distillation

Tianwei Yin et al. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[85]

arXiv preprint arXiv:2403.14541

Shimao Zhang, Yu Bao, and Shujian Huang. Edt: Improving large language models' generation by entropy-based dynamic temperature sampling, 2024. URL https://arxiv.org/abs/2403.14541

work page arXiv 2024
[86]

Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, and Dimitris N. Metaxas. T3d: Few-step diffusion language models via trajectory self-distillation with direct discriminative optimization, 2026. URL https://arxiv.org/abs/2602.12262

work page arXiv 2026
[87]

Texygen: A benchmarking platform for text generation models

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, page 1097–1100, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356572. doi:1...

work page doi:10.1145/3209978.3210080 2018