Recognition: 2 theorem links
· Lean TheoremTrajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Pith reviewed 2026-05-11 03:05 UTC · model grok-4.3
The pith
Guided energy selection during trajectory construction lets an 8-step discrete flow student outperform its 1024-step teacher while running 128 times faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Discrete flow matching trajectories formed by blind jumps can be reshaped at training time by an energy compass that selects coherent midpoints, allowing the distilled few-step student to exceed the perplexity of the original multi-step teacher on language modeling tasks.
What carries the argument
Trajectory-Shaped Discrete Flow Matching (TS-DFM) with a lightweight energy compass that evaluates and selects high-coherence continuations at trajectory midpoints.
If this is right
- An 8-step discrete generator can deliver lower perplexity than a 1024-step teacher on language modeling while delivering 128 times faster inference.
- Trajectory quality improvements during training transfer to better performance across source distributions and evaluators of different scales.
- TS-DFM sets a new state-of-the-art perplexity among compared discrete generation methods, including those using six times more data or five times larger models.
Where Pith is reading between the lines
- The same midpoint navigation idea could be tested in continuous flow or diffusion models where trajectory quality also limits few-step performance.
- If the compass works because it rewards local coherence, it might generalize to other sequential generation tasks such as code or molecular sequences.
- Training cost rises modestly from the extra compass evaluations, but the resulting student models could enable deployment on resource-constrained devices.
Load-bearing premise
The energy compass can identify genuinely coherent continuations at midpoints without introducing biases or overfitting that would fail to generalize to new data or different model sizes.
What would settle it
If an 8-step TS-DFM student records higher perplexity than the 1024-step teacher on held-out text from the same distribution, the claim that shaped trajectories produce superior few-step performance would be refuted.
read the original abstract
Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Trajectory-Shaped Discrete Flow Matching (TS-DFM) to improve few-step discrete flow matching for text generation. It argues that poor-quality training trajectories (built from blind stochastic jumps) are the main bottleneck in distillation, not student capacity. TS-DFM introduces a lightweight energy compass that evaluates and selects coherent candidate continuations at midpoints during trajectory construction; shaping occurs only at training time. On 170M-parameter language models, the resulting 8-step student achieves 32% lower perplexity than the 1024-step teacher (128x faster) and outperforms other discrete-generation baselines, with gains consistent across source distributions and three evaluators.
Significance. If the central empirical claim holds after clarifying the energy compass, the result would be significant for efficient discrete generative modeling: it shows that trajectory quality can be improved at training time to yield students that outperform their multi-step teachers, offering a path to accelerate sampling in flow matching without changing inference cost. The consistency across evaluators and source distributions, if robust, strengthens the case that guided navigation addresses a fundamental limitation rather than a capacity issue.
major comments (3)
- [Abstract and §5] Abstract and §5 (Experiments): the central claim that the 8-step TS-DFM student achieves 32% lower perplexity than the 1024-step teacher depends on the energy compass producing higher-quality trajectories without embedding training-specific biases. However, the manuscript provides no explicit definition, feature set, or pseudocode for the compass (only the phrase 'lightweight energy compass'), nor any ablation showing that its selections are independent of the teacher logits or training distribution. This leaves open the possibility that the student is simply imitating an easier or biased path rather than learning a genuinely superior few-step mapping.
- [§3.2] §3.2 (Energy-Navigated Distillation): the assertion that 'all shaping is training-only' and inference cost is unchanged is load-bearing for the efficiency claim, yet the paper does not specify how the compass is computed (e.g., whether it uses teacher logits, external heuristics, or learned parameters) or demonstrate that it introduces no additional parameters or data leakage that could affect generalization to new evaluators.
- [§5] §5, Table 2 and Figure 3: while perplexity gains are reported across three evaluators of increasing scale, there are no statistical significance tests, variance estimates across random seeds, or controls for post-hoc choices in compass design. The reported consistency does not rule out overfitting to the training distribution unless the energy function is shown to be parameter-free and uncorrelated with the evaluator models.
minor comments (2)
- [§3] Notation for the energy function and candidate scoring is introduced without a clear equation or algorithm box, making reproduction difficult.
- [Abstract] The abstract claims 'gains consistent across source distributions' but the corresponding table or figure does not list the exact distributions or sample sizes used for each.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional clarity and analysis will strengthen the manuscript. We agree that the energy compass requires explicit documentation and that statistical robustness checks are warranted. We will revise the paper to incorporate these elements while preserving the core claims. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claim that the 8-step TS-DFM student achieves 32% lower perplexity than the 1024-step teacher depends on the energy compass producing higher-quality trajectories without embedding training-specific biases. However, the manuscript provides no explicit definition, feature set, or pseudocode for the compass (only the phrase 'lightweight energy compass'), nor any ablation showing that its selections are independent of the teacher logits or training distribution. This leaves open the possibility that the student is simply imitating an easier or biased path rather than learning a genuinely superior few-step mapping.
Authors: We acknowledge that the current manuscript does not supply an explicit definition, feature set, or pseudocode for the energy compass. In the revised version we will add a complete specification of the compass (including the coherence and entropy features it evaluates), the selection algorithm in pseudocode, and a dedicated ablation that compares guided versus unguided trajectories while holding teacher logits fixed. This will directly address whether the student learns a superior mapping or merely an easier path. revision: yes
-
Referee: [§3.2] §3.2 (Energy-Navigated Distillation): the assertion that 'all shaping is training-only' and inference cost is unchanged is load-bearing for the efficiency claim, yet the paper does not specify how the compass is computed (e.g., whether it uses teacher logits, external heuristics, or learned parameters) or demonstrate that it introduces no additional parameters or data leakage that could affect generalization to new evaluators.
Authors: The compass is a fixed, parameter-free heuristic that scores candidate continuations using only intrinsic sequence statistics (local token energy and short-range coherence) and does not access teacher logits, learned parameters, or training data beyond the current trajectory. All evaluations occur exclusively during training-time trajectory construction. We will expand §3.2 with the exact computation procedure and a short verification that no parameters are added and no information leaks to inference or downstream evaluators. revision: yes
-
Referee: [§5] §5, Table 2 and Figure 3: while perplexity gains are reported across three evaluators of increasing scale, there are no statistical significance tests, variance estimates across random seeds, or controls for post-hoc choices in compass design. The reported consistency does not rule out overfitting to the training distribution unless the energy function is shown to be parameter-free and uncorrelated with the evaluator models.
Authors: We agree that statistical significance tests, seed-wise variance, and explicit controls for design choices are missing and will add them in the revision (including p-values and standard deviations over multiple runs). The energy function is parameter-free by construction, depending solely on general sequence properties rather than model-specific logits; we will include a short analysis confirming its lack of correlation with the three evaluator families used. revision: yes
Circularity Check
No significant circularity; empirical claims rest on external comparisons
full rationale
The paper's core contribution is an empirical distillation procedure (TS-DFM) that replaces blind stochastic jumps in discrete flow matching trajectories with guided selection via a lightweight energy compass at training time only. All reported performance numbers (32% perplexity reduction at 8 steps vs. 1024-step teacher, consistency across distributions and evaluators, superiority to baselines trained on more data or larger models) are obtained from direct experimental comparisons against independent baselines and multiple external evaluators. No equations, fitted parameters, or self-citations are invoked to derive the performance gains by construction; the energy compass is presented as an auxiliary training heuristic whose effect is measured rather than assumed. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
-
energy compass
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel uncleara lightweight energy compass—a scalar energy model trained on generation-aware negatives—to select the highest-quality one... Lenergy = L_NCE + λ_reg · L_reg + λ_order · L_order
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclearNavigation shaping selects among candidates via an energy compass, activating at t ≥ τ
Reference graph
Works this paper leans on
-
[1]
and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , booktitle =
Austin, Jacob and Johnson, Daniel D. and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , booktitle =. Structured Denoising Diffusion Models in Discrete State-Spaces , url =
-
[2]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Universal Guidance for Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
-
[3]
Proceedings of the 41st International Conference on Machine Learning , pages =
Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =
work page 2024
-
[4]
International Conference on Learning Representations (ICLR) , year=
Residual Energy-Based Models for Text Generation , author=. International Conference on Learning Representations (ICLR) , year=
-
[5]
Diffusion Models Beat GANs on Image Synthesis , url =
Dhariwal, Prafulla and Nichol, Alexander , booktitle =. Diffusion Models Beat GANs on Image Synthesis , url =
-
[6]
Du, Yilun and Durkan, Conor and Strudel, Robin and Tenenbaum, Joshua B. and Dieleman, Sander and Fergus, Rob and Sohl-Dickstein, Jascha and Doucet, Arnaud and Grathwohl, Will Sussman , booktitle =. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and. 2023 , editor =
work page 2023
-
[7]
Compositional Visual Generation with Composable Diffusion Models
Liu, Nan and Li, Shuang and Du, Yilun and Torralba, Antonio and Tenenbaum, Joshua B. Compositional Visual Generation with Composable Diffusion Models. Computer Vision -- ECCV 2022. 2022
work page 2022
-
[8]
The Thirteenth International Conference on Learning Representations , year=
One Step Diffusion via Shortcut Models , author=. The Thirteenth International Conference on Learning Representations , year=
-
[11]
Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=
work page 2015
-
[12]
Amin Karimi Monsefi and Nikhil Bhendawade and Manuel Rafael Ciosici and Dominic Culver and Yizhe Zhang and Irina Belousova , booktitle=. 2026 , url=
work page 2026
-
[13]
International Conference on Learning Representations , year=
A Distributional Approach to Controlled Text Generation , author=. International Conference on Learning Representations , year=
-
[14]
Predicting Structured Data , publisher=
A Tutorial on Energy-Based Learning , author=. Predicting Structured Data , publisher=
-
[15]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
On Distillation of Guided Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
-
[16]
The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
Large Language Diffusion Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
-
[17]
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =
Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =
-
[18]
International Conference on Learning Representations , year=
Progressive Distillation for Fast Sampling of Diffusion Models , author=. International Conference on Learning Representations , year=
- [19]
-
[20]
Proceedings of the 40th International Conference on Machine Learning , pages =
Consistency Models , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =
work page 2023
- [21]
-
[22]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
One-Step Diffusion with Distribution Matching Distillation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
-
[25]
Justin Deschenaux and Caglar Gulcehre , booktitle=. Beyond Autoregression: Fast. 2025 , url=
work page 2025
-
[26]
Proceedings of the 41st International Conference on Machine Learning , articleno =
Lou, Aaron and Meng, Chenlin and Ermon, Stefano , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =
work page 2024
-
[27]
The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
Remasking Discrete Diffusion Models with Inference-Time Scaling , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
-
[28]
Proceedings of the 42nd International Conference on Machine Learning , pages =
The Diffusion Duality , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =
work page 2025
-
[29]
Second Conference on Language Modeling , year=
Unifying Autoregressive and Diffusion-Based Sequence Generation , author=. Second Conference on Language Modeling , year=
-
[30]
Dream-Coder 7B: An Open Diffusion Language Model for Code , author=. 2025 , eprint=
work page 2025
-
[31]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference , author=. 2025 , eprint=
work page 2025
-
[32]
The Thirteenth International Conference on Learning Representations , year=
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models , author=. The Thirteenth International Conference on Learning Representations , year=
-
[33]
Ling Yang and Ye Tian and Bowen Li and Xinchen Zhang and Ke Shen and Yunhai Tong and Mengdi Wang , booktitle=. 2026 , url=
work page 2026
-
[35]
International Conference on Learning Representations , year=
Pointer Sentinel Mixture Models , author=. International Conference on Learning Representations , year=
-
[36]
Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=
work page 2023
- [37]
-
[38]
Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
-
[39]
Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid , year =. Advances in
-
[40]
T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization , author=. 2026 , eprint=
work page 2026
-
[41]
EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling , author=. 2024 , eprint=
work page 2024
-
[42]
Proceedings of the 42nd International Conference on Machine Learning , pages =
Optimizing Temperature for Language Models with Multi-Sample Inference , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =
work page 2025
-
[45]
Block diffusion: Interpolating between autoregressive and diffusion language models
Marianne Arriola, Subham Sekhar Sahoo, Aaron Gokaslan, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Justin T Chiu, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x
work page 2025
-
[46]
Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 17981--17993. Curran Associates, Inc., 2021. URL ht...
work page 2021
-
[47]
Universal guidance for diffusion models
Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[48]
Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Internati...
work page 2024
-
[49]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[50]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[51]
Residual energy-based models for text generation
Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc'Aurelio Ranzato. Residual energy-based models for text generation. In International Conference on Learning Representations (ICLR), 2020
work page 2020
-
[52]
Beyond autoregression: Fast LLM s via self-distillation through time
Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLM s via self-distillation through time. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=uZ5K4HeNwd
work page 2025
-
[53]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780--8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1...
work page 2021
-
[54]
Optimizing temperature for language models with multi-sample inference
Weihua Du, Yiming Yang, and Sean Welleck. Optimizing temperature for language models with multi-sample inference. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine ...
work page 2025
-
[55]
Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC . In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, ...
work page 2023
-
[56]
One step diffusion via shortcut models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS
work page 2025
-
[57]
Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 133345--133385. Curran Associates, Inc., 2024. doi:10.52202...
-
[58]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
D iffusion BERT : Improving generative masked language models with diffusion models
Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. D iffusion BERT : Improving generative masked language models with diffusion models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45...
-
[60]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[61]
A distributional approach to controlled text generation
Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jWkw45-9AbL
work page 2021
-
[62]
A tutorial on energy-based learning
Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006
work page 2006
-
[63]
Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Shai Avidan, Gabriel Brostow, Moustapha Ciss \'e , Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision -- ECCV 2022, pages 423--439, Cham, 2022. Springer Nature Switzerland
work page 2022
-
[64]
Discrete diffusion modeling by estimating the ratios of the data distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024
work page 2024
-
[65]
On distillation of guided diffusion models
Chenlin Meng et al. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[66]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe
work page 2017
-
[67]
FS - DFM : Fast and accurate long text generation with few-step diffusion language models
Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, and Irina Belousova. FS - DFM : Fast and accurate long text generation with few-step diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ue1zFeD275
work page 2026
-
[68]
Large language diffusion models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=KnqiC0znVF
work page 2026
-
[69]
Michael Polanyi.The Tacit Dimension
Guilherme Penedo, Hynek Kydl\' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing System...
-
[70]
MAUVE : Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers
Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE : Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers . In Advances in Neural Information Processing Systems , volume 34, pages 4816--4828. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/pa...
work page 2021
-
[71]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019
work page 2019
-
[72]
Simple and effective masked diffusion language models
Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages...
-
[73]
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and Volodymyr Kuleshov. The diffusion duality. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proc...
work page 2025
-
[74]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI
work page 2022
- [75]
-
[76]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211--32252. PMLR, 23--29 J...
work page 2023
-
[77]
Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025. URL h...
-
[78]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[79]
Remasking discrete diffusion models with inference-time scaling
Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=IJryQAOy0p
work page 2026
-
[80]
Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream-coder 7b: An open diffusion language model for code, 2025. URL https://arxiv.org/abs/2509.01142
-
[81]
MM a DA : Multimodal large diffusion language models
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MM a DA : Multimodal large diffusion language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=wczmXLuLGd
work page 2026
-
[82]
Diffusion of thought: Chain-of-thought reasoning in diffusion language models
Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, and Lingpeng Kong. Diffusion of thought: Chain-of-thought reasoning in diffusion language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, ...
-
[83]
Dream 7B: Diffusion Large Language Models
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models, 2025. URL https://arxiv.org/abs/2508.15487
work page internal anchor Pith review arXiv 2025
-
[84]
One-step diffusion with distribution matching distillation
Tianwei Yin et al. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[85]
arXiv preprint arXiv:2403.14541
Shimao Zhang, Yu Bao, and Shujian Huang. Edt: Improving large language models' generation by entropy-based dynamic temperature sampling, 2024. URL https://arxiv.org/abs/2403.14541
-
[86]
Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, and Dimitris N. Metaxas. T3d: Few-step diffusion language models via trajectory self-distillation with direct discriminative optimization, 2026. URL https://arxiv.org/abs/2602.12262
-
[87]
Texygen: A benchmarking platform for text generation models
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, page 1097–1100, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356572. doi:1...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.