Recognition: 2 theorem links · Lean Theorem
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Pith reviewed 2026-05-15 20:58 UTC · model grok-4.3
The pith
Continuous flows over one-hot token embeddings match discrete diffusion quality and enable one-step generation that exceeds eight-step baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Language models based on continuous flows over one-hot token embeddings can match state-of-the-art discrete diffusion baselines on LM1B and OWT while defining a unique flow map that supports direct one-step inference. The flow and its associated flow map are learned with simple cross-entropy objectives that respect simplex geometry. Distillation of the flow language model into a flow map language model yields one-step generation that exceeds the eight-step quality of recent few-step discrete diffusion language models.
What carries the argument
The flow map induced by the continuous flow over one-hot token embeddings, which is learned directly to enable efficient few-step sampling without requiring discrete constraints.
If this is right
- The flow language model matches state-of-the-art discrete diffusion baselines on LM1B and OWT.
- The distilled flow map language model achieves one-step generation that exceeds the quality of eight-step outputs from recent discrete diffusion models.
- The continuous formulation challenges the hypothesis that discrete noising processes are required for generative modeling over discrete modalities.
- Simple cross-entropy objectives suffice for learning both the flow and the flow map while respecting simplex geometry.
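The cross-entropy training idea in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear interpolant `x_t = (1 - t) * noise + t * one_hot`, the time convention (noise at `t = 0`, data at `t = 1`), the function names, and the stand-in `toy_denoiser` are all assumptions made for the example. Only the ingredients (Gaussian noise, one-hot targets, a cross-entropy loss on a simplex-valued output) come from the text.

```python
import numpy as np

def one_hot(tokens, vocab_size):
    """Map integer tokens of shape (L,) to one-hot rows of shape (L, V)."""
    return np.eye(vocab_size)[tokens]

def interpolate(x1, t, rng):
    """Assumed linear interpolant between Gaussian noise (t=0) and data (t=1)."""
    x0 = rng.standard_normal(x1.shape)
    return (1.0 - t) * x0 + t * x1

def softmax(logits):
    """Row-wise softmax, so every output row lies on the probability simplex."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, tokens):
    """Mean negative log-probability assigned to the true tokens."""
    picked = probs[np.arange(len(tokens)), tokens]
    return float(-np.log(picked + 1e-12).mean())

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained network: treats the noisy point as logits."""
    return softmax(x_t / max(1.0 - t, 1e-3))

rng = np.random.default_rng(0)
tokens = np.array([2, 0, 3, 1])          # a made-up 4-token sequence
x1 = one_hot(tokens, vocab_size=4)

x_t = interpolate(x1, t=0.9, rng=rng)    # noisy point along the path
probs = toy_denoiser(x_t, t=0.9)         # simplex-valued prediction of the clean tokens
loss = cross_entropy(probs, tokens)      # the quantity a real model would minimise
```

In a real system `toy_denoiser` would be replaced by a sequence model, and the loss would be averaged over sampled times `t` and training sequences.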
Where Pith is reading between the lines
- The method could be tested on other discrete sequence tasks such as code or protein generation to check whether the one-step advantage holds beyond natural language.
- If the flow map preserves token structure reliably at scale, it may reduce the need for multi-step sampling schedules in production language systems.
- The absence of post-hoc discrete corrections in the continuous approach suggests potential simplifications for combining language models with continuous control or editing operations.
Load-bearing premise
A continuous flow defined over one-hot token embeddings can be learned such that the associated flow map preserves discrete token structure and yields high-quality samples without additional discrete constraints or post-hoc corrections.
What would settle it
An experiment in which one-step samples from the distilled flow map language model receive lower quality scores than eight-step samples from the compared discrete diffusion models on the same LM1B or OWT evaluation metrics.
Original abstract
Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Flow Language Models (FLM): continuous flows over one-hot token embeddings, trained with cross-entropy objectives that respect simplex geometry. It then compares three choices of flow-map distillation to obtain Flow Map Language Models (FMLM) capable of one-step generation. The claims are that FLM matches state-of-the-art discrete diffusion baselines on LM1B and OpenWebText, and that one-step FMLM generation exceeds the quality of recent 8-step discrete diffusion models.
Significance. If the central empirical claims hold, the work would be significant for demonstrating that continuous flows can model discrete modalities without discrete noising processes or post-hoc corrections, enabling faster non-autoregressive generation. The availability of code at https://github.com/david3684/flm is a clear strength that supports reproducibility and allows direct verification of the flow-map construction and training objectives.
major comments (3)
- [Abstract and §3] Abstract and §3 (Flow Map Distillation): the claim that the distilled flow map produces high-quality discrete tokens 'without requiring additional discrete constraints or post-hoc corrections' is load-bearing for the one-step superiority result; the manuscript must explicitly state and demonstrate whether one-step outputs are exactly one-hot vectors or require argmax/projection, and provide analysis showing that trajectories concentrate near simplex vertices rather than merely matching marginals.
- [Experimental Results] Experimental section (results on LM1B and OWT): the reported matching of SOTA and outperformance over 8-step discrete baselines requires full tables with exact metrics (perplexity, MAUVE, or equivalent), all baselines with step counts, error bars or multiple seeds, and ablation on the three distillation choices; without these details the central empirical claim has only moderate support.
- [§4] §4 (training objectives): the cross-entropy loss on the continuous simplex formulation is presented as directly learning the flow map, but the manuscript should clarify how this objective guarantees preservation of discrete token structure in the one-step map versus merely approximating the data distribution in aggregate.
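To make the third comment concrete, the per-token cross-entropy against a one-hot target can be written out. The notation here ($p$ for the model output on the probability simplex, $e_k$ for the one-hot target, $V$ for the vocabulary size, $q$ for a distribution over targets) is introduced for illustration and is not taken from the paper.

```latex
\mathcal{L}(p, e_k) = -\sum_{i=1}^{V} (e_k)_i \log p_i = -\log p_k \;\ge\; 0,
\qquad
\mathcal{L}(p, e_k) = 0 \iff p_k = 1 \iff p = e_k .
```

For a single target the loss vanishes only at the vertex $e_k$. Averaged over targets $k \sim q$ that share the same input, the minimizer of $\mathbb{E}_{k \sim q}[-\log p_k]$ is $p = q$ (Gibbs' inequality), which sits at a vertex only when $q$ is itself one-hot. That gap between the per-sample and the aggregate statement is arguably what the comment asks the authors to address.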
minor comments (2)
- [§2] Notation in the flow definition: the relationship between the continuous flow ODE and the learned flow map should be stated with an explicit equation reference to avoid ambiguity in how the map is obtained from the flow.
- [Figure 1] Figure 1 and trajectory visualizations: captions should explicitly note whether plotted points are raw flow outputs or post-processed, and include scale for simplex concentration.
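For reference, the standard relationship between a flow ODE and its flow map, which the first minor comment asks the authors to pin to an explicit equation, can be stated as follows. The symbols $b_t$ (velocity field) and $X_{s,t}$ (flow map), and the convention that noise sits at $t = 0$ and data at $t = 1$, are generic choices from the flow-map literature and may differ from the paper's own notation.

```latex
\begin{aligned}
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= b_t(x_t), &
X_{s,t}(x_s) &= x_t \quad (0 \le s \le t \le 1), \\
X_{t,t}(x) &= x, &
X_{t,u} \circ X_{s,t} &= X_{s,u} \quad (s \le t \le u), \\
\partial_t X_{s,t}(x) &= b_t\big(X_{s,t}(x)\big).
\end{aligned}
```

Under this convention, one-step generation corresponds to evaluating $X_{0,1}$ on a noise sample $x_0$, and the composition (semigroup) identity on the second line is the kind of property that flow-map distillation objectives typically enforce.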
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment point-by-point below, providing clarifications and indicating where revisions have been made to the manuscript to strengthen the presentation and support for our claims.
- Referee: [Abstract and §3] Abstract and §3 (Flow Map Distillation): the claim that the distilled flow map produces high-quality discrete tokens 'without requiring additional discrete constraints or post-hoc corrections' is load-bearing for the one-step superiority result; the manuscript must explicitly state and demonstrate whether one-step outputs are exactly one-hot vectors or require argmax/projection, and provide analysis showing that trajectories concentrate near simplex vertices rather than merely matching marginals.
  Authors: We agree that explicit clarification is needed on this point. In the revised manuscript, we now state in the abstract and §3 that the flow map outputs a continuous vector on the simplex, with discrete tokens recovered via argmax (a deterministic, geometry-respecting operation with no learned parameters). We argue this does not qualify as an 'additional discrete constraint or post-hoc correction' because it involves no extra denoising steps, masking, or distribution adjustments beyond the learned map itself, unlike the post-processing often required in discrete diffusion. We have added quantitative analysis (new Figure 3 and accompanying text) measuring trajectory concentration via average L1 distance to the nearest vertex and entropy of the output distribution, demonstrating that one-step outputs concentrate near vertices far more than marginal matching alone would imply. These additions directly support the one-step superiority claim.
  Revision: yes
- Referee: [Experimental Results] Experimental section (results on LM1B and OWT): the reported matching of SOTA and outperformance over 8-step discrete baselines requires full tables with exact metrics (perplexity, MAUVE, or equivalent), all baselines with step counts, error bars or multiple seeds, and ablation on the three distillation choices; without these details the central empirical claim has only moderate support.
  Authors: We acknowledge that the original experimental reporting could be more comprehensive. The revised manuscript now includes expanded tables (new Tables 2 and 3) reporting exact perplexity and MAUVE scores for FLM, FMLM, and all baselines, with explicit step counts listed for each method. Results are reported with standard deviations from three independent random seeds. We have also added a dedicated ablation subsection comparing the three distillation variants (with full metrics on both LM1B and OWT), including controls for training compute. These revisions provide the requested details and strengthen empirical support for the central claims.
  Revision: yes
- Referee: [§4] §4 (training objectives): the cross-entropy loss on the continuous simplex formulation is presented as directly learning the flow map, but the manuscript should clarify how this objective guarantees preservation of discrete token structure in the one-step map versus merely approximating the data distribution in aggregate.
  Authors: We thank the referee for this observation on the objective's properties. In the revised §4, we have expanded the explanation to show that the cross-entropy loss is computed directly between the flow map's continuous output and the one-hot target vectors. Because the loss is minimized only when the output approaches a vertex (due to the geometry of the simplex and the properties of cross-entropy), the optimization inherently favors discrete structure in the one-step map rather than just aggregate marginal matching. We include a short derivation demonstrating that, under the flow map's Lipschitz continuity assumption, convergence of the loss implies concentration around vertices. This distinguishes our approach from methods that only match distributions without vertex-seeking pressure.
  Revision: yes
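The two concentration diagnostics named in the first response (average L1 distance to the nearest vertex, and entropy of the output distribution) are straightforward to compute. The sketch below is one way to do so under stated assumptions: outputs are rows of a NumPy array that already lie on the probability simplex, the example rows are made up, and the function names are hypothetical rather than taken from the released code.

```python
import numpy as np

def decode_argmax(probs):
    """Recover discrete tokens from simplex-valued outputs of shape (L, V)."""
    return probs.argmax(axis=-1)

def l1_to_nearest_vertex(probs):
    """L1 distance from each row to its closest one-hot vertex.

    For a point p on the simplex the nearest vertex is e_k with k = argmax(p),
    and ||p - e_k||_1 = (1 - p_k) + sum_{i != k} p_i = 2 * (1 - max(p)).
    """
    return 2.0 * (1.0 - probs.max(axis=-1))

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row: 0 at a vertex, log(V) at uniform."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

# Made-up example rows, not model outputs.
sharp = np.array([[0.97, 0.01, 0.01, 0.01]])   # close to a vertex
flat = np.full((1, 4), 0.25)                   # centre of the simplex

tokens = decode_argmax(np.vstack([sharp, flat]))
d_sharp, d_flat = l1_to_nearest_vertex(sharp)[0], l1_to_nearest_vertex(flat)[0]
h_sharp, h_flat = entropy(sharp)[0], entropy(flat)[0]
```

Both quantities are zero exactly at a one-hot vector, so reporting their distribution over generated positions would show how close raw one-step outputs sit to vertices before argmax is applied.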
Circularity Check
No significant circularity; continuous flow and map learned independently via cross-entropy on simplex
The paper defines FLM as a continuous flow over one-hot embeddings trained directly with cross-entropy objectives respecting simplex geometry, then distills it to FMLM via one of three explicit map choices. These steps are presented as independent of discrete diffusion baselines; performance matching on LM1B/OWT and superiority in one-step regime are empirical outcomes, not reductions by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claim that a unique flow map exists and can be learned directly is supported by the continuous formulation itself rather than imported from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Continuous flows over one-hot token embeddings respect the simplex geometry sufficiently for generative modeling of discrete tokens.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We leverage a simple continuous interpolation between Gaussian noise and a one-hot encoding of language data... the two-time denoiser δ_{s,t} ... KL-based semigroup objective"
- IndisputableMonolith/Foundation/DimensionForcing.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "continuous flows over one-hot token embeddings define a unique flow map"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Sampling from Flow Language Models via Marginal-Conditioned Bridges
  Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and r...
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
  LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
- ELF: Embedded Language Flows
  ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
- How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
  Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.
- Coupling Models for One-Step Discrete Generation
  Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. (page 1)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Gemini: A Family of Highly Capable Multimodal Models
Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. (page 1)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. (page 1)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, et al. Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 1, 2025. (pages 1 and 2)
-
[5]
Google DeepMind. Gemini diffusion. https://deepmind.google/models/gemini-diffusion/, 2025. Accessed: 2026-01-25. (page 1) 19
work page 2025
-
[6]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025. (page 1)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast llms via self-distillation through time.arXiv preprint arXiv:2410.21035, 2024. (pages 1, 3, 14, 26, and 45)
-
[8]
Sander Dieleman. Diffusion language models. https://benanne.github.io/2023/01/09/diffusion -language.html, 2023. Accessed: 2026-01-25. (page 2)
work page 2023
-
[9]
Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024. (pages 2, 12, and 26)
-
[10]
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025. (pages 2, 4, 44, 46, and 55)
work page internal anchor Pith review arXiv 2025
-
[11]
Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, et al. Parallelbench: Understanding the trade-offs of parallel decoding in diffusion llms.arXiv preprint arXiv:2510.04767, 2025. (pages 2, 4, and 26)
-
[12]
Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343,
-
[13]
(pages 2, 4, and 26)
-
[14]
Continuous diffusion for categorical data.arXiv preprint arXiv:2211.15089, 2022
Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data.arXiv preprint arXiv:2211.15089, 2022. (pages 2, 4, 5, 10, 16, 17, 26, and 40)
-
[15]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. (pages 2 and 4)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[16]
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.arXiv preprint arXiv:2303.08797, 2023. (pages 2, 4, 5, and 34)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020. (page 2)
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[18]
Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation.arXiv preprint arXiv:2505.18825, 2025. (pages 2, 6, 7, 26, 27, 28, 41, and 42)
-
[19]
arXiv preprint arXiv:2406.075072(3), 9 (2024)
Nicholas M. Boffi, Michael S. Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models.arXiv:2406.07507, 2025. (pages 2, 6, 7, 26, 27, 28, 31, and 41)
-
[20]
The diffusion duality.arXiv preprint arXiv:2506.10892, 2025
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality.arXiv preprint arXiv:2506.10892, 2025. (pages 2, 10, 11, 13, 14, 26, 43, 44, 45, 46, and 56)
-
[21]
Candi: Hybrid discrete-continuous diffusion models
Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025. (pages 2, 10, 13, 16, 17, 26, and 43)
-
[22]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025. (page 2) 20
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Attractor dynamics and parallelism in a connectionist sequential machine
Michael I Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 8, 1986. (page 3)
work page 1986
-
[24]
Finding structure in time.Cognitive science, 14(2):179–211, 1990
Jeffrey L Elman. Finding structure in time.Cognitive science, 14(2):179–211, 1990. (page 3)
work page 1990
-
[25]
A neural probabilistic language model.Journal of machine learning research, 3(Feb):1137–1155, 2003
Yoshua Bengio, R´ ejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.Journal of machine learning research, 3(Feb):1137–1155, 2003. (page 3)
work page 2003
-
[26]
Non-Autoregressive Neural Machine Translation
Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation.arXiv preprint arXiv:1711.02281, 2017. (pages 3 and 26)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models.Advances in Neural Information Processing Systems, 31, 2018. (page 3)
work page 2018
-
[28]
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021. (pages 3, 6, and 26)
work page 2021
-
[29]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023. (page 3)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024. (pages 3, 11, 13, 26, 43, and 44)
work page 2024
-
[31]
Simple guidance mechanisms for discrete diffusion models.arXiv preprint arXiv:2412.10193, 2024
Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models.arXiv preprint arXiv:2412.10193, 2024. (pages 3, 16, and 26)
-
[32]
Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385,
Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385,
-
[33]
(pages 3, 6, and 26)
-
[34]
Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models.Advances in Neural Information Processing Systems, 35:28266–28279, 2022. (pages 4, 6, 8, and 26)
work page 2022
-
[35]
Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models.Advances in Neural Information Processing Systems, 36:16693–16715, 2023. (pages 4 and 26)
work page 2023
-
[36]
Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022
Robin Strudel, Corentin Tallec, Florent Altch´ e, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022. (pages 4 and 26)
-
[37]
Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation.Advances in Neural Information Processing Systems, 36:56998–57025, 2023. (pages 4 and 26)
work page 2023
-
[38]
Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202, 2022. (pages 4 and 26)
-
[39]
Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596,
-
[40]
(pages 4, 16, and 26) 21
-
[41]
Tess: Text-to-text self-conditioned simplex diffusion
Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2347–2361, 2024. (pages 4, 16, and 26)
work page 2024
-
[42]
Back to Basics: Let Denoising Generative Models Denoise
Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025. (pages 4 and 5)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Diffusion Transformers with Representation Autoencoders
Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. (page 4)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Floor Eijkelboom, Grigory Bartosh, Christian Andersson Naesseth, Max Welling, and Jan-Willem van de Meent. Variational flow matching for graph generation.Advances in Neural Information Processing Systems, 37:11735–11764, 2024. (page 5)
work page 2024
-
[45]
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025. (pages 6, 7, 17, 18, 26, and 27)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Improved Mean Flows: On the Challenges of Fastforward Generative Models
Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models.arXiv preprint arXiv:2512.02012, 2025. (pages 6, 7, 26, and 28)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Terminal velocity matching.arXiv preprint arXiv:2511.19797, 2025
Linqi Zhou, Mathias Parger, Ayaan Haque, and Jiaming Song. Terminal velocity matching.arXiv preprint arXiv:2511.19797, 2025. (pages 7, 26, 27, and 28)
-
[48]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. (pages 7, 26, and 27)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
One Step Diffusion via Shortcut Models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024. (pages 7, 10, 26, 27, 28, and 43)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Entropic time schedulers for generative diffusion models.arXiv preprint arXiv:2504.13612, 2025
Dejan Stancevic, Florian Handke, and Luca Ambrogioni. Entropic time schedulers for generative diffusion models.arXiv preprint arXiv:2504.13612, 2025. (pages 10 and 40)
-
[51]
Tero Karras, Miika Aittala, Tuomas Kynk¨ a¨ anniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37:52996–53021, 2024. (pages 11 and 16)
work page 2024
-
[52]
Jerry Huang, Justin Lin, Sheel Shah, Kartik Nair, and Nicholas M. Boffi. How to guide your flow: Steering flow maps for rapid test-time alignment, 2025. Forthcoming. (pages 11, 16, and 17)
work page 2025
-
[53]
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling.arXiv preprint arXiv:1312.3005, 2013. (page 11)
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[54]
Openwebtext corpus.http://Skylion007.github.io/OpenWebTe xtCorpus, 2019
Aaron Gokaslan and Vanya Cohen. Openwebtext corpus.http://Skylion007.github.io/OpenWebTe xtCorpus, 2019. (page 11)
work page 2019
-
[55]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. (page 11)
work page 2023
-
[56]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. (page 11)
work page 2024
-
[57]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. (pages 12 and 43) 22
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[58]
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models
Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models. arXiv preprint arXiv:2512.02636, 2025. (page 12)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019. (pages 12 and 45)
work page 2019
-
[60]
Continuous diffusion model for language modeling.arXiv preprint arXiv:2502.11564, 2025
Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling.arXiv preprint arXiv:2502.11564, 2025. (pages 13, 16, 17, 26, and 43)
-
[61]
Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. Distillation of discrete diffusion through dimensional correlations.arXiv preprint arXiv:2410.08709, 2024. (pages 14, 26, and 44)
-
[62]
Texygen: A benchmarking platform for text generation models
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. InThe 41st international ACM SIGIR conference on research & development in information retrieval, pages 1097–1100, 2018. (pages 15 and 44)
work page 2018
-
[63]
Character-level convolutional networks for text classification
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015. (pages 16, 46, and 58)
-
[64]
Neural network acceptability judgments
Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. (page 16)
-
[65]
Learning word vectors for sentiment analysis
Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011. (page 16)
-
[66]
TweetEval: Unified benchmark and comparative evaluation for tweet classification
Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650, Online, November 2020. Association for Computational Lin...
-
[67]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. (page 16)
-
[68]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. (page 16)
-
[69]
Tess 2: A large-scale generalist diffusion language model
Jaesung Tae, Hamish Ivison, Sachin Kumar, and Arman Cohan. Tess 2: A large-scale generalist diffusion language model. arXiv preprint arXiv:2502.13917, 2025. (pages 16, 17, and 26)
-
[70]
Mariia Drozdova. Can continuous-time diffusion models generate and solve globally constrained discrete problems? A study on sudoku. arXiv preprint arXiv:2601.20363, 2026. (page 18)
-
[71]
Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do large language models latently perform multi-hop reasoning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10210–10229, 2024. (page 18)
-
[72]
Redi: Rectified discrete flow
Jaehoon Yoo, Wonjung Kim, and Seunghoon Hong. Redi: Rectified discrete flow. arXiv preprint arXiv:2507.15897, 2025. (page 26)
-
[73]
Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. Continuously augmented discrete diffusion model for categorical generative modeling. arXiv preprint arXiv:2510.01329, 2025. (page 26)
-
[74]
Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022. (page 26)
-
[75]
Categorical flow matching on statistical manifolds
Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. Advances in Neural Information Processing Systems, 37:54787–54819, 2024. (page 26)
-
[76]
Oscar Davis, Samuel Kessler, Mircea Petrache, İsmail İ Ceylan, Michael Bronstein, and Avishek J Bose. Fisher flow matching for generative modeling over discrete data. Advances in Neural Information Processing Systems, 37:139054–139084, 2024. (page 26)
-
[77]
Bernardo Williams, Victor M Yeom-Song, Marcelo Hartmann, and Arto Klami. Simplex-to-euclidean bijections for categorical flow matching. arXiv preprint arXiv:2510.27480, 2025. (page 26)
-
[78]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023. (pages 26 and 27)
-
[79]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. (page 26)
-
[80]
Training Agents Inside of Scalable World Models
Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025. (page 26)