pith · machine review for the scientific record

arXiv:2605.14531 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: 3 Lean theorem links

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords: language generation · stochastic optimal control · Hamilton-Jacobi-Bellman equation · flow matching · diffusion models · latent control space · parallel sampling · Manta-LM

The pith

Reformulating language generation as stochastic optimal control lets flow matching approximate the HJB equation in latent space, yielding a model that pairs high-fidelity generation with cheap, parallel sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper casts language generation as a stochastic optimal control problem to unify the analysis of autoregressive and diffusion models, attributing their shared limitations to trajectory singularity, adjoint state vanishing, and gradient absence. Rather than solving the resulting Hamilton-Jacobi-Bellman (HJB) equation directly, it approximates its solution with flow matching inside a rectified latent control space, where the Manta-LM model and its Global Integral Operator estimate the global vector field. A reader would care because the approach promises to break the efficiency-fidelity tradeoff while keeping sampling cheap and parallel.
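For orientation, these are the standard objects behind the control framing: a schematic, textbook-form sketch following Fleming and Rishel (ref. 13 below); the paper's own state space, cost, and noise model may differ.

```latex
% Latent state z_t steered by a control u_t under Brownian noise W_t:
%   dz_t = u_t dt + \sigma dW_t.
% Value function: best achievable cost-to-go from state z at time t,
%   V(z,t) = \min_u \mathbb{E}\bigl[\textstyle\int_t^T \ell(z_s,u_s)\,ds + \Phi(z_T)\,\big|\,z_t = z\bigr].
% Dynamic programming yields the Hamilton-Jacobi-Bellman PDE:
-\partial_t V = \min_{u}\bigl\{\,\ell(z,u) + u^{\top}\nabla_z V\,\bigr\}
              + \tfrac{\sigma^2}{2}\,\Delta_z V,
\qquad V(z,T) = \Phi(z).
```

For a quadratic control cost ℓ(z,u) = ½‖u‖², the minimizer is u*(z,t) = -∇_z V(z,t): the optimal policy is a feedback (closed-loop) controller that reads the current state, which is exactly the property the paper wants from a generator.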

Core claim

By reformulating language generation as a stochastic optimal control problem, and by using Flow Matching as the optimal trajectory solver within the rectified latent control space to approximate the solution of the Hamilton-Jacobi-Bellman equation, Manta-LM with its Global Integral Operator approximates the global vector field. The result, the paper claims, is a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling.

What carries the argument

Flow Matching acting as the optimal trajectory solver inside the rectified latent control space, with the Global Integral Operator in Manta-LM used to approximate the global vector field for the closed-loop controller.
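Concretely, "flow matching" here means the standard conditional flow-matching / rectified-flow objective (Lipman et al., ref. 27 below). A minimal training-step sketch; the latent shapes and the name v_theta are illustrative assumptions, not the paper's architecture, and the Global Integral Operator (however it is defined) would live inside the network:

```python
import torch

def flow_matching_loss(v_theta, z_data):
    """One step of the standard conditional flow-matching objective.

    v_theta : network predicting a velocity field v(z_t, t)
    z_data  : batch of latent sequences, shape (B, L, D),
              e.g. VAE encodings of token sequences
    """
    z0 = torch.randn_like(z_data)          # noise endpoint of the path
    t = torch.rand(z_data.size(0), 1, 1)   # per-example time in [0, 1]
    zt = (1 - t) * z0 + t * z_data         # straight-line interpolant
    target = z_data - z0                   # its (constant) velocity
    pred = v_theta(zt, t.view(-1))         # model's velocity estimate
    return ((pred - target) ** 2).mean()   # regress onto the velocity
```

The straight-line interpolant is the "rectified" part at the path level; the paper's rectified latent control space additionally refers to the VAE geometry the trajectories live in (see Figure 5).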

If this is right

  • The Efficiency-Fidelity Paradox, Irreversibility Error Propagation, and Optimization Tractability issues are bypassed.
  • High-fidelity text generation becomes compatible with efficient, low-cost parallel sampling (see the sampling sketch after this list).
  • Improved stability, efficiency, and controllability appear on language modeling and conditional generation tasks.
  • Direct solution of the HJB PDE is avoided while still obtaining an optimal policy that acts as a closed-loop controller.
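Mechanically, the promised parallel sampling is few-step ODE integration of the learned field: every latent position updates simultaneously, so inference cost scales with the step count rather than the sequence length. A minimal sketch under the same illustrative assumptions as above; the paper's Global Integral Operator presumably enables the large steps, but its form is not specified in the material quoted here:

```python
import torch

@torch.no_grad()
def sample(v_theta, batch, seq_len, dim, steps=8):
    """Few-step Euler integration of the learned velocity field."""
    z = torch.randn(batch, seq_len, dim)   # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch,), i * dt)
        z = z + dt * v_theta(z, t)         # all positions updated in parallel
    return z                               # decode with the VAE afterwards
```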

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same latent-control reformulation could be tested on non-text sequence tasks such as code or protein generation.
  • The closed-loop controller structure may support real-time steering of generation outputs without retraining.
  • If the global vector field approximation holds, similar flow-matching solvers could be tried on other intractable optimal-control problems in machine learning.

Load-bearing premise

That flow matching inside the rectified latent control space can reliably approximate the solution to the Hamilton-Jacobi-Bellman equation and thereby resolve trajectory singularity, adjoint state vanishing, and gradient absence.
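There is a classical reason the premise is not empty. In the Benamou-Brenier dynamic formulation of optimal transport (ref. 2 below), the kinetic-energy-minimizing velocity field moving noise to data is a gradient field whose potential solves a Hamilton-Jacobi equation, the noiseless special case of HJB. A schematic statement, not the paper's derivation:

```latex
% Among all (\rho_t, v_t) with \partial_t \rho + \nabla\!\cdot(\rho v) = 0
% transporting noise \rho_0 to data \rho_1, the minimizer of
% \int_0^1\!\!\int \rho_t \,\lVert v_t \rVert^2 \,dx\,dt satisfies
v_t = \nabla \varphi_t,
\qquad
\partial_t \varphi + \tfrac{1}{2}\,\lVert \nabla \varphi \rVert^2 = 0.
```

Rectified flow matching targets exactly such straight, energy-minimizing transport, which is the sense in which it can stand in for an HJB solver; whether that correspondence survives the paper's stochastic setting and learned latent space is the open question the referee presses below.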

What would settle it

An experiment testing whether parallel sampling with the trained Manta-LM produces measurably lower fidelity than sequential baselines on the same language modeling or conditional generation benchmarks: a persistent fidelity gap would falsify the core claim, while parity at a fraction of the sampling cost would support it.
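A hedged sketch of what such a settling run could look like; generate_parallel, generate_sequential, and fidelity are placeholders for the trained Manta-LM sampler, a sequential baseline, and one pre-registered fidelity metric (say, perplexity under an independent reference model). None of these names come from the paper:

```python
def settle(prompts, generate_parallel, generate_sequential, fidelity):
    """Paired comparison of parallel vs. sequential generation fidelity.

    The core claim predicts no measurable gap; a consistent gap in
    favor of the sequential baseline would falsify it.
    """
    gaps = []
    for p in prompts:
        par = fidelity(generate_parallel(p, steps=8))
        seq = fidelity(generate_sequential(p))
        gaps.append(seq - par)
    return sum(gaps) / len(gaps)  # report with a paired significance test
```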

Figures

Figures reproduced from arXiv:2605.14531 by Liang Lin, Pengxu Wei, Weijian Deng, Xiangyang Ji, Yuliang Huang, ZiYi Dong.

Figure 1. Generation dynamics. On a non-convex manifold, (a) AR and Diffusion are trapped in a slow, myopic crawl along the high-curvature density ridge. (b) In contrast, the proposed method approximates the global optimal trajectory, bypassing curvature via the rectified latent geometry (energy-minimizing geodesic) for improved efficiency.
Figure 2. Visualizing generative dynamics and error propagation on the BVP task. Color from pink to blue denotes generation progress. (a) AR suffers from compounding errors (blue lines in (d)) due to open-loop myopia, drifting off-manifold. (b) Discrete DLM relies on stochastic combinatorial search, showing jagged trajectories caused by geometric blindness (lack of gradients). (c) Manta-LM acts as an optimal closed-loop controller.
Figure 3. Geometric comparison. Unlike (a) autoregressive models' serial paths or (b-c) diffusion baselines' high-curvature trajectories in ill-conditioned spaces, (d) Manta-LM operates on a rectified latent manifold. The learned optimal vector field v_θ enables energy-minimizing, straight-line transport from noise to data.
Figure 4. Efficiency evaluation with inference throughput.
Figure 5. Stiffness analysis. The raw token (embedding) space exhibits extreme stiffness and high curvature, indicating an ill-conditioned control landscape that forces adaptive solvers (RK45) to high NFE. In contrast, the rectified latent space maintains low stiffness and near-linear trajectories, verifying the efficacy of the VAE. Panels: (a) Auto-Regressive (GPT-2), (b) Discrete Diffusion (RADD), (c) Manta-LM (ours).
Figure 6. Optimization landscapes of different generation paradigms. (a) AR exhibits sharp and unstable geometry. (b) Discrete diffusion leads to fragmented and irregular landscapes. (c) Manta-LM yields a smooth and well-conditioned landscape, enabling stable optimization.
Figure 7. Model structure and pipeline.
Figure 8. Analysis of the interplay between CFG guidance strength and integration fidelity. The caption contrasts an optimal regime (w ∈ [3.0, 5.0]), where metrics saturate within 20-30 steps, with an over-guided regime (w ≥ 7.0).
Figure 9. Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence "what was the best day of your life, and what happened?", the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.
Figure 10. Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence "how can i be a good geologist?", the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.
Figure 11. Visualizing error-correction capabilities across different models. Red text indicates corrupted or erroneous tokens introduced by noise, while yellow text denotes tokens that are semantically consistent with the ground-truth text but differ in surface form.
Figure 12. Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.
Figure 13. Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.
Figure 14. Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.
Original abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reformulates language generation as a stochastic optimal control problem, using this lens to unify autoregressive and diffusion models and attribute their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) to trajectory singularity, adjoint state vanishing, and gradient absence. It approximates the Hamilton-Jacobi-Bellman equation via Flow Matching as the trajectory solver inside a rectified latent control space, introducing Manta-LM equipped with a Global Integral Operator to approximate the global vector field. The resulting closed-loop controller is claimed to deliver both high-fidelity generation and efficient parallel sampling, with supporting empirical results on language modeling and conditional generation tasks.

Significance. If the Flow Matching step in rectified latent space can be shown to approximate the HJB solution with quantifiable error and to preserve optimality of the resulting policy, the work would supply a principled control-theoretic unification of existing generative paradigms and a concrete route to simultaneous fidelity and sampling efficiency. The closed-loop formulation and latent rectification are technically interesting and could influence future controllable generation research, provided the central approximation is rigorously justified.

major comments (3)
  1. [§4] Method: No derivation is supplied that equates the Flow Matching objective (or its rectified latent variant) to the HJB value function, nor are error bounds or optimality guarantees provided for the resulting policy once the Global Integral Operator is introduced. The claim that this construction resolves trajectory singularity, adjoint vanishing, and gradient absence therefore rests on an unproven modeling assumption rather than a demonstrated reduction.
  2. [§3.2] Reformulation: The mapping from the standard autoregressive or diffusion training objective to the stochastic control problem and its associated HJB PDE is stated at a high level but not derived step by step; it is therefore unclear whether the three listed paradoxes are formally equivalent to the cited singularities or merely heuristically linked.
  3. [Experiments] Reported gains on language modeling and conditional generation are not accompanied by ablations that isolate the contribution of the HJB approximation from the latent rectification or the integral operator; without such controls it is impossible to verify that the performance improvements stem from the claimed optimal-control mechanism.
minor comments (2)
  1. [§4.1] Notation for the Global Integral Operator and the rectified latent control space should be introduced with explicit equations before being used in the main claims.
  2. [Abstract] The abstract and introduction repeatedly use the phrase "approximate the global vector field" without defining the vector field or the sense in which the approximation is global; a clarifying sentence or equation would improve readability (one candidate gloss follows this report).
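For reference, the standard flow-matching sense of the phrase, offered as an editorial gloss rather than the paper's definition: a time-dependent vector field generates the probability path from noise to data through the continuity equation.

```latex
% v_\theta : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d generates p_t if
\partial_t p_t + \nabla \cdot \bigl( p_t \, v_\theta(\cdot, t) \bigr) = 0,
\qquad p_0 = \text{noise}, \quad p_1 = \text{data}.
```

On this reading, "global" would plausibly mean one field valid along the whole path t ∈ [0,1] and across the whole latent manifold, rather than a locally fitted per-step correction; the paper should say which.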

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the theoretical derivations and experimental analysis without altering the core claims of the work.

Point-by-point responses
  1. Referee: [§4] Method: No derivation is supplied that equates the Flow Matching objective (or its rectified latent variant) to the HJB value function, nor are error bounds or optimality guarantees provided for the resulting policy once the Global Integral Operator is introduced. The claim that this construction resolves trajectory singularity, adjoint vanishing, and gradient absence therefore rests on an unproven modeling assumption rather than a demonstrated reduction.

    Authors: We agree that the connection between Flow Matching in rectified latent space and the HJB equation requires a more explicit derivation. In the revised manuscript we will add a dedicated subsection deriving the equivalence of the Flow Matching objective to the HJB value function under the rectified control-space formulation, including first-order error bounds on the approximation and a discussion of the conditions under which the resulting policy remains optimal. This will replace the current high-level statement with a rigorous reduction. revision: yes

  2. Referee: [§3.2] Reformulation: The mapping from the standard autoregressive or diffusion training objective to the stochastic control problem and its associated HJB PDE is stated at a high level but not derived step by step; it is therefore unclear whether the three listed paradoxes are formally equivalent to the cited singularities or merely heuristically linked.

    Authors: We will expand §3.2 with a complete step-by-step derivation that starts from the standard autoregressive and diffusion objectives, maps them onto the stochastic optimal control problem, and arrives at the associated HJB PDE. The revised text will explicitly show how the Efficiency-Fidelity Paradox, Irreversibility Error Propagation, and Optimization Tractability issues arise as direct consequences of trajectory singularity, adjoint vanishing, and gradient absence, thereby establishing formal rather than heuristic equivalence. revision: yes

  3. Referee: [Experiments] Reported gains on language modeling and conditional generation are not accompanied by ablations that isolate the contribution of the HJB approximation from the latent rectification or the integral operator; without such controls it is impossible to verify that the performance improvements stem from the claimed optimal-control mechanism.

    Authors: We will add a new ablation study in the Experiments section that systematically disables the HJB approximation (replacing it with standard flow matching), removes latent rectification, and removes the Global Integral Operator while keeping all other components fixed. The revised results will report the isolated contribution of each element to perplexity, generation speed, and controllability metrics, allowing direct verification that the gains originate from the optimal-control formulation. revision: yes
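A minimal sketch of the promised ablation grid; the three flags are hypothetical stand-ins for the components named in the response, not the authors' actual configuration system:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class AblationConfig:
    hjb_objective: bool         # False -> plain flow-matching target
    latent_rectification: bool  # False -> train in raw embedding space
    integral_operator: bool     # False -> local per-step velocity net only

# The 2^3 grid isolates each component; every run would keep data, model
# size, and sampling budget fixed, varying only these three switches.
grid = [AblationConfig(*flags) for flags in product([True, False], repeat=3)]

for cfg in grid:
    print(cfg)  # placeholder for train_and_eval(cfg): perplexity, speed, control
```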

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper reformulates language generation as stochastic optimal control, identifies limitations via trajectory singularity/adjoint vanishing/gradient absence, and approximates the HJB solution by employing Flow Matching as trajectory solver inside a rectified latent control space together with the Global Integral Operator. No quoted step reduces a claimed prediction or optimality result to a quantity defined by the paper's own fitted parameters or by self-citation alone; the construction introduces new modeling elements (rectified latent space, integral operator) presented as external approximations rather than tautological redefinitions. The derivation therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the standard HJB equation from optimal control theory and the effectiveness of the flow-matching approximation in a newly introduced latent space; new entities Manta-LM and Global Integral Operator are postulated without independent evidence.

axioms (1)
  • standard math: The Hamilton-Jacobi-Bellman equation provides the optimal policy for the stochastic control formulation of language generation. Invoked directly as the target equation whose solution is approximated.
invented entities (2)
  • Manta-LM (no independent evidence). Purpose: proposed model realizing the closed-loop controller via global vector field approximation; a new model name and architecture introduced to implement the method.
  • Global Integral Operator (no independent evidence). Purpose: component that approximates the global vector field in the rectified latent control space; a new operator postulated to enable the parallel sampling capability.

pith-pipeline@v0.9.0 · 5480 in / 1327 out tokens · 41528 ms · 2026-05-15T01:42:24.365091+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 7 internal anchors

  1. Austin, J., Johnson, D. D., Ho, J., Tarlow, M., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, vol. 34, pp. 17981–17993, 2021.
  2. Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3): 375–393, 2000.
  3. Bertucci, C. Stochastic optimal transport and Hamilton-Jacobi-Bellman equations on the set of probability measures. Annales de l'Institut Henri Poincaré C, Analyse non linéaire, 2023. URL https://api.semanticscholar.org/CorpusID:259095954.
  4. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
  5. Bullo, F. and Lewis, A. D. Geometric Control of Mechanical Systems. 2004. URL https://api.semanticscholar.org/CorpusID:679624.
  6. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P. T., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. In Interspeech, 2013.
  7. Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., and Han, S. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024.
  8. Cheng, C., Li, J., Peng, J., and Liu, G. Categorical flow matching on statistical manifolds. Advances in Neural Information Processing Systems, 37: 54787–54819, 2024.
  9. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  10. DataCanary, hilfialkaff, Jiang, L., Risdal, M., Dandekar, N., and tomtung. Quora question pairs. Kaggle competition, 2017. https://kaggle.com/competitions/quora-question-pairs.
  11. Dhingra, B., Mazaitis, K., and Cohen, W. W. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904, 2017.
  12. Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
  13. Fleming, W. H. and Rishel, R. W. Deterministic and Stochastic Optimal Control. Springer Science & Business Media, 2012.
  14. Gokaslan, A. and Cohen, V. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  15. Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. DiffuSeq-v2: Bridging discrete and continuous text spaces for accelerated seq2seq diffusion models. In EMNLP, 2023.
  16. Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. DiffuSeq: Sequence to sequence text generation with diffusion models. In International Conference on Learning Representations (ICLR), 2023.
  17. Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024.
  18. Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36: 16693–16715, 2023.
  19. Han, X., Kumar, S., and Tsvetkov, Y. SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11575–11596, 2023.
  20. Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34: 12454–12465, 2021.
  21. Huang, S., Cheng, T., Liu, J. K., Xu, W., Hao, J., Song, L., Xu, Y., Yang, J., Liu, J., Zhang, C., et al. OpenCoder: The open cookbook for top-tier code large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 33167–33193, 2025.
  22. Jiang, C., Maddela, M., Lan, W., Zhong, Y., and Xu, W. Neural CRF model for sentence alignment in text simplification. arXiv preprint arXiv:2005.02324, 2020.
  23. Jo, J. and Hwang, S. J. Continuous diffusion model for language modeling. In Neural Information Processing Systems, 2025.
  24. Li, J., Du, L., Zhao, H., Zhang, B.-W., Wang, L., Gao, B., Liu, G., and Lin, Y. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025.
  25. Li, S., Gu, J., Liu, K., Lin, Z., Wei, Z., Grover, A., and Kuen, J. Lavida-O: Elastic large masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244, 2025.
  26. Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-LM improves controllable text generation. Advances in Neural Information Processing Systems, 35: 4328–4343, 2022.
  27. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In International Conference on Learning Representations, 2023.
  28. Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
  29. Mahabadi, R. K., Ivison, H., Tae, J., Henderson, J., Beltagy, I., Peters, M. E., and Cohan, A. TESS: Text-to-text self-conditioned simplex diffusion. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2347–2361, 2024.
  30. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.
  31. Moshkov, I., Hanley, D., Sorokin, I., Toshniwal, S., Henkel, C., Schifferer, B., Du, W., and Gitman, I. AIMO-2 winning solution: Building state-of-the-art mathematical reasoning models with the OpenMathReasoning dataset. arXiv preprint arXiv:2504.16891, 2025.
  32. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. In Neural Information Processing Systems, 2025.
  33. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
  34. Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
  35. Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, 2016.
  36. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019.
  37. Sahoo, S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37: 130136–130184, 2024.
  38. Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37: 103131–103167, 2024.
  39. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
  40. Strudel, R., Tallec, C., Altché, F., Du, Y., Ganin, Y., Mensch, A., Grathwohl, W., Savinov, N., Dieleman, S., Sifre, L., et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.
  41. Sun, H., Yu, L., Dai, B., Schuurmans, D., and Dai, H. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022.
  42. Swerdlow, A., Prabhudesai, M., Gandhi, S., Pathak, D., and Fragkiadaki, K. Unified multimodal discrete diffusion. arXiv preprint arXiv:2503.20853, 2025.
  43. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  44. Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K.-Y., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., et al. Qwen2 technical report. arXiv preprint, 2024.
  45. Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025.
  46. Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.
  47. Zhou, H., Young, T., Huang, M., Zhao, H., Xu, J., and Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, vol. 18, pp. 4623–4629, 2018.