pith · machine review for the scientific record

arXiv:2605.14531 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: 3 Lean theorem links

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords: language generation · stochastic optimal control · Hamilton-Jacobi-Bellman equation · flow matching · diffusion models · latent control space · parallel sampling · Manta-LM

The pith

Reformulating language generation as stochastic optimal control lets flow matching approximate the HJB equation in latent space, yielding a model that pairs high-fidelity generation with cheap, parallel sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper casts language generation as a stochastic optimal control problem to unify the analysis of autoregressive and diffusion models, attributing their shared limitations to trajectory singularity, adjoint state vanishing, and gradient absence. Rather than solving the resulting Hamilton-Jacobi-Bellman (HJB) equation directly, it approximates its solution with flow matching inside a rectified latent control space, where the Manta-LM model and its Global Integral Operator estimate the global vector field. A reader would care because the approach promises to break the efficiency-fidelity tradeoff while keeping sampling cheap and parallel.
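For orientation, these are the standard objects behind the control framing: a schematic, textbook-form sketch following Fleming and Rishel (ref. 13 below); the paper's own state space, cost, and noise model may differ.

```latex
% Latent state z_t steered by a control u_t under Brownian noise W_t:
%   dz_t = u_t dt + \sigma dW_t.
% Value function: best achievable cost-to-go from state z at time t,
%   V(z,t) = \min_u \mathbb{E}\bigl[\textstyle\int_t^T \ell(z_s,u_s)\,ds + \Phi(z_T)\,\big|\,z_t = z\bigr].
% Dynamic programming yields the Hamilton-Jacobi-Bellman PDE:
-\partial_t V = \min_{u}\bigl\{\,\ell(z,u) + u^{\top}\nabla_z V\,\bigr\}
              + \tfrac{\sigma^2}{2}\,\Delta_z V,
\qquad V(z,T) = \Phi(z).
```

For a quadratic control cost ℓ(z,u) = ½‖u‖², the minimizer is u*(z,t) = -∇_z V(z,t): the optimal policy is a feedback (closed-loop) controller that reads the current state, which is exactly the property the paper wants from a generator.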

Core claim

By reformulating language generation as a stochastic optimal control problem, and by using Flow Matching as the optimal trajectory solver within the rectified latent control space to approximate the solution of the Hamilton-Jacobi-Bellman equation, Manta-LM with its Global Integral Operator approximates the global vector field. The result, the paper claims, is a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling.

What carries the argument

Flow Matching acting as the optimal trajectory solver inside the rectified latent control space, with the Global Integral Operator in Manta-LM used to approximate the global vector field for the closed-loop controller.
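Concretely, "flow matching" here means the standard conditional flow-matching / rectified-flow objective (Lipman et al., ref. 27 below). A minimal training-step sketch; the latent shapes and the name v_theta are illustrative assumptions, not the paper's architecture, and the Global Integral Operator (however it is defined) would live inside the network:

```python
import torch

def flow_matching_loss(v_theta, z_data):
    """One step of the standard conditional flow-matching objective.

    v_theta : network predicting a velocity field v(z_t, t)
    z_data  : batch of latent sequences, shape (B, L, D),
              e.g. VAE encodings of token sequences
    """
    z0 = torch.randn_like(z_data)          # noise endpoint of the path
    t = torch.rand(z_data.size(0), 1, 1)   # per-example time in [0, 1]
    zt = (1 - t) * z0 + t * z_data         # straight-line interpolant
    target = z_data - z0                   # its (constant) velocity
    pred = v_theta(zt, t.view(-1))         # model's velocity estimate
    return ((pred - target) ** 2).mean()   # regress onto the velocity
```

The straight-line interpolant is the "rectified" part at the path level; the paper's rectified latent control space additionally refers to the VAE geometry the trajectories live in (see Figure 5).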

If this is right

  • The Efficiency-Fidelity Paradox, Irreversibility Error Propagation, and Optimization Tractability issues are bypassed.
  • High-fidelity text generation becomes compatible with efficient, low-cost parallel sampling (see the sampling sketch after this list).
  • Improved stability, efficiency, and controllability appear on language modeling and conditional generation tasks.
  • Direct solution of the HJB PDE is avoided while still obtaining an optimal policy that acts as a closed-loop controller.
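Mechanically, the promised parallel sampling is few-step ODE integration of the learned field: every latent position updates simultaneously, so inference cost scales with the step count rather than the sequence length. A minimal sketch under the same illustrative assumptions as above; the paper's Global Integral Operator presumably enables the large steps, but its form is not specified in the material quoted here:

```python
import torch

@torch.no_grad()
def sample(v_theta, batch, seq_len, dim, steps=8):
    """Few-step Euler integration of the learned velocity field."""
    z = torch.randn(batch, seq_len, dim)   # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch,), i * dt)
        z = z + dt * v_theta(z, t)         # all positions updated in parallel
    return z                               # decode with the VAE afterwards
```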

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same latent-control reformulation could be tested on non-text sequence tasks such as code or protein generation.
  • The closed-loop controller structure may support real-time steering of generation outputs without retraining.
  • If the global vector field approximation holds, similar flow-matching solvers could be tried on other intractable optimal-control problems in machine learning.

Load-bearing premise

That flow matching inside the rectified latent control space can reliably approximate the solution to the Hamilton-Jacobi-Bellman equation and thereby resolve trajectory singularity, adjoint state vanishing, and gradient absence.
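There is a classical reason the premise is not empty. In the Benamou-Brenier dynamic formulation of optimal transport (ref. 2 below), the kinetic-energy-minimizing velocity field moving noise to data is a gradient field whose potential solves a Hamilton-Jacobi equation, the noiseless special case of HJB. A schematic statement, not the paper's derivation:

```latex
% Among all (\rho_t, v_t) with \partial_t \rho + \nabla\!\cdot(\rho v) = 0
% transporting noise \rho_0 to data \rho_1, the minimizer of
% \int_0^1\!\!\int \rho_t \,\lVert v_t \rVert^2 \,dx\,dt satisfies
v_t = \nabla \varphi_t,
\qquad
\partial_t \varphi + \tfrac{1}{2}\,\lVert \nabla \varphi \rVert^2 = 0.
```

Rectified flow matching targets exactly such straight, energy-minimizing transport, which is the sense in which it can stand in for an HJB solver; whether that correspondence survives the paper's stochastic setting and learned latent space is the open question the referee presses below.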

What would settle it

An experiment testing whether parallel sampling with the trained Manta-LM produces measurably lower fidelity than sequential baselines on the same language modeling or conditional generation benchmarks: a persistent fidelity gap would falsify the core claim, while parity at a fraction of the sampling cost would support it.
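A hedged sketch of what such a settling run could look like; generate_parallel, generate_sequential, and fidelity are placeholders for the trained Manta-LM sampler, a sequential baseline, and one pre-registered fidelity metric (say, perplexity under an independent reference model). None of these names come from the paper:

```python
def settle(prompts, generate_parallel, generate_sequential, fidelity):
    """Paired comparison of parallel vs. sequential generation fidelity.

    The core claim predicts no measurable gap; a consistent gap in
    favor of the sequential baseline would falsify it.
    """
    gaps = []
    for p in prompts:
        par = fidelity(generate_parallel(p, steps=8))
        seq = fidelity(generate_sequential(p))
        gaps.append(seq - par)
    return sum(gaps) / len(gaps)  # report with a paired significance test
```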

Figures

Figures reproduced from arXiv:2605.14531 by Liang Lin, Pengxu Wei, Weijian Deng, Xiangyang Ji, Yuliang Huang, ZiYi Dong.

Figure 1. Generation dynamics. On a non-convex manifold, (a) AR and Diffusion are trapped in a slow, myopic crawl along the high-curvature density ridge. (b) In contrast, the proposed method approximates the global optimal trajectory, bypassing curvature via the rectified latent geometry (energy-minimizing geodesic) for improved efficiency.
Figure 2. Visualizing generative dynamics and error propagation on the BVP task. Color from pink to blue denotes generation progress. (a) AR suffers from compounding errors (blue lines in (d)) due to open-loop myopia, drifting off-manifold. (b) Discrete DLM relies on stochastic combinatorial search, showing jagged trajectories caused by geometric blindness (lack of gradients). (c) Manta-LM acts as an optimal closed-loop controller.
Figure 3. Geometric comparison. Unlike (a) autoregressive models' serial paths or (b-c) diffusion baselines' high-curvature trajectories in ill-conditioned spaces, (d) Manta-LM operates on a rectified latent manifold. The learned optimal vector field v_θ enables energy-minimizing, straight-line transport from noise to data.
Figure 4. Efficiency evaluation with inference throughput.
Figure 5. Stiffness analysis. The raw token (embedding) space exhibits extreme stiffness and high curvature, indicating an ill-conditioned control landscape that forces adaptive solvers (RK45) to high NFE. In contrast, the rectified latent space maintains low stiffness and near-linear trajectories, verifying the efficacy of the VAE. Panels: (a) Auto-Regressive (GPT-2), (b) Discrete Diffusion (RADD), (c) Manta-LM (ours).
Figure 6. Optimization landscapes of different generation paradigms. (a) AR exhibits sharp and unstable geometry. (b) Discrete diffusion leads to fragmented and irregular landscapes. (c) Manta-LM yields a smooth and well-conditioned landscape, enabling stable optimization.
Figure 7. Model structure and pipeline.
Figure 8. Analysis of the interplay between CFG guidance strength and integration fidelity. The caption contrasts an optimal regime (w ∈ [3.0, 5.0]), where metrics saturate within 20-30 steps, with an over-guided regime (w ≥ 7.0).
Figure 9. Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence "what was the best day of your life, and what happened?", the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.
Figure 10. Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence "how can i be a good geologist?", the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.
Figure 11. Visualizing error-correction capabilities across different models. Red text indicates corrupted or erroneous tokens introduced by noise, while yellow text denotes tokens that are semantically consistent with the ground-truth text but differ in surface form.
Figure 12. Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.
Figure 13. Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.
Figure 14. Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model's generated results.
Original abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reformulates language generation as a stochastic optimal control problem, using this lens to unify autoregressive and diffusion models and attribute their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) to trajectory singularity, adjoint state vanishing, and gradient absence. It approximates the Hamilton-Jacobi-Bellman equation via Flow Matching as the trajectory solver inside a rectified latent control space, introducing Manta-LM equipped with a Global Integral Operator to approximate the global vector field. The resulting closed-loop controller is claimed to deliver both high-fidelity generation and efficient parallel sampling, with supporting empirical results on language modeling and conditional generation tasks.

Significance. If the Flow Matching step in rectified latent space can be shown to approximate the HJB solution with quantifiable error and to preserve optimality of the resulting policy, the work would supply a principled control-theoretic unification of existing generative paradigms and a concrete route to simultaneous fidelity and sampling efficiency. The closed-loop formulation and latent rectification are technically interesting and could influence future controllable generation research, provided the central approximation is rigorously justified.

major comments (3)
  1. [§4] Method: No derivation is supplied that equates the Flow Matching objective (or its rectified latent variant) to the HJB value function, nor are error bounds or optimality guarantees provided for the resulting policy once the Global Integral Operator is introduced. The claim that this construction resolves trajectory singularity, adjoint vanishing, and gradient absence therefore rests on an unproven modeling assumption rather than a demonstrated reduction.
  2. [§3.2] Reformulation: The mapping from the standard autoregressive or diffusion training objective to the stochastic control problem and its associated HJB PDE is stated at a high level but not derived step by step; it is therefore unclear whether the three listed paradoxes are formally equivalent to the cited singularities or merely heuristically linked.
  3. [Experiments] Reported gains on language modeling and conditional generation are not accompanied by ablations that isolate the contribution of the HJB approximation from the latent rectification or the integral operator; without such controls it is impossible to verify that the performance improvements stem from the claimed optimal-control mechanism.
minor comments (2)
  1. [§4.1] Notation for the Global Integral Operator and the rectified latent control space should be introduced with explicit equations before being used in the main claims.
  2. [Abstract] The abstract and introduction repeatedly use the phrase "approximate the global vector field" without defining the vector field or the sense in which the approximation is global; a clarifying sentence or equation would improve readability (one candidate gloss follows this report).
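For reference, the standard flow-matching sense of the phrase, offered as an editorial gloss rather than the paper's definition: a time-dependent vector field generates the probability path from noise to data through the continuity equation.

```latex
% v_\theta : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d generates p_t if
\partial_t p_t + \nabla \cdot \bigl( p_t \, v_\theta(\cdot, t) \bigr) = 0,
\qquad p_0 = \text{noise}, \quad p_1 = \text{data}.
```

On this reading, "global" would plausibly mean one field valid along the whole path t ∈ [0,1] and across the whole latent manifold, rather than a locally fitted per-step correction; the paper should say which.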

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the theoretical derivations and experimental analysis without altering the core claims of the work.

Point-by-point responses
  1. Referee: [§4] Method: No derivation is supplied that equates the Flow Matching objective (or its rectified latent variant) to the HJB value function, nor are error bounds or optimality guarantees provided for the resulting policy once the Global Integral Operator is introduced. The claim that this construction resolves trajectory singularity, adjoint vanishing, and gradient absence therefore rests on an unproven modeling assumption rather than a demonstrated reduction.

    Authors: We agree that the connection between Flow Matching in rectified latent space and the HJB equation requires a more explicit derivation. In the revised manuscript we will add a dedicated subsection deriving the equivalence of the Flow Matching objective to the HJB value function under the rectified control-space formulation, including first-order error bounds on the approximation and a discussion of the conditions under which the resulting policy remains optimal. This will replace the current high-level statement with a rigorous reduction. revision: yes

  2. Referee: [§3.2] Reformulation: The mapping from the standard autoregressive or diffusion training objective to the stochastic control problem and its associated HJB PDE is stated at a high level but not derived step by step; it is therefore unclear whether the three listed paradoxes are formally equivalent to the cited singularities or merely heuristically linked.

    Authors: We will expand §3.2 with a complete step-by-step derivation that starts from the standard autoregressive and diffusion objectives, maps them onto the stochastic optimal control problem, and arrives at the associated HJB PDE. The revised text will explicitly show how the Efficiency-Fidelity Paradox, Irreversibility Error Propagation, and Optimization Tractability issues arise as direct consequences of trajectory singularity, adjoint vanishing, and gradient absence, thereby establishing formal rather than heuristic equivalence. revision: yes

  3. Referee: [Experiments] Reported gains on language modeling and conditional generation are not accompanied by ablations that isolate the contribution of the HJB approximation from the latent rectification or the integral operator; without such controls it is impossible to verify that the performance improvements stem from the claimed optimal-control mechanism.

    Authors: We will add a new ablation study in the Experiments section that systematically disables the HJB approximation (replacing it with standard flow matching), removes latent rectification, and removes the Global Integral Operator while keeping all other components fixed. The revised results will report the isolated contribution of each element to perplexity, generation speed, and controllability metrics, allowing direct verification that the gains originate from the optimal-control formulation. revision: yes
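A minimal sketch of the promised ablation grid; the three flags are hypothetical stand-ins for the components named in the response, not the authors' actual configuration system:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class AblationConfig:
    hjb_objective: bool         # False -> plain flow-matching target
    latent_rectification: bool  # False -> train in raw embedding space
    integral_operator: bool     # False -> local per-step velocity net only

# The 2^3 grid isolates each component; every run would keep data, model
# size, and sampling budget fixed, varying only these three switches.
grid = [AblationConfig(*flags) for flags in product([True, False], repeat=3)]

for cfg in grid:
    print(cfg)  # placeholder for train_and_eval(cfg): perplexity, speed, control
```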

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper reformulates language generation as stochastic optimal control, identifies limitations via trajectory singularity/adjoint vanishing/gradient absence, and approximates the HJB solution by employing Flow Matching as trajectory solver inside a rectified latent control space together with the Global Integral Operator. No quoted step reduces a claimed prediction or optimality result to a quantity defined by the paper's own fitted parameters or by self-citation alone; the construction introduces new modeling elements (rectified latent space, integral operator) presented as external approximations rather than tautological redefinitions. The derivation therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the standard HJB equation from optimal control theory and the effectiveness of the flow-matching approximation in a newly introduced latent space; new entities Manta-LM and Global Integral Operator are postulated without independent evidence.

axioms (1)
  • standard math: The Hamilton-Jacobi-Bellman equation provides the optimal policy for the stochastic control formulation of language generation. Invoked directly as the target equation whose solution is approximated.
invented entities (2)
  • Manta-LM (no independent evidence). Purpose: proposed model realizing the closed-loop controller via global vector field approximation; a new model name and architecture introduced to implement the method.
  • Global Integral Operator (no independent evidence). Purpose: component that approximates the global vector field in the rectified latent control space; a new operator postulated to enable the parallel sampling capability.

pith-pipeline@v0.9.0 · 5480 in / 1327 out tokens · 41528 ms · 2026-05-15T01:42:24.365091+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 7 internal anchors

  1. Austin, J., Johnson, D. D., Ho, J., Tarlow, M., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, vol. 34, pp. 17981–17993, 2021.
  2. Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3): 375–393, 2000.
  3. Bertucci, C. Stochastic optimal transport and Hamilton-Jacobi-Bellman equations on the set of probability measures. Annales de l'Institut Henri Poincaré C, Analyse non linéaire, 2023. URL https://api.semanticscholar.org/CorpusID:259095954.
  4. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
  5. Bullo, F. and Lewis, A. D. Geometric Control of Mechanical Systems. 2004. URL https://api.semanticscholar.org/CorpusID:679624.
  6. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P. T., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. In Interspeech, 2013.
  7. Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., and Han, S. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024.
  8. Cheng, C., Li, J., Peng, J., and Liu, G. Categorical flow matching on statistical manifolds. Advances in Neural Information Processing Systems, 37: 54787–54819, 2024.
  9. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  10. DataCanary, hilfialkaff, Jiang, L., Risdal, M., Dandekar, N., and tomtung. Quora question pairs. Kaggle competition, 2017. https://kaggle.com/competitions/quora-question-pairs.
  11. Dhingra, B., Mazaitis, K., and Cohen, W. W. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904, 2017.
  12. Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
  13. Fleming, W. H. and Rishel, R. W. Deterministic and Stochastic Optimal Control. Springer Science & Business Media, 2012.
  14. Gokaslan, A. and Cohen, V. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  15. Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. DiffuSeq-v2: Bridging discrete and continuous text spaces for accelerated seq2seq diffusion models. In EMNLP, 2023.
  16. Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. DiffuSeq: Sequence to sequence text generation with diffusion models. In International Conference on Learning Representations (ICLR), 2023.
  17. Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024.
  18. Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36: 16693–16715, 2023.
  19. Han, X., Kumar, S., and Tsvetkov, Y. SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11575–11596, 2023.
  20. Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34: 12454–12465, 2021.
  21. Huang, S., Cheng, T., Liu, J. K., Xu, W., Hao, J., Song, L., Xu, Y., Yang, J., Liu, J., Zhang, C., et al. OpenCoder: The open cookbook for top-tier code large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 33167–33193, 2025.
  22. Jiang, C., Maddela, M., Lan, W., Zhong, Y., and Xu, W. Neural CRF model for sentence alignment in text simplification. arXiv preprint arXiv:2005.02324, 2020.
  23. Jo, J. and Hwang, S. J. Continuous diffusion model for language modeling. In Neural Information Processing Systems, 2025.
  24. Li, J., Du, L., Zhao, H., Zhang, B.-W., Wang, L., Gao, B., Liu, G., and Lin, Y. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025.
  25. Li, S., Gu, J., Liu, K., Lin, Z., Wei, Z., Grover, A., and Kuen, J. Lavida-O: Elastic large masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244, 2025.
  26. Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-LM improves controllable text generation. Advances in Neural Information Processing Systems, 35: 4328–4343, 2022.
  27. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In International Conference on Learning Representations, 2023.
  28. Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
  29. Mahabadi, R. K., Ivison, H., Tae, J., Henderson, J., Beltagy, I., Peters, M. E., and Cohan, A. TESS: Text-to-text self-conditioned simplex diffusion. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2347–2361, 2024.
  30. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.
  31. Moshkov, I., Hanley, D., Sorokin, I., Toshniwal, S., Henkel, C., Schifferer, B., Du, W., and Gitman, I. AIMO-2 winning solution: Building state-of-the-art mathematical reasoning models with the OpenMathReasoning dataset. arXiv preprint arXiv:2504.16891, 2025.
  32. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. In Neural Information Processing Systems, 2025.
  33. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.
  34. Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
  35. Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, 2016.
  36. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019.
  37. Sahoo, S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37: 130136–130184, 2024.
  38. Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37: 103131–103167, 2024.
  39. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
  40. Strudel, R., Tallec, C., Altché, F., Du, Y., Ganin, Y., Mensch, A., Grathwohl, W., Savinov, N., Dieleman, S., Sifre, L., et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.
  41. Sun, H., Yu, L., Dai, B., Schuurmans, D., and Dai, H. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022.
  42. Swerdlow, A., Prabhudesai, M., Gandhi, S., Pathak, D., and Fragkiadaki, K. Unified multimodal discrete diffusion. arXiv preprint arXiv:2503.20853, 2025.
  43. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  44. Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K.-Y., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., et al. Qwen2 technical report. arXiv preprint, 2024.
  45. Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025.
  46. Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.
  47. Zhou, H., Young, T., Huang, M., Zhao, H., Xu, J., and Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, vol. 18, pp. 4623–4629, 2018.