Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment
Pith reviewed 2026-05-11 01:21 UTC · model grok-4.3
The pith
Aligning hidden states lets diffusion language models reuse autoregressive representations and train up to 4x faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that aligning the hidden states of a bidirectional masked diffusion model to those of a pretrained autoregressive model of identical architecture, using cosine similarity at every layer, transfers semantic structure across generation orders. This lets the diffusion model focus on learning the decoding path. The resulting REPR-ALIGN procedure accelerates training and improves sample efficiency without extra parameters.
What carries the argument
REPR-ALIGN, a layer-wise cosine similarity loss between the hidden states of the frozen autoregressive model and the diffusion model, added to the standard masked denoising objective.
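The alignment term described above can be sketched in a few lines. The tensor shapes, the epsilon guard, and the choice to average over layers, positions, and batch are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def cosine_alignment_loss(dlm_states, ar_states, eps=1e-8):
    """Layer-wise cosine alignment loss: 1 - cosine similarity between
    DLM and frozen AR hidden states, averaged over layers, positions,
    and batch. Each argument is a list (one entry per layer) of arrays
    shaped (batch, seq_len, hidden)."""
    total = 0.0
    for h_dlm, h_ar in zip(dlm_states, ar_states):
        num = np.sum(h_dlm * h_ar, axis=-1)
        denom = (np.linalg.norm(h_dlm, axis=-1)
                 * np.linalg.norm(h_ar, axis=-1) + eps)
        total += np.mean(1.0 - num / denom)
    return total / len(dlm_states)
```

In training, this scalar would simply be added (with some weight) to the standard masked denoising loss; the AR states carry no gradient because that model is frozen.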
If this is right
- Diffusion language models can reach target performance with up to four times fewer training steps.
- The speedup is largest in low-data regimes where full retraining would otherwise be expensive.
- No architectural modifications or added modules are required beyond switching to bidirectional attention.
- Linguistic representations learned under autoregressive training can transfer across different generation orders.
Where Pith is reading between the lines
- Existing large autoregressive checkpoints could serve as starting points for diffusion variants, reducing the need to repeat expensive pretraining for new generation paradigms.
- The same alignment idea might let practitioners switch between autoregressive and diffusion modes within a single model family without full retraining.
- If the transfer holds, it implies that core language capabilities are largely decoupled from the specific order in which tokens are generated during training.
Load-bearing premise
That the internal representations learned by next-token prediction contain semantic structure that remains useful when generation shifts to masked diffusion.
What would settle it
Training an identical diffusion model from the same starting point but without the cosine alignment loss and checking whether it requires substantially more steps to reach the same performance on the same data.
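That comparison reduces to measuring steps-to-target under two training curves. A minimal harness, with toy exponential loss curves standing in for real runs (the 4x decay gap below is illustrative, not the paper's data):

```python
import math

def steps_to_target(loss_at_step, target_loss, max_steps=100_000):
    """Return the first step at which the validation loss reaches the
    target, or None if it never does within max_steps."""
    for step in range(1, max_steps + 1):
        if loss_at_step(step) <= target_loss:
            return step
    return None

# Toy stand-in curves: aligned training assumed to decay 4x faster.
baseline = lambda s: math.exp(-s / 4000)
aligned = lambda s: math.exp(-s / 1000)

speedup = steps_to_target(baseline, 0.1) / steps_to_target(aligned, 0.1)
```

With real runs, `loss_at_step` would be replaced by logged validation losses from the with- and without-alignment ablations on identical data.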
Figures
Original abstract
Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at https://github.com/pengzhangzhi/Open-dLLM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes REPR-ALIGN, a simple representation alignment method to adapt pretrained autoregressive (AR) language models to diffusion language models (DLMs). It freezes a causal AR model of identical architecture and aligns its per-layer hidden states to the bidirectional DLM states via cosine similarity while optimizing the standard masked denoising objective. The central hypothesis is that much of the semantic structure from next-token prediction transfers across generation orders, so DLM training reduces to learning a new decoding path. The authors report that this yields up to 4x training acceleration (particularly effective in low-data regimes) with no adapters or architectural changes beyond the attention mask, and they release code at https://github.com/pengzhangzhi/Open-dLLM.
Significance. If the empirical acceleration and low-data benefits hold under rigorous controls, the work would be significant for reducing compute in DLM training by reusing AR representations. It offers a lightweight alternative to full continued pretraining or adapter-based conversion, and the open code supports reproducibility. The hypothesis that linguistic geometry is largely order-independent could influence future work on cross-paradigm transfer in generative models.
Major comments (2)
- [§2–3 (Hypothesis and REPR-ALIGN)] The core hypothesis (§2 and §3) that AR hidden states encode order-independent semantic structure that can be directly reused by a bidirectional DLM is load-bearing for interpreting the reported acceleration as representation transfer rather than auxiliary regularization. Because AR states at position i are computed under a causal mask (tokens 1..i only) while DLM states use bidirectional context (full sequence minus masks), the cosine alignment necessarily operates on incompatible dependency structures. This risks confounding the 4x speedup claim; a control aligning the DLM to a randomly initialized or shuffled AR model would be required to isolate genuine transfer.
- [Abstract and Experiments section] The abstract and results claim 'up to 4x training acceleration' and particular effectiveness in low-data regimes, yet the manuscript provides no quantitative tables, baseline comparisons (e.g., standard DLM training from scratch or with adapters), ablation of the cosine term, statistical significance, or exact settings (model size, dataset, steps). Without these, the central empirical claim cannot be evaluated and the low-data benefit remains unverified.
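The dependency mismatch in the first comment can be made concrete by contrasting the two attention patterns. How the DLM hides currently masked tokens is an assumption for illustration here, not a detail from the source:

```python
import numpy as np

def causal_mask(n):
    """AR attention: position i attends only to tokens 0..i
    (lower-triangular True)."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n, masked_positions=()):
    """DLM attention: every position sees the full sequence except the
    currently masked tokens."""
    m = np.ones((n, n), dtype=bool)
    for j in masked_positions:
        m[:, j] = False
    return m
```

Aligning hidden states computed under the first pattern to states computed under the second is exactly the confound the referee wants controlled for.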
Minor comments (2)
- [§3] The alignment loss is described only in prose; adding an explicit equation (e.g., L_align = sum_l (1 - cos(h_DLM^l, h_AR^l))) would improve clarity and allow readers to see the weighting relative to the denoising loss.
- [Figures] Figure captions and axis labels should explicitly state the y-axis metric (e.g., validation loss or perplexity) and the exact comparison baseline for the '4x' curves.
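The equation the first minor comment asks for could be rendered as follows, with λ an assumed weighting hyperparameter that the source does not specify:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{denoise}}
  + \lambda \sum_{l=1}^{L}
    \left(1 - \cos\!\left(h^{l}_{\text{DLM}},\, h^{l}_{\text{AR}}\right)\right)
```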
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical support and clarify the hypothesis without altering the core claims.
Point-by-point responses
- Referee: The core hypothesis (§2 and §3) that AR hidden states encode order-independent semantic structure that can be directly reused by a bidirectional DLM is load-bearing for interpreting the reported acceleration as representation transfer rather than auxiliary regularization. Because AR states at position i are computed under a causal mask (tokens 1..i only) while DLM states use bidirectional context (full sequence minus masks), the cosine alignment necessarily operates on incompatible dependency structures. This risks confounding the 4x speedup claim; a control aligning the DLM to a randomly initialized or shuffled AR model would be required to isolate genuine transfer.
- Authors: We agree that the differing masks create a potential confound and that the hypothesis would be strengthened by explicit controls. While the alignment objective is applied to the same layer indices and the DLM still optimizes the denoising loss, we will add a control experiment in the revised Section 4 that aligns the DLM to a randomly initialized AR model of identical architecture. We expect this to produce substantially weaker acceleration, isolating the contribution of the pretrained representations. This addition will be accompanied by discussion of the dependency mismatch. Revision: yes.
- Referee: The abstract and results claim 'up to 4x training acceleration' and particular effectiveness in low-data regimes, yet the manuscript provides no quantitative tables, baseline comparisons (e.g., standard DLM training from scratch or with adapters), ablation of the cosine term, statistical significance, or exact settings (model size, dataset, steps). Without these, the central empirical claim cannot be evaluated and the low-data benefit remains unverified.
- Authors: We apologize for the insufficient detail in the initial submission. The experiments section contains some comparisons, but we will expand it substantially. The revision will include: (i) full quantitative tables with training curves and final metrics versus from-scratch DLM training and adapter baselines; (ii) an ablation removing the cosine alignment term; (iii) results over multiple random seeds with error bars and significance tests; and (iv) exact specifications for model sizes, datasets, batch sizes, and step counts. These changes will make the 4x acceleration and low-data claims directly evaluable. Revision: yes.
Circularity Check
No circularity: empirical alignment procedure with external frozen model
full rationale
The paper defines REPR-ALIGN as the sum of the standard masked denoising loss and a cosine-similarity term between DLM hidden states and those of a separately pretrained, frozen AR model. Reported speedups and low-data gains are measured outcomes of training runs, not quantities that reduce by construction to fitted constants or to the alignment objective itself. No equations, predictions, or uniqueness claims are shown to collapse into self-referential definitions or self-citation chains. The transfer hypothesis is stated as a testable assumption and evaluated experimentally rather than smuggled in via prior author work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Hidden-state representations learned by next-token prediction contain semantic structure that is largely independent of the generation order used at inference time.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness) · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Passage: "we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Passage: "linguistic representations can transfer across generation order"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.