Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

Chongxuan Li; Fengqi Zhu; Jiacheng Sun; Jingyang Ou; Kaiwen Xue; Shen Nie; Zhenguo Li

arxiv: 2406.03736 · v4 · submitted 2024-06-06 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-18 12:00 UTC · model grok-4.3

keywords: discrete diffusion · absorbing processes · concrete score · language modeling · any-order autoregressive models · conditional distributions · reparameterization

The pith

The concrete score in absorbing discrete diffusion equals conditional probabilities of clean data multiplied by an analytic time-dependent scalar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the key quantity tracked by absorbing discrete diffusion models, the concrete score, can be rewritten exactly as the conditional probability of the original clean sequence given the current noisy state, scaled by a closed-form factor that depends only on time. Because the conditionals themselves do not depend on time, diffusion training directly fits models of clean-data conditionals and the network needs no time conditioning. The authors therefore introduce RADD, a reparameterized network that drops time conditioning entirely and caches its output across sampling intervals in which the noisy sample stays unchanged. They also show that the diffusion upper bound on negative log-likelihood is equivalent to an expected negative log-likelihood under any-order autoregressive modeling. This matters for language modeling because it simplifies architecture, speeds up inference, and reveals a direct bridge between two families of generative models.

Core claim

The concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, the authors propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Built upon the new perspective of conditional distributions, they further unify absorbing discrete diffusion and any-order autoregressive models, showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs.

What carries the argument

The concrete score, defined as the ratio of marginal probabilities between two transitive states, which the paper rewrites in closed form as the product of clean-data conditionals and a known time-dependent multiplier.

If this is right

RADD trains a network that outputs only time-independent conditional probabilities of clean data.
Sampling accelerates by reusing the cached network output whenever the noisy sample does not change within a sampling interval.
The diffusion training objective supplies an upper bound that equals the expected negative log-likelihood of any-order autoregressive models.
RADD reaches state-of-the-art perplexity among diffusion models on five zero-shot language-modeling benchmarks at GPT-2 scale.
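
The caching idea in the list above can be illustrated with a toy sampler. This is a minimal sketch, not the authors' implementation: the function names, the linear masking schedule, and the uniform stand-in network are all assumptions made for illustration. It only shows why a time-independent network needs to be re-evaluated just when the noisy sample changes.

```python
import numpy as np

def sample_with_cache(net, length, vocab, steps, rng):
    """Toy absorbing-diffusion sampler that reuses the network output
    for as long as the noisy sample x does not change."""
    MASK = vocab  # assumption: index `vocab` is the absorbing [MASK] state
    x = np.full(length, MASK)
    # Assumed linear schedule: per-token mask probability goes from 1 to 0.
    mask_probs = np.linspace(1.0, 0.0, steps + 1)
    cached, nfe = None, 0
    for m_t, m_s in zip(mask_probs[:-1], mask_probs[1:]):
        masked = x == MASK
        if not masked.any():
            break  # nothing left to denoise
        if cached is None:
            cached = net(x)  # (length, vocab) array, each row sums to 1
            nfe += 1
        # Each masked token is revealed with probability (m_t - m_s) / m_t.
        unmask = masked & (rng.random(length) < (m_t - m_s) / m_t)
        for i in np.flatnonzero(unmask):
            x[i] = rng.choice(vocab, p=cached[i])
        if unmask.any():
            cached = None  # x changed, so the cached output is stale
    return x, nfe

def uniform_net(x, vocab=5):
    """Hypothetical stand-in for a trained time-independent network."""
    return np.full((len(x), vocab), 1.0 / vocab)

sample, nfe = sample_with_cache(uniform_net, length=8, vocab=5, steps=1000,
                                rng=np.random.default_rng(0))
print(sample, nfe)
```

In this sketch the number of network calls is bounded by the sequence length no matter how many sampling steps are used, because a new call happens only after at least one token has been revealed.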

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The time-independent view could let practitioners replace diffusion noise schedules with any-order autoregressive training while optimizing the same objective.
Caching suggests that inference cost scales with the number of token changes rather than total diffusion steps, which may favor long-sequence generation.
If the unification is tight, likelihood evaluation techniques from autoregressive models might transfer directly to diffusion models without extra machinery.

Load-bearing premise

The derivation assumes the standard absorbing-state transition kernel so that the ratio of marginal probabilities takes the stated analytic form at every timestep.

What would settle it

Compute the concrete score and the proposed conditional-probability expression on a small fixed vocabulary and check whether the equality holds for every timestep and every pair of states; any systematic mismatch would refute the claim.
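
A brute-force version of this check might look like the sketch below. It assumes that each token is absorbed independently with probability `m` by time t, so the time-dependent scalar is `(1 - m) / m`; the variable names, the toy sizes, and the random clean-data distribution are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

V, D = 3, 3   # toy vocabulary size and sequence length
MASK = V      # index of the absorbing [MASK] state

rng = np.random.default_rng(0)
p0 = rng.random((V,) * D)
p0 /= p0.sum()  # random clean-data distribution over V**D sequences

def p_t(x, m):
    """Marginal of noisy state x, summed over clean sequences via the kernel."""
    total = 0.0
    for x0 in itertools.product(range(V), repeat=D):
        w = p0[x0]
        for xi, x0i in zip(x, x0):
            w *= m if xi == MASK else (1 - m) * (xi == x0i)
        total += w
    return total

def clean_marginal(x):
    """p0 probability that clean data agrees with x on its unmasked positions."""
    index = tuple(slice(None) if tok == MASK else tok for tok in x)
    return p0[index].sum()

def max_identity_gap(m):
    gap = 0.0
    for x in itertools.product(range(V + 1), repeat=D):
        for i in range(D):
            if x[i] != MASK:
                continue
            for v in range(V):
                x_hat = x[:i] + (v,) + x[i + 1:]  # unmask position i to v
                score = p_t(x_hat, m) / p_t(x, m)  # concrete score
                conditional = clean_marginal(x_hat) / clean_marginal(x)
                gap = max(gap, abs(score - (1 - m) / m * conditional))
    return gap

gaps = [max_identity_gap(m) for m in (0.1, 0.5, 0.9)]
print(gaps)
```

Under these assumptions the gaps should be at floating-point noise level for every masking probability tried; a systematic mismatch would point to a different kernel or scalar than the one assumed here.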

Original abstract

Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval, which enables sampling acceleration. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The concrete score in absorbing diffusion factors as clean-data conditionals times an analytic time scalar, which lets them drop time conditioning and link the objective to AO-ARMs.

The punchline is that absorbing discrete diffusion's concrete score can be rewritten as the conditional distribution over clean tokens times a closed-form time-dependent scalar. This comes directly from the absorbing kernel and lets them drop time conditioning from the network. What stands out is the RADD architecture that follows from this, which supports caching the network output while the noisy sample is unchanged, reducing function evaluations during sampling. They also show that the diffusion training objective, an upper bound on negative log-likelihood, can be read as an expected negative log-likelihood for any-order autoregressive models, a unification that wasn't spelled out before.

The paper does well on the theory side: the identity is algebraic and holds for the standard transition probabilities without needing data assumptions or parameter fitting. Empirically they report better perplexity than prior diffusion models on five zero-shot language modeling benchmarks at GPT-2 scale, and the code is public so the numbers can be checked.

The soft spots are minor. The sampling acceleration via caching is claimed without detailed ablations on the schedule or wall-clock measurements, so the practical speedup isn't fully quantified. The unification with AO-ARMs is more of a reinterpretation than a new algorithm, though it's useful for understanding.

This paper is for researchers focused on generative models for discrete data like text. Someone looking for cleaner ways to parameterize discrete diffusion or links to autoregressive methods will find it relevant. It has enough formal grounding and reproducible results to merit a serious referee. I would send it out for peer review. The core contribution is verifiable from the math, and the experiments provide a reasonable test even if more speed details would strengthen the practical side.

Referee Report

2 major / 2 minor

Summary. The manuscript shows that the concrete score (the ratio of marginal probabilities between two transitive states) in standard absorbing discrete diffusion equals the conditional distribution p(x_0 | x_t) of the clean data multiplied by a closed-form, time-dependent scalar derived from the cumulative absorption probabilities. This identity motivates RADD, a time-unconditioned network that directly parametrizes the time-independent conditionals, permits output caching during intervals when the noisy sample is unchanged, and thereby reduces NFEs. The paper further unifies absorbing diffusion with any-order autoregressive models by recasting the diffusion NLL upper bound as an expected NLL under AO-AR sampling, and reports state-of-the-art perplexity among diffusion models on five zero-shot language-modeling benchmarks at GPT-2 scale, with code released.

Significance. If the algebraic identity holds, the work supplies a clarifying reparameterization that removes explicit time conditioning from absorbing diffusion while preserving correctness, yields a practical caching acceleration, and supplies a clean theoretical bridge to AO-ARMs. The released code and GPT-2-scale empirical results make the contribution immediately usable and reproducible.

major comments (2)

§5 (Experiments) and sampling-acceleration paragraph: the claim that caching reduces wall-clock time is supported only by NFE counts; no ablation on caching schedule (e.g., interval length or token-change threshold) or direct wall-clock timing against a non-cached baseline is provided, leaving the practical speedup unsubstantiated.
§4.2 (Unification with AO-ARMs): the statement that the diffusion NLL upper bound equals an expected NLL for AO-ARMs is asserted without an explicit derivation or small-scale verification; because this unification is presented as a conceptual contribution, a short proof sketch or numerical check would be required to confirm the equality holds under the paper’s absorbing kernel.

minor comments (2)

Abstract: the five zero-shot benchmarks are not named; listing them (e.g., WikiText, LAMBADA, etc.) would improve immediate readability.
§3 (Theoretical section), notation: the time-dependent scalar is introduced in closed form but its derivation from the ratio of marginals is compressed; expanding the ratio p_t(x'_t)/p_t(x_t) between two transitive states x'_t and x_t step-by-step would aid readers unfamiliar with the absorbing kernel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly.

Referee: §5 (Experiments) and sampling-acceleration paragraph: the claim that caching reduces wall-clock time is supported only by NFE counts; no ablation on caching schedule (e.g., interval length or token-change threshold) or direct wall-clock timing against a non-cached baseline is provided, leaving the practical speedup unsubstantiated.

Authors: We thank the referee for this observation. The current manuscript demonstrates that RADD enables output caching whenever the noisy token is unchanged within a sampling interval, which directly lowers the number of network evaluations. While NFEs provide a standard and architecture-independent proxy for the computational saving, we agree that wall-clock timings and ablations on caching parameters (interval length, change threshold) would give a more complete picture of practical speedup. In the revised manuscript we will add these results, reporting wall-clock time on the same hardware for cached versus non-cached sampling together with a short sensitivity study on the caching schedule.

Revision: yes

Referee: §4.2 (Unification with AO-ARMs): the statement that the diffusion NLL upper bound equals an expected NLL for AO-ARMs is asserted without an explicit derivation or small-scale verification; because this unification is presented as a conceptual contribution, a short proof sketch or numerical check would be required to confirm the equality holds under the paper’s absorbing kernel.

Authors: We agree that an explicit derivation strengthens the conceptual contribution. The claimed equality follows from rewriting the diffusion ELBO under the absorbing transition kernel as an expectation of per-token negative log-likelihoods taken with respect to the any-order autoregressive sampling distribution induced by the same kernel. In the revision we will insert a concise proof sketch in §4.2 that starts from the marginal transition probabilities, substitutes the concrete-score reparameterization, and arrives at the expected AO-ARM NLL. We will also include a small-scale numerical check on a synthetic sequence dataset to verify that the two quantities match within sampling error.

Revision: yes
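
A small-scale check of this kind could be sketched as follows. The sketch assumes the diffusion bound can be written in a masking-ratio form, i.e. an integral over the mask probability `lam` of `1/lam` times the expected sum of per-token negative log-likelihoods at masked positions; this form and all names below are assumptions for illustration, and the table of log-likelihoods is random rather than produced by a trained model.

```python
import itertools
import math
import numpy as np

D = 4  # toy sequence length
positions = range(D)
rng = np.random.default_rng(0)

# Hypothetical table: nll[(i, U)] = -log q(x0^i | tokens at positions U),
# for one fixed clean sequence x0 and every unmasked set U not containing i.
nll = {}
for r in range(D):
    for U in itertools.combinations(positions, r):
        for i in positions:
            if i not in U:
                nll[(i, frozenset(U))] = -math.log(rng.uniform(0.05, 1.0))

def ao_arm_nll():
    """Expected NLL over uniformly random generation orders."""
    perms = list(itertools.permutations(positions))
    total = 0.0
    for perm in perms:
        for step, i in enumerate(perm):
            total += nll[(i, frozenset(perm[:step]))]
    return total / len(perms)

def diffusion_bound(num_grid=20000):
    """Assumed masking-ratio form, integrated over lam with a midpoint rule."""
    lam = (np.arange(num_grid) + 0.5) / num_grid
    integrand = np.zeros_like(lam)
    for k in range(1, D + 1):
        for M in itertools.combinations(positions, k):
            U = frozenset(positions) - frozenset(M)
            inner = sum(nll[(i, U)] for i in M)
            # P(mask set = M) / lam = lam**(k-1) * (1-lam)**(D-k)
            integrand += lam ** (k - 1) * (1 - lam) ** (D - k) * inner
    return integrand.mean()

ao_value, diff_value = ao_arm_nll(), diffusion_bound()
print(ao_value, diff_value)
```

The two values agree here because the integral of `lam**(k-1) * (1-lam)**(D-k)` equals `(k-1)! (D-k)! / D!`, which is also the fraction of orderings that reveal a given unmasked set first and a given position next.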

Circularity Check

0 steps flagged

No significant circularity; derivation is algebraic identity from absorbing kernel

The paper's central identity follows directly from applying Bayes' rule to the standard absorbing Markov chain transition kernel. For an absorbed position, p(x_0 | x_t) reduces to the clean-data conditional given the unmasked tokens; for an unabsorbed position it is a delta on the observed token. Neither depends on t, so the concrete score factors as this conditional times a closed-form t-dependent scalar from cumulative absorption probabilities. No parameters are fitted to data for the identity, no self-citation chains are load-bearing for the core claim, and the ratio form of marginals is definitional for the process. The subsequent RADD reparameterization, NFE caching, and unification with AO-ARMs are consequences of this identity rather than circular inputs. The derivation is self-contained and externally verifiable from the kernel definition alone.
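
The factorization described here can be written out in a few lines. The following is a reconstruction under stated assumptions, not a quotation from the paper: each of the d tokens is assumed to be absorbed independently with probability m_t = 1 - e^{-\bar\sigma(t)} by time t, x_t has k masked positions including position i, \hat{x}_t equals x_t except that position i holds a clean token v, and x_t^{\mathrm{UM}} denotes the unmasked tokens of x_t. All symbols are notation chosen for this sketch.

```latex
\begin{align*}
p_t(x_t) &= m_t^{\,k}\,(1-m_t)^{\,d-k}\; p_0\!\left(x_t^{\mathrm{UM}}\right), \\
p_t(\hat{x}_t) &= m_t^{\,k-1}\,(1-m_t)^{\,d-k+1}\; p_0\!\left(x_t^{\mathrm{UM}},\, x^i = v\right), \\
\frac{p_t(\hat{x}_t)}{p_t(x_t)}
  &= \frac{1-m_t}{m_t}\; p_0\!\left(x^i = v \,\middle|\, x_t^{\mathrm{UM}}\right)
   = \frac{e^{-\bar\sigma(t)}}{1-e^{-\bar\sigma(t)}}\; p_0\!\left(x^i = v \,\middle|\, x_t^{\mathrm{UM}}\right).
\end{align*}
```

All dependence on t sits in the prefactor, while the conditional involves only the clean-data distribution, which is the property that lets a network without time conditioning represent it.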

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The derivation rests on the standard absorbing diffusion transition kernel and the existence of the marginal probabilities at each timestep; no new free parameters are introduced beyond those already present in prior discrete diffusion models.

axioms (1)

Domain assumption: The forward process uses the standard absorbing-state transition kernel with a fixed absorbing token.
Invoked when rewriting the ratio of marginals as the conditional probability of clean data.

pith-pipeline@v0.9.0 · 5775 in / 1348 out tokens · 27795 ms · 2026-05-18T12:00:07.982792+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 7.0

The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.
Support Before Frequency in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...
Computer-Aided Design Generation by Cascaded Discrete Diffusion Model
cs.CV 2026-05 unverdicted novelty 7.0

Cascaded discrete diffusion generates CAD command sequences with absorbing transitions and parameters with Gaussian, scale-invariant, and prior-preserving kernels, outperforming autoregressive and continuous diffusion...
Hierarchical Codec Diffusion for Video-to-Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
cs.CL 2026-04 unverdicted novelty 7.0

LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
MemDLM: Memory-Enhanced DLM Training
cs.CL 2026-03 unverdicted novelty 7.0

MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
cs.CL 2025-05 conditional novelty 7.0

Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 6.0

PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 6.0

TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
Edit-Based Refinement for Parallel Masked Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 6.0

ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
TextLDM: Language Modeling with Continuous Latent Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
Continuous Latent Diffusion Language Model
cs.CL 2026-05 unverdicted novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
Differences in Text Generated by Diffusion and Autoregressive Language Models
cs.CL 2026-04 unverdicted novelty 6.0

DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
cs.CL 2025-12 unverdicted novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
Diffusion Language Models Know the Answer Before Decoding
cs.CL 2025-08 conditional novelty 6.0

DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
cs.CL 2025-08 unverdicted novelty 6.0

Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
cs.LG 2025-05 unverdicted novelty 6.0

Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image mo...
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
cs.CL 2024-10 conditional novelty 6.0

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on langua...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 22 Pith papers · 3 internal anchors

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[3]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[4]

Advances in Neural Information Processing Systems , year=

A Continuous Time Framework for Discrete Denoising Models , author=. Advances in Neural Information Processing Systems , year=

work page
[5]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[6]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

All are worth words: A vit backbone for diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[7]

Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models , author=. arXiv preprint arXiv:2405.04233 , year=

work page arXiv
[8]

International Conference on Machine Learning , year=

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale , author=. International Conference on Machine Learning , year=

work page
[9]

2024 , eprint=

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution , author=. 2024 , eprint=

work page 2024
[10]

Neural Information Processing Systems , year=

Attention is All you Need , author=. Neural Information Processing Systems , year=

work page
[11]

Advances in Neural Information Processing Systems , year=

Structured Denoising Diffusion Models in Discrete State-Spaces , author=. Advances in Neural Information Processing Systems , year=

work page
[12]

International Conference on Computer Vision , year=

Scalable Diffusion Models with Transformers , author=. International Conference on Computer Vision , year=

work page
[13]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

work page 2019
[14]

Neurocomputing , year=

RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. Neurocomputing , year=

work page
[15]

2023 , eprint=

LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

work page 2023
[16]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[17]

Language Models are Few-Shot Learners , url =

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

work page
[18]

ArXiv , year=

Training Compute-Optimal Large Language Models , author=. ArXiv , year=

work page
[19]

2024 , eprint=

GPT-4 Technical Report , author=. 2024 , eprint=

work page 2024
[20]

2023 , eprint=

Concrete Score Matching: Generalized Score Matching for Discrete Data , author=. 2023 , eprint=

work page 2023
[21]

Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022

Self-conditioned Embedding Diffusion for Text Generation , author=. arXiv preprint arXiv:2211.04236 , year=

work page arXiv
[22]

2024 , eprint=

Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling , author=. 2024 , eprint=

work page 2024
[23]

2023 , eprint=

Reflected Diffusion Models , author=. 2023 , eprint=

work page 2023
[24]

2022 , eprint=

Categorical SDEs with Simplex Diffusion , author=. 2022 , eprint=

work page 2022
[25]

2024 , eprint=

TESS: Text-to-Text Self-Conditioned Simplex Diffusion , author=. 2024 , eprint=

work page 2024
[26]

2024 , eprint=

Bayesian Flow Networks , author=. 2024 , eprint=

work page 2024
[27]

2022 , eprint=

Diffusion-LM Improves Controllable Text Generation , author=. 2022 , eprint=

work page 2022
[28]

2022 , eprint=

Self-conditioned Embedding Diffusion for Text Generation , author=. 2022 , eprint=

work page 2022
[29]

2022 , eprint=

DiffusER: Discrete Diffusion via Edit-based Reconstruction , author=. 2022 , eprint=

work page 2022
[30]

2022 , eprint=

Continuous diffusion for categorical data , author=. 2022 , eprint=

work page 2022
[31]

2023 , eprint=

Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise , author=. 2023 , eprint=

work page 2023
[32]

2023 , eprint=

Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning , author=. 2023 , eprint=

work page 2023
[33]

2023 , eprint=

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation , author=. 2023 , eprint=

work page 2023
[34]

Advances in Neural Information Processing Systems , volume=

Mauve: Measuring the gap between neural text and human text using divergence frontiers , author=. Advances in Neural Information Processing Systems , volume=

work page
[35]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

PaLM 2 Technical Report

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

work page
[38]

International Conference on Learning Representations , year=

Score-Based Generative Modeling through Stochastic Differential Equations , author=. International Conference on Learning Representations , year=

work page
[39]

Proceedings of the 32nd International Conference on Machine Learning , year =

Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author =. Proceedings of the 32nd International Conference on Machine Learning , year =

work page
[40]

Advances in Neural Information Processing Systems , volume=

Argmax flows and multinomial diffusion: Learning categorical distributions , author=. Advances in Neural Information Processing Systems , volume=

work page
[41]

ArXiv , year=

A Reparameterized Discrete Diffusion Model for Text Generation , author=. ArXiv , year=

work page
[42]

Fast sampling via de- randomization for discrete diffusion models.arXiv preprint arXiv:2312.09193, 2023

Fast Sampling via De-randomization for Discrete Diffusion Models , author=. arXiv preprint arXiv:2312.09193 , year=

work page arXiv
[43]

Dinoiser: Diffused conditional se- quence learning by manipulating noises.arXiv preprint arXiv:2302.10025, 2023

Dinoiser: Diffused conditional sequence learning by manipulating noises , author=. arXiv preprint arXiv:2302.10025 , year=

work page arXiv
[44]

ArXiv , year=

Improving and Unifying Discrete&Continuous-time Discrete Denoising Diffusion , author=. ArXiv , year=

work page
[45]

Gritsenko and Jasmijn Bastings and Ben Poole and Rianne van den Berg and Tim Salimans , title =

Emiel Hoogeboom and Alexey A. Gritsenko and Jasmijn Bastings and Ben Poole and Rianne van den Berg and Tim Salimans , title =. 10th International Conference on Learning Representations , year =

work page
[46]

Proceedings of the 31th International Conference on Machine Learning , year =

Benigno Uria and Iain Murray and Hugo Larochelle , title =. Proceedings of the 31th International Conference on Machine Learning , year =

work page
[47]

Variational Diffusion Models , volume =

Kingma, Diederik and Salimans, Tim and Poole, Ben and Ho, Jonathan , booktitle =. Variational Diffusion Models , volume =

work page
[48]

Proceedings of the 31th International Conference on Machine Learning , year=

Training and Inference on Any-Order Autoregressive Models the Right Way , author=. Proceedings of the 31th International Conference on Machine Learning , year=

work page
[49]

International conference on machine learning , pages=

Deep unsupervised learning using nonequilibrium thermodynamics , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015
[50]

2012 , publisher=

Continuous-time Markov chains: An applications-oriented approach , author=. 2012 , publisher=

work page 2012
[51]

Reversibility and stochastic networks / F.P

Frank Kelly , year =. Reversibility and stochastic networks / F.P. Kelly , volume =. SERBIULA (sistema Librum 2.0) , doi =

work page
[52]

The Eleventh International Conference on Learning Representations , year=

Score-based Continuous-time Discrete Diffusion Models , author=. The Eleventh International Conference on Learning Representations , year=

work page
[53]

OpenWebText Corpus , author=

work page
[54] Paperno, Denis and Kruszewski, Germ\'. The. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
[55] Pointer Sentinel Mixture Models. International Conference on Learning Representations.
[56] Building a Large Annotated Corpus of English: The Penn Treebank. Comput. Linguistics.
[57] One billion word benchmark for measuring progress in statistical language modeling. Interspeech.
[58] Likelihood-Based Diffusion Language Models. Advances in Neural Information Processing Systems.
[59] Score-based Continuous-time Discrete Diffusion Models. 2023.
[60] Unifying Bayesian Flow Networks and Diffusion Models through Stochastic Differential Equations. 2024.
[61] Language models are unsupervised multitask learners. OpenAI blog.
[62] Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. DiffusionBERT: Improving generative masked language models with diffusion models. arXiv preprint arXiv:2211.15029.
[63] Improving language understanding by generative pre-training. 2018.
[64] Language models are few-shot learners. Advances in Neural Information Processing Systems.
[65] Attention is all you need. Advances in Neural Information Processing Systems.
[66] The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288.
[67] OpenAI. OpenAI blog.
[68] Denoising Diffusion Implicit Models. International Conference on Learning Representations.
[69] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-
[70] DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems.
[71] DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models. arXiv preprint arXiv:2211.01095.
[72] Fast Sampling of Diffusion Models with Exponential Integrator. International Conference on Learning Representations.
[73] Are we falling in a middle-intelligence trap? An analysis and mitigation of the reversal curse. arXiv preprint arXiv:2311.07468.
[74] Fast Sampling via Discrete Non-Markov Diffusion Models. 2024.
[75] Simplified and Generalized Masked Diffusion for Discrete Data. 2024.
[76] Simple and Effective Masked Diffusion Language Models. 2024.
[77] Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design. 2024.

