Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
Pith reviewed 2026-05-18 12:00 UTC · model grok-4.3
The pith
The concrete score in absorbing discrete diffusion equals conditional probabilities of clean data multiplied by an analytic time-dependent scalar.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, the authors propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Built upon the new perspective of conditional distributions, they further unify absorbing discrete diffusion and any-order autoregressive models, showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs.
What carries the argument
The concrete score, defined as the ratio of marginal probabilities between two transitive states, which the paper rewrites in closed form as the product of clean-data conditionals and a known time-dependent multiplier.
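A sketch of how such a closed form can arise, under assumptions that are not stated on this page: each of the d tokens is masked independently with probability m_t, the noisy sequence x_t has k masked positions including position i, \hat{x}_t equals x_t except that position i holds a clean token \hat{x}^i, and x_t^{UM} denotes the unmasked tokens of x_t. The symbols m_t, k and d are introduced here for illustration only; the paper's own notation may differ.

```latex
\begin{align*}
p_t(x_t) &= m_t^{k}\,(1-m_t)^{d-k}\; p_0\!\left(x_t^{\mathrm{UM}}\right), \\
p_t(\hat{x}_t) &= m_t^{k-1}\,(1-m_t)^{d-k+1}\; p_0\!\left(x_t^{\mathrm{UM}},\, \hat{x}^{i}\right), \\
\frac{p_t(\hat{x}_t)}{p_t(x_t)} &= \frac{1-m_t}{m_t}\; p_0\!\left(\hat{x}^{i} \,\middle|\, x_t^{\mathrm{UM}}\right).
\end{align*}
```

Under these assumptions the factor (1 - m_t)/m_t depends only on time, while the remaining factor is a conditional of the clean data that does not depend on t.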
If this is right
- RADD trains a network that outputs only time-independent conditional probabilities of clean data.
- Sampling accelerates by reusing the cached network output whenever the noisy sample is unchanged across a sampling interval, which lowers the number of function evaluations (NFEs).
- The diffusion training objective supplies an upper bound that equals the expected negative log-likelihood of any-order autoregressive models.
- RADD reaches state-of-the-art perplexity among diffusion models on five zero-shot language-modeling benchmarks at GPT-2 scale.
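The caching consequence listed above can be illustrated with a toy sampler. This is a minimal sketch, not the released RADD code: `sample_with_cache`, `uniform_cond`, the `MASK` id and the linear schedule (a token is masked at time t with probability t) are all assumptions made for the example, and masked positions are filled independently from the cached conditionals.

```python
import random

MASK = -1  # hypothetical id for the absorbing (mask) token


def uniform_cond(x, vocab_size=4):
    """Stand-in for a time-independent network: uniform conditionals per position."""
    return [[1.0 / vocab_size] * vocab_size for _ in x]


def sample_with_cache(cond_fn, length, steps, rng):
    """Toy reverse absorbing diffusion that re-evaluates cond_fn only after x changes.

    Assumes a linear schedule, so a token masked at time t stays masked at
    time s < t with probability s / t. Returns (sequence, network evaluations).
    """
    x = [MASK] * length
    cached = None  # last network output, valid while x is unchanged
    nfe = 0
    for n in range(steps, 0, -1):
        if MASK not in x:
            break  # nothing left to unmask
        t, s = n / steps, (n - 1) / steps
        if cached is None:
            cached = cond_fn(x)  # the only place the network is called
            nfe += 1
        changed = False
        for i, tok in enumerate(x):
            if tok != MASK or rng.random() < s / t:
                continue  # already clean, or stays masked in this interval
            probs = cached[i]
            x[i] = rng.choices(range(len(probs)), weights=probs)[0]
            changed = True
        if changed:
            cached = None  # x changed, so the cache is stale
    return x, nfe


sample, evaluations = sample_with_cache(uniform_cond, 8, 1000, random.Random(0))
print(sample, evaluations)
```

In this sketch the number of network evaluations cannot exceed the sequence length, however many steps are used, because the network is called again only after at least one token has been unmasked.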
Where Pith is reading between the lines
- The time-independent view could let practitioners replace diffusion schedules with any-order autoregressive training while preserving the same training objective.
- Caching suggests that inference cost scales with the number of token changes rather than total diffusion steps, which may favor long-sequence generation.
- If the unification is tight, likelihood evaluation techniques from autoregressive models might transfer directly to diffusion models without extra machinery.
Load-bearing premise
The derivation assumes the standard absorbing-state transition kernel so that the ratio of marginal probabilities takes the stated analytic form at every timestep.
What would settle it
Compute the concrete score and the proposed conditional-probability expression on a small fixed vocabulary and check whether the equality holds for every timestep and every pair of states; any systematic mismatch would refute the claim.
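The check described above can be sketched by brute force. The following is a minimal illustration under assumptions of its own: a two-token clean vocabulary, sequences of length 3, a random clean distribution, independent masking with probability `m`, and a scalar of the form (1 - m)/m. All function names and constants are hypothetical and do not come from the paper.

```python
import itertools
import random

MASK = 2        # hypothetical id for the absorbing (mask) token
VOCAB = (0, 1)  # toy clean vocabulary
D = 3           # toy sequence length


def random_clean_distribution(seed=0):
    """Arbitrary full-support distribution p_0 over clean sequences."""
    rng = random.Random(seed)
    seqs = list(itertools.product(VOCAB, repeat=D))
    weights = [rng.random() + 0.1 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(p0, xt, m):
    """p_t(xt) when each token is masked independently with probability m."""
    total = 0.0
    for x0, prob in p0.items():
        like = 1.0
        for a, b in zip(xt, x0):
            if a == MASK:
                like *= m
            elif a == b:
                like *= 1.0 - m
            else:
                like = 0.0
                break
        total += prob * like
    return total


def clean_conditional(p0, xt, i, v):
    """p_0(x^i = v | unmasked tokens of xt), by direct marginalisation."""
    def consistent(x0):
        return all(a == MASK or a == b for a, b in zip(xt, x0))
    denom = sum(p for x0, p in p0.items() if consistent(x0))
    numer = sum(p for x0, p in p0.items() if consistent(x0) and x0[i] == v)
    return numer / denom


def max_identity_gap(p0, mask_probs=(0.1, 0.5, 0.9)):
    """Largest |concrete score - scalar * conditional| over all times and state pairs."""
    gap = 0.0
    for m in mask_probs:
        for xt in itertools.product(VOCAB + (MASK,), repeat=D):
            for i, tok in enumerate(xt):
                if tok != MASK:
                    continue
                for v in VOCAB:
                    xhat = xt[:i] + (v,) + xt[i + 1:]
                    score = marginal(p0, xhat, m) / marginal(p0, xt, m)
                    predicted = (1.0 - m) / m * clean_conditional(p0, xt, i, v)
                    gap = max(gap, abs(score - predicted))
    return gap


print(max_identity_gap(random_clean_distribution()))
```

A gap at the level of floating-point rounding is consistent with the identity on this toy problem; a gap that persists across seeds and masking probabilities would count against it.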
Original abstract
Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval, which enables sampling acceleration. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript shows that the concrete score (the ratio of marginal probabilities between two transitive states) in standard absorbing discrete diffusion equals the conditional distribution p(x_0 | x_t) of the clean data multiplied by a closed-form, time-dependent scalar derived from the cumulative absorption probabilities. This identity motivates RADD, a time-unconditioned network that directly parametrizes the time-independent conditionals, permits output caching during intervals in which the noisy sample is unchanged, and thereby reduces NFEs. The paper further unifies absorbing diffusion with any-order autoregressive models by recasting the diffusion NLL upper bound as an expected NLL for AO-ARMs, and reports state-of-the-art perplexity among diffusion models on five zero-shot language-modeling benchmarks at GPT-2 scale, with code released.
Significance. If the algebraic identity holds, the work supplies a clarifying reparameterization that removes explicit time conditioning from absorbing diffusion while preserving correctness, yields a practical caching acceleration, and supplies a clean theoretical bridge to AO-ARMs. The released code and GPT-2-scale empirical results make the contribution immediately usable and reproducible.
major comments (2)
- [§5 (Experiments)] §5 (Experiments) and sampling-acceleration paragraph: the claim that caching reduces wall-clock time is supported only by NFE counts; no ablation on caching schedule (e.g., interval length or token-change threshold) or direct wall-clock timing against a non-cached baseline is provided, leaving the practical speedup unsubstantiated.
- [§4.2 (Unification)] §4.2 (Unification with AO-ARMs): the statement that the diffusion NLL upper bound equals an expected NLL for AO-ARMs is asserted without an explicit derivation or small-scale verification; because this unification is presented as a conceptual contribution, a short proof sketch or numerical check would be required to confirm the equality holds under the paper’s absorbing kernel.
minor comments (2)
- [Abstract] Abstract: the five zero-shot benchmarks are not named; listing them (e.g., WikiText, LAMBADA) would improve immediate readability.
- [§3 (Theoretical section)] Notation in §3: the time-dependent scalar is introduced in closed form but its derivation from the ratio of marginals is compressed; expanding the ratio p(x_t | x_0)/p(x_t) step-by-step would aid readers unfamiliar with the absorbing kernel.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly.
read point-by-point responses
- Referee: [§5 (Experiments)] §5 (Experiments) and sampling-acceleration paragraph: the claim that caching reduces wall-clock time is supported only by NFE counts; no ablation on caching schedule (e.g., interval length or token-change threshold) or direct wall-clock timing against a non-cached baseline is provided, leaving the practical speedup unsubstantiated.
  Authors: We thank the referee for this observation. The current manuscript demonstrates that RADD enables output caching whenever the noisy token is unchanged within a sampling interval, which directly lowers the number of network evaluations. While NFEs provide a standard and architecture-independent proxy for the computational saving, we agree that wall-clock timings and ablations on caching parameters (interval length, change threshold) would give a more complete picture of practical speedup. In the revised manuscript we will add these results, reporting wall-clock time on the same hardware for cached versus non-cached sampling together with a short sensitivity study on the caching schedule.
  Revision: yes
- Referee: [§4.2 (Unification)] §4.2 (Unification with AO-ARMs): the statement that the diffusion NLL upper bound equals an expected NLL for AO-ARMs is asserted without an explicit derivation or small-scale verification; because this unification is presented as a conceptual contribution, a short proof sketch or numerical check would be required to confirm the equality holds under the paper’s absorbing kernel.
  Authors: We agree that an explicit derivation strengthens the conceptual contribution. The claimed equality follows from rewriting the diffusion ELBO under the absorbing transition kernel as an expectation of per-token negative log-likelihoods taken with respect to the any-order autoregressive sampling distribution induced by the same kernel. In the revision we will insert a concise proof sketch in §4.2 that starts from the marginal transition probabilities, substitutes the concrete-score reparameterization, and arrives at the expected AO-ARM NLL. We will also include a small-scale numerical check on a synthetic sequence dataset to verify that the two quantities match within sampling error.
  Revision: yes
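A small-scale check of the kind discussed in this exchange might look as follows. It is only a sketch under stated assumptions: the diffusion bound is taken in a masking-probability form, the integral over lam in (0, 1) of (1/lam) times the expected sum of per-token negative log-likelihoods at masked positions, where each token is masked independently with probability lam. The model is an arbitrary random table of conditionals, and every name and constant is hypothetical rather than taken from the paper or its code.

```python
import itertools
import math
import random

MASK = -1       # hypothetical absorbing-token id
VOCAB_SIZE = 2  # toy vocabulary
D = 3           # toy sequence length


def make_model(seed=0):
    """Arbitrary table q(x^i = v | context), keyed by (context, masked position i)."""
    rng = random.Random(seed)
    symbols = (MASK,) + tuple(range(VOCAB_SIZE))
    table = {}
    for ctx in itertools.product(symbols, repeat=D):
        for i in range(D):
            if ctx[i] == MASK:
                w = [rng.random() + 0.1 for _ in range(VOCAB_SIZE)]
                norm = sum(w)
                table[(ctx, i)] = [v / norm for v in w]
    return table


def masked_nll(model, x0, masked):
    """Sum of -log q(x0^i | unmasked part of x0) over the masked positions."""
    ctx = tuple(MASK if i in masked else tok for i, tok in enumerate(x0))
    return sum(-math.log(model[(ctx, i)][x0[i]]) for i in masked)


def diffusion_bound(model, x0, grid=20000):
    """Midpoint-rule estimate of the assumed masking-probability form of the bound."""
    subsets = [s for r in range(1, D + 1)
               for s in itertools.combinations(range(D), r)]
    nll = {s: masked_nll(model, x0, s) for s in subsets}
    total = 0.0
    for n in range(grid):
        lam = (n + 0.5) / grid
        expect = sum(lam ** len(s) * (1 - lam) ** (D - len(s)) * nll[s]
                     for s in subsets)
        total += expect / lam / grid
    return total


def ao_arm_nll(model, x0):
    """Negative log-likelihood averaged over all generation orders."""
    perms = list(itertools.permutations(range(D)))
    total = 0.0
    for perm in perms:
        for step, i in enumerate(perm):
            seen = perm[:step]  # positions already generated in this order
            ctx = tuple(tok if j in seen else MASK for j, tok in enumerate(x0))
            total += -math.log(model[(ctx, i)][x0[i]])
    return total / len(perms)


model = make_model()
for x0 in itertools.product(range(VOCAB_SIZE), repeat=D):
    print(x0, diffusion_bound(model, x0), ao_arm_nll(model, x0))
```

Because the table of conditionals is arbitrary, agreement of the two columns on this toy problem would reflect the weighting over masked sets rather than any property of a trained model.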
Circularity Check
No significant circularity; the derivation is an algebraic identity that follows from the absorbing kernel.
full rationale
The paper's central identity follows directly from applying Bayes' rule to the standard absorbing Markov chain transition kernel. For each absorbed token this yields a conditional of the clean data given the unmasked tokens (or a delta for tokens that are not absorbed), independent of t, and the concrete score then factors as this conditional times a closed-form t-dependent scalar from the cumulative absorption probabilities. No parameters are fitted to data for the identity, no self-citation chains are load-bearing for the core claim, and the ratio form of marginals is definitional for the process. The subsequent RADD reparameterization, NFE caching, and unification with AO-ARMs are consequences of this identity rather than circular inputs. The derivation is self-contained and externally verifiable from the kernel definition alone.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the forward process uses the standard absorbing-state transition kernel with a fixed absorbing token.
Forward citations
Cited by 24 Pith papers
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
  The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.
- Support Before Frequency in Discrete Diffusion
  Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...
- Computer-Aided Design Generation by Cascaded Discrete Diffusion Model
  Cascaded discrete diffusion generates CAD command sequences with absorbing transitions and parameters with Gaussian, scale-invariant, and prior-preserving kernels, outperforming autoregressive and continuous diffusion...
- Hierarchical Codec Diffusion for Video-to-Speech Generation
  HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
  LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
- MemDLM: Memory-Enhanced DLM Training
  MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
  CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
  Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
- PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
  PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
- Self-Supervised On-Policy Distillation for Reasoning Language Models
  SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
- Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
  Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
- Edit-Based Refinement for Parallel Masked Diffusion Language Models
  ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
- TextLDM: Language Modeling with Continuous Latent Diffusion
  TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
- Differences in Text Generated by Diffusion and Autoregressive Language Models
  DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
- Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
  Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
- Diffusion Language Models Know the Answer Before Decoding
  DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
- Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
  Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
  Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image mo...
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
  LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
  Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on langua...