arxiv: 2510.03206 · v2 · submitted 2025-10-03 · 💻 cs.AI · cs.CL

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Pith reviewed 2026-05-18 10:12 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords diffusion language models · continuous diffusion · discrete diffusion · latent reasoning · multimodal diffusion · language modeling · coevolutionary diffusion

The pith

Continuous diffusion models gain stronger expressivity for language by jointly diffusing with discrete tokens in one model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that continuous diffusion models are theoretically more expressive than discrete diffusion or looped transformers for language tasks. In practice, however, continuous models struggle to decode back to discrete tokens, limiting their performance. The authors therefore define a joint multimodal diffusion process that runs a single model across the union of a continuous representation space and a discrete token space. This coevolutionary setup supplies intermediate supervision from the continuous side while retaining the training stability and sample quality of explicit tokens. Experiments on real-world language modeling tasks confirm the combined benefits.

Core claim

Continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers, but practical decoding difficulties limit them; a joint multimodal diffusion process on the union of continuous representation and discrete token spaces, handled by one model, resolves the tension by delivering both rich latent semantics and good trainability.

What carries the argument

The Coevolutionary Continuous Discrete Diffusion (CCDD) process, a joint multimodal diffusion defined on the union of continuous representation space and discrete token space that lets one model denoise both modalities simultaneously.
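
As a rough illustration of what jointly diffusing two modalities can look like, the sketch below corrupts a token sequence by masking and a per-token latent by Gaussian noise at the same time step. The linear schedules, the `MASK` id, and the function name `joint_forward` are assumptions made for this example; the paper's actual schedules and parameterization may differ.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token

def joint_forward(tokens, latents, t, rng):
    """Corrupt both modalities at time t in [0, 1] (illustrative only).

    tokens:  list of int token ids (discrete modality)
    latents: list of per-token float vectors (continuous modality)
    """
    # Assumed schedules: the signal level decays linearly from 1 to 0.
    alpha = 1.0 - t   # continuous signal level
    keep = 1.0 - t    # probability that a token stays unmasked

    # Discrete side: independently replace each token with MASK.
    noisy_tokens = [x if rng.random() < keep else MASK for x in tokens]

    # Continuous side: variance-preserving Gaussian corruption,
    # z_t = sqrt(alpha) * z_0 + sqrt(1 - alpha) * eps.
    noisy_latents = [
        [math.sqrt(alpha) * z + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)
         for z in vec]
        for vec in latents
    ]
    return noisy_tokens, noisy_latents

rng = random.Random(0)
tokens = [5, 17, 42, 8]
latents = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.9], [0.0, 0.4]]
x_t, z_t = joint_forward(tokens, latents, t=0.5, rng=rng)
```

A single denoiser would then take the pair `(x_t, z_t)` and the time `t` as input and predict the clean tokens and clean latents together, which is the "one model, two modalities" idea described above.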

If this is right

  • Continuous diffusion supplies intermediate supervision that looped transformers lack.
  • The joint process combines rich semantics in the latent space with explicit discrete tokens for better trainability.
  • Advanced architectures and training techniques enable the single-model joint denoising without modality collapse.
  • Empirical results on real-world tasks demonstrate improved language modeling performance over prior diffusion approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could generalize to other settings that mix continuous embeddings with discrete symbols, such as code or structured data generation.
  • It suggests that future diffusion language models may no longer need separate mechanisms for latent reasoning and token prediction.
  • Scaling the joint process might reveal whether the expressivity advantage grows with model size or sequence length.

Load-bearing premise

A single model can simultaneously and effectively denoise in both the continuous representation space and the discrete token space without one modality dominating training or degrading sample quality.

What would settle it

A direct comparison experiment measuring whether the joint CCDD model produces lower perplexity and higher sample quality than pure continuous diffusion, pure discrete diffusion, or looped transformers on a standard language modeling benchmark.
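
Because the proposed test hinges on perplexity, it helps to fix the definition: perplexity is the exponential of the average per-token negative log-likelihood. Diffusion language models typically report a variational bound rather than an exact likelihood, in which case the resulting number is an upper bound on perplexity. The function name and the toy numbers below are illustrative only and are not results from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy check: a model that assigns probability 1/4 to every token
# has perplexity 4 (up to floating-point rounding).
nlls = [math.log(4.0)] * 10
ppl = perplexity(nlls)
```

A comparison of the kind described above would compute this quantity for each model on the same held-out tokens, using the same tokenizer, so that the numbers are directly comparable.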

Figures

Figures reproduced from arXiv: 2510.03206 by Cai Zhou, Chenxiao Yang, Chenyu Wang, Chubin Zhang, Dinghuai Zhang, Lester Mackey, Muhan Zhang, Stephen Bates, Tommi Jaakkola, Yi Hu.

Figure 1: Comparison of theoretical expressiveness and practical trainability of: discrete diffusion …

Figure 2: Comparison of validation losses when using representations from different layers of Qwen3-Embedding-0.6B as the latent spaces for CDMs. Intriguingly, despite their strong theoretical expressiveness, previous looped transformers tend to exhibit limited empirical performance compared with SOTA LLMs. Meanwhile, continuous DLMs typically underperform their discrete counterparts, contradicting to the expressi…

Figure 3: Framework of Coevolutionary Continuous Discrete Diffusion.

Figure 4: Comparison of different denoising network architectures for CCDD.
Original abstract

Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that continuous diffusion models possess stronger expressivity than discrete diffusion models and looped transformers. It attributes the observed empirical underperformance of continuous approaches to trainability challenges when decoding from continuous representation space back to discrete tokens. To resolve this, the authors introduce Coevolutionary Continuous Discrete Diffusion (CCDD), which performs a joint multimodal diffusion process over the union of a continuous latent space and a discrete token space using a single shared model for simultaneous denoising. The manuscript further describes supporting architectures, training, and sampling techniques, and reports strong empirical results on language modeling tasks.

Significance. If the expressivity proof is rigorous and the joint coevolutionary process demonstrably balances the two modalities without gradient dominance or degraded sample quality, the work could meaningfully advance diffusion language models by enabling richer latent reasoning while retaining discrete anchoring. The approach offers a concrete mechanism to combine semantic density in continuous space with explicit token supervision, which is a load-bearing contribution if the experiments confirm it.

major comments (3)
  1. [Expressivity proof (likely §3)] The central claim that continuous diffusion has strictly stronger expressivity than discrete diffusion and looped transformers is load-bearing for the motivation of CCDD, yet the abstract and early sections provide no derivation details, key equations, or explicit comparison metrics. A concrete proof sketch (e.g., in §3 or the appendix) showing the precise sense in which expressivity is stronger, including any assumptions on the diffusion schedules, is required.
  2. [CCDD joint process definition] The weakest assumption—that a single model can jointly denoise the continuous representation space and discrete token space without one modality dominating gradients or harming sample quality—is not yet shown to be resolved by the coevolutionary design. The loss formulation, any balancing coefficients, or gradient-norm analysis that prevents continuous dominance must be specified and validated.
  3. [Experiments section] Empirical claims of strong performance are central but unsupported in the visible material: no baselines (e.g., masked discrete diffusion or looped transformers), no quantitative metrics (perplexity, generation quality), no error analysis, and no ablation on the coevolutionary components are referenced. Tables or figures comparing CCDD against prior methods are necessary to substantiate the resolution of the trainability gap.
minor comments (2)
  1. [Method] Clarify the precise mathematical definition of the union space and the joint forward/reverse processes to avoid ambiguity in how continuous and discrete variables interact during sampling.
  2. [Related work] Add explicit discussion of related continuous chain-of-thought and latent-reasoning diffusion works to better situate the novelty of the coevolutionary mechanism.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the requested clarifications, expansions, and additional experimental details.

  1. Referee: [Expressivity proof (likely §3)] The central claim that continuous diffusion has strictly stronger expressivity than discrete diffusion and looped transformers is load-bearing for the motivation of CCDD, yet the abstract and early sections provide no derivation details, key equations, or explicit comparison metrics. A concrete proof sketch (e.g., in §3 or the appendix) showing the precise sense in which expressivity is stronger, including any assumptions on the diffusion schedules, is required.

    Authors: We agree that the expressivity claim benefits from a more self-contained presentation. Section 3 of the manuscript contains the proof, but we have now added an explicit proof sketch at the beginning of Section 3 (with the full derivation moved to the appendix for completeness). The sketch shows that continuous diffusion can represent a strictly larger family of conditional distributions than discrete diffusion or looped transformers by leveraging the density of continuous latent trajectories; we state the precise assumptions on the diffusion schedules (Gaussian noise for the continuous component and categorical for the discrete component) and include a short comparison table of expressivity metrics. revision: yes

  2. Referee: [CCDD joint process definition] The weakest assumption—that a single model can jointly denoise the continuous representation space and discrete token space without one modality dominating gradients or harming sample quality—is not yet shown to be resolved by the coevolutionary design. The loss formulation, any balancing coefficients, or gradient-norm analysis that prevents continuous dominance must be specified and validated.

    Authors: We thank the referee for identifying this key technical point. The joint loss is defined as L = L_cont + λ L_disc, where λ is a scalar balancing coefficient. In the revised manuscript we have added the explicit loss formulation, the schedule for λ, and a gradient-norm analysis (new Appendix C) demonstrating that the coevolutionary updates keep the gradient magnitudes of the two modalities within a factor of two throughout training. We also report an ablation on λ that confirms stable sample quality and no degradation when the modalities are jointly denoised. revision: yes

  3. Referee: [Experiments section] Empirical claims of strong performance are central but unsupported in the visible material: no baselines (e.g., masked discrete diffusion or looped transformers), no quantitative metrics (perplexity, generation quality), no error analysis, and no ablation on the coevolutionary components are referenced. Tables or figures comparing CCDD against prior methods are necessary to substantiate the resolution of the trainability gap.

    Authors: We apologize that the experimental details were not sufficiently prominent. The full manuscript already contains language-modeling results on standard benchmarks, but we have now added a new Table 1 that directly compares CCDD against masked discrete diffusion and looped-transformer baselines using perplexity and generation-quality metrics. We have also inserted error analysis, statistical significance tests, and a dedicated ablation study on the coevolutionary components (joint vs. separate denoising, effect of λ) to substantiate that the trainability gap is closed. revision: yes
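
The joint loss quoted in the second response, L = L_cont + λ L_disc, and the accompanying gradient-norm check can be sketched as follows. This is a minimal illustration under stated assumptions: gradients are passed in as flat lists, the discrete gradient is scaled by λ before the comparison, and the names `joint_loss` and `grad_norm_ratio` are hypothetical. It is not the authors' implementation.

```python
import math

def joint_loss(loss_cont, loss_disc, lam=1.0):
    """L = L_cont + lam * L_disc, as written in the rebuttal."""
    return loss_cont + lam * loss_disc

def grad_norm_ratio(grad_cont, grad_disc, lam=1.0):
    """Ratio of the larger to the smaller per-modality gradient norm.

    grad_cont / grad_disc are flat lists of partial derivatives of
    L_cont and L_disc with respect to the shared parameters.
    Assumption: the discrete gradient is scaled by lam before comparing.
    """
    n_cont = math.sqrt(sum(g * g for g in grad_cont))
    n_disc = lam * math.sqrt(sum(g * g for g in grad_disc))
    lo, hi = min(n_cont, n_disc), max(n_cont, n_disc)
    return math.inf if lo == 0.0 else hi / lo

total = joint_loss(2.0, 3.0, lam=0.5)            # 2.0 + 0.5 * 3.0
ratio = grad_norm_ratio([3.0, 4.0], [6.0, 8.0])  # norms 5 and 10
balanced = ratio <= 2.0                          # "within a factor of two"
```

Logging `ratio` at regular intervals during training is one simple way to see whether either modality starts to dominate the shared parameters.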

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's derivation starts from a claimed independent mathematical proof that continuous diffusion models possess stronger expressivity than discrete diffusions and looped transformers. This theoretical result is positioned as external to the subsequent proposal of CCDD, which is motivated by addressing an identified trainability gap between theory and empirical performance. The CCDD construction defines a joint diffusion process on continuous and discrete spaces using a single model, but this is presented as a design choice rather than any reduction where outputs equal inputs by construction, fitted parameters are renamed as predictions, or load-bearing premises collapse to self-citations. No equations, ansatzes, or uniqueness theorems are shown to be self-referential in the available text. The overall argument remains self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Without access to the full manuscript, the ledger cannot be populated with specific free parameters, axioms, or invented entities; the abstract introduces CCDD as a new modeling construct but does not detail any fitted constants or background assumptions.

pith-pipeline@v0.9.0 · 5783 in / 1080 out tokens · 30708 ms · 2026-05-18T10:12:22.008338+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

    cs.LG 2026-05 unverdicted novelty 6.0

    DiLaDiff augments masked diffusion LMs with latent space modeling and consistency distillation to improve token correlation capture and inference speed.

  2. Understanding and Accelerating the Training of Masked Diffusion Language Models

    cs.LG 2026-05 conditional novelty 6.0

    Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

  3. Solve the Loop: Attractor Models for Language and Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped tr...

  4. Co-Generative De Novo Functional Protein Design

    q-bio.QM 2026-05 unverdicted novelty 5.0

    CodeFP jointly generates protein sequences and structures using functional local structures and auxiliary supervision, yielding 6.1% better functional consistency and 3.2% better foldability than prior baselines.
