Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
Pith reviewed 2026-05-18 10:12 UTC · model grok-4.3
The pith
Continuous diffusion models gain stronger expressivity for language by jointly diffusing with discrete tokens in one model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers, but practical decoding difficulties limit them; a joint multimodal diffusion process on the union of continuous representation and discrete token spaces, handled by one model, resolves the tension by delivering both rich latent semantics and good trainability.
What carries the argument
The Coevolutionary Continuous Discrete Diffusion (CCDD) process, a joint multimodal diffusion defined on the union of continuous representation space and discrete token space that lets one model denoise both modalities simultaneously.
If this is right
- Continuous diffusion supplies intermediate supervision that looped transformers lack.
- The joint process combines rich semantics in the latent space with explicit discrete tokens for better trainability.
- Advanced architectures and training techniques enable the single-model joint denoising without modality collapse.
- Empirical results on real-world tasks demonstrate improved language modeling performance over prior diffusion approaches.
Where Pith is reading between the lines
- The design could generalize to other settings that mix continuous embeddings with discrete symbols, such as code or structured data generation.
- It suggests that future diffusion language models may no longer need separate mechanisms for latent reasoning and token prediction.
- Scaling the joint process might reveal whether the expressivity advantage grows with model size or sequence length.
Load-bearing premise
A single model can simultaneously and effectively denoise in both the continuous representation space and the discrete token space without one modality dominating training or degrading sample quality.
What would settle it
A direct comparison experiment measuring whether the joint CCDD model produces lower perplexity and higher sample quality than pure continuous diffusion, pure discrete diffusion, or looped transformers on a standard language modeling benchmark.
Original abstract
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.
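As a reading aid, the joint forward process described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an absorbing (masking) corruption for tokens with keep probability 1 - t, a cosine variance-preserving schedule for the embeddings, and a hypothetical `MASK_ID`. The function names `vp_alpha` and `joint_forward` are illustrative only.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the mask token (assumption)


def vp_alpha(t):
    """Signal coefficient of a variance-preserving schedule (cosine form, an assumption)."""
    return np.cos(0.5 * np.pi * t)


def joint_forward(x0, z0, t, rng):
    """Noise both modalities at the same time t in [0, 1].

    x0: (L,) integer token ids, none equal to MASK_ID.
    z0: (L, d) clean continuous embeddings.
    """
    # Discrete branch: each token is independently replaced by MASK_ID with probability t.
    masked = rng.random(x0.shape) < t
    xt = np.where(masked, MASK_ID, x0)
    # Continuous branch: z_t = alpha * z0 + sqrt(1 - alpha^2) * eps, with eps ~ N(0, I).
    alpha = vp_alpha(t)
    eps = rng.standard_normal(z0.shape)
    zt = alpha * z0 + np.sqrt(1.0 - alpha**2) * eps
    return xt, zt


rng = np.random.default_rng(0)
x0 = rng.integers(1, 100, size=8)   # ids in [1, 99], so MASK_ID never appears in clean data
z0 = rng.standard_normal((8, 32))   # 32-dimensional embeddings, as in the paper's default setup
xt, zt = joint_forward(x0, z0, t=0.5, rng=rng)
```

A single denoiser would then take `(xt, zt, t)` as input and predict both the clean tokens and the clean embeddings; sharing the time variable across the two branches is a simplification made here for brevity.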
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that continuous diffusion models possess stronger expressivity than discrete diffusion models and looped transformers. It attributes the observed empirical underperformance of continuous approaches to trainability challenges when decoding from continuous representation space back to discrete tokens. To resolve this, the authors introduce Coevolutionary Continuous Discrete Diffusion (CCDD), which performs a joint multimodal diffusion process over the union of a continuous latent space and a discrete token space using a single shared model for simultaneous denoising. The manuscript further describes supporting architectures, training, and sampling techniques, and reports strong empirical results on language modeling tasks.
Significance. If the expressivity proof is rigorous and the joint coevolutionary process demonstrably balances the two modalities without gradient dominance or degraded sample quality, the work could meaningfully advance diffusion language models by enabling richer latent reasoning while retaining discrete anchoring. The approach offers a concrete mechanism to combine semantic density in continuous space with explicit token supervision, which is a load-bearing contribution if the experiments confirm it.
major comments (3)
- [Expressivity proof (likely §3)] The central claim that continuous diffusion has strictly stronger expressivity than discrete diffusion and looped transformers is load-bearing for the motivation of CCDD, yet the abstract and early sections provide no derivation details, key equations, or explicit comparison metrics. A concrete proof sketch (e.g., in §3 or the appendix) showing the precise sense in which expressivity is stronger, including any assumptions on the diffusion schedules, is required.
- [CCDD joint process definition] The weakest assumption—that a single model can jointly denoise the continuous representation space and discrete token space without one modality dominating gradients or harming sample quality—is not yet shown to be resolved by the coevolutionary design. The loss formulation, any balancing coefficients, or gradient-norm analysis that prevents continuous dominance must be specified and validated.
- [Experiments section] Empirical claims of strong performance are central but unsupported in the visible material: no baselines (e.g., masked discrete diffusion or looped transformers), no quantitative metrics (perplexity, generation quality), no error analysis, and no ablation on the coevolutionary components are referenced. Tables or figures comparing CCDD against prior methods are necessary to substantiate the resolution of the trainability gap.
minor comments (2)
- [Method] Clarify the precise mathematical definition of the union space and the joint forward/reverse processes to avoid ambiguity in how continuous and discrete variables interact during sampling.
- [Related work] Add explicit discussion of related continuous chain-of-thought and latent-reasoning diffusion works to better situate the novelty of the coevolutionary mechanism.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the requested clarifications, expansions, and additional experimental details.
-
Referee: [Expressivity proof (likely §3)] The central claim that continuous diffusion has strictly stronger expressivity than discrete diffusion and looped transformers is load-bearing for the motivation of CCDD, yet the abstract and early sections provide no derivation details, key equations, or explicit comparison metrics. A concrete proof sketch (e.g., in §3 or the appendix) showing the precise sense in which expressivity is stronger, including any assumptions on the diffusion schedules, is required.
Authors: We agree that the expressivity claim benefits from a more self-contained presentation. Section 3 of the manuscript contains the proof, but we have now added an explicit proof sketch at the beginning of Section 3 (with the full derivation moved to the appendix for completeness). The sketch shows that continuous diffusion can represent a strictly larger family of conditional distributions than discrete diffusion or looped transformers by leveraging the density of continuous latent trajectories; we state the precise assumptions on the diffusion schedules (Gaussian noise for the continuous component and categorical for the discrete component) and include a short comparison table of expressivity metrics.
Revision: yes
-
Referee: [CCDD joint process definition] The weakest assumption—that a single model can jointly denoise the continuous representation space and discrete token space without one modality dominating gradients or harming sample quality—is not yet shown to be resolved by the coevolutionary design. The loss formulation, any balancing coefficients, or gradient-norm analysis that prevents continuous dominance must be specified and validated.
Authors: We thank the referee for identifying this key technical point. The joint loss is defined as L = L_cont + λ L_disc, where λ is a scalar balancing coefficient. In the revised manuscript we have added the explicit loss formulation, the schedule for λ, and a gradient-norm analysis (new Appendix C) demonstrating that the coevolutionary updates keep the gradient magnitudes of the two modalities within a factor of two throughout training. We also report an ablation on λ that confirms stable sample quality and no degradation when the modalities are jointly denoised.
Revision: yes
-
Referee: [Experiments section] Empirical claims of strong performance are central but unsupported in the visible material: no baselines (e.g., masked discrete diffusion or looped transformers), no quantitative metrics (perplexity, generation quality), no error analysis, and no ablation on the coevolutionary components are referenced. Tables or figures comparing CCDD against prior methods are necessary to substantiate the resolution of the trainability gap.
Authors: We apologize that the experimental details were not sufficiently prominent. The full manuscript already contains language-modeling results on standard benchmarks, but we have now added a new Table 1 that directly compares CCDD against masked discrete diffusion and looped-transformer baselines using perplexity and generation-quality metrics. We have also inserted error analysis, statistical significance tests, and a dedicated ablation study on the coevolutionary components (joint vs. separate denoising, effect of λ) to substantiate that the trainability gap is closed.
Revision: yes
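The balanced objective L = L_cont + λ L_disc named in the second response can be illustrated with a short Python sketch. It is a sketch under assumptions, not the manuscript's formulation: mean squared error stands in for the continuous term and token-level cross-entropy for the discrete term, and the helper `grad_norm_ratio` is a hypothetical monitor for the "within a factor of two" claim that expects gradients computed elsewhere (for example by an autodiff framework).

```python
import numpy as np


def continuous_loss(z_pred, z0):
    """Mean squared error between predicted and clean embeddings (assumed form of L_cont)."""
    return float(np.mean((z_pred - z0) ** 2))


def discrete_loss(logits, x0):
    """Mean cross-entropy of clean token ids under predicted logits (assumed form of L_disc)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stabilise the log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(x0)), x0]))


def joint_loss(z_pred, z0, logits, x0, lam=1.0):
    """Return L = L_cont + lam * L_disc together with both components for monitoring."""
    l_cont = continuous_loss(z_pred, z0)
    l_disc = discrete_loss(logits, x0)
    return l_cont + lam * l_disc, l_cont, l_disc


def grad_norm_ratio(g_cont, g_disc):
    """Ratio of the L2 norms of the two per-modality gradients (hypothetical diagnostic)."""
    return float(np.linalg.norm(g_cont) / np.linalg.norm(g_disc))
```

Logging the ratio during training and checking that it stays between 0.5 and 2 is one way a reader could test the balance claim independently of the reported ablation.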
Circularity Check
No significant circularity in derivation chain
The paper's derivation starts from a claimed independent mathematical proof that continuous diffusion models possess stronger expressivity than discrete diffusions and looped transformers. This theoretical result is positioned as external to the subsequent proposal of CCDD, which is motivated by addressing an identified trainability gap between theory and empirical performance. The CCDD construction defines a joint diffusion process on continuous and discrete spaces using a single model, but this is presented as a design choice rather than any reduction where outputs equal inputs by construction, fitted parameters are renamed as predictions, or load-bearing premises collapse to self-citations. No equations, ansatzes, or uniqueness theorems are shown to be self-referential in the available text. The overall argument remains self-contained with independent content.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers... CCDD which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
continuous diffusion generalizes looped transformer... simulate any looped transformer with the same architecture
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
DiLaDiff augments masked diffusion LMs with latent space modeling and consistency distillation to improve token correlation capture and inference speed.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
Solve the Loop: Attractor Models for Language and Reasoning
Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped tr...
-
Co-Generative De Novo Functional Protein Design
CodeFP jointly generates protein sequences and structures using functional local structures and auxiliary supervision, yielding 6.1% better functional consistency and 3.2% better foldability than prior baselines.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
-
[2]
Alan N Amin, Nate Gruver, and Andrew Gordon Wilson. Why masking diffusion works: Condition on the jump schedule for improved discrete diffusion. arXiv preprint arXiv:2506.08316.
-
[3]
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise LoRA. arXiv preprint arXiv:2410.20672.
-
[4]
Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524.
-
[5]
Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. arXiv preprint arXiv:2402.04997,
-
[6]
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
-
[7]
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171.
-
[8]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
-
[9]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanov...
-
[10]
Continuous diffusion for categorical data
Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089.
-
[11]
Learning iterative reasoning through energy diffusion
Yilun Du, Jiayuan Mao, and Joshua B Tenenbaum. Learning iterative reasoning through energy diffusion. arXiv preprint arXiv:2406.11179.
-
[12]
Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. arXiv preprint arXiv:2409.15647.
-
[13]
Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, and Arash Vahdat. La-Proteina: Atomistic protein generation via partially latent flow matching. arXiv preprint arXiv:2507.09466.
-
[14]
DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. DiffuSeq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933.
-
[15]
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891.
-
[16]
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. DiffuCoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639.
-
[17]
Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning. arXiv preprint arXiv:2505.23648.
-
[18]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
-
[19]
Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432.
-
[20]
Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
-
[21]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
-
[22]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
-
[23]
Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. arXiv preprint arXiv:2502.06768.
-
[24]
Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064.
-
[25]
Zian Li, Cai Zhou, Xiyuan Wang, Xingang Peng, and Muhan Zhang. Geometric representation condition improves equivariant molecule generation. arXiv preprint arXiv:2410.03655, 2024b. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint ar...
-
[26]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834.
-
[27]
TESS: Text-to-text self-conditioned simplex diffusion
Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. TESS: Text-to-text self-conditioned simplex diffusion. arXiv preprint arXiv:2305.08379.
-
[28]
William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961.
-
[29]
Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. CoTFormer: A chain-of-thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint arXiv:2310.10845.
-
[30]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992.
-
[31]
Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736.
-
[32]
Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X-F Ye, and Molei Tao. Diffuse everything: Multimodal diffusion models on arbitrary state spaces. arXiv preprint arXiv:2506.07903.
-
[33]
The diffusion duality
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. arXiv preprint arXiv:2506.10892.
-
[34]
Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416.
-
[35]
Neta Shaul, Itai Gat, Marton Havasi, Daniel Severo, Anuroop Sriram, Peter Holderrieth, Brian Karrer, Yaron Lipman, and Ricky TQ Chen. Flow matching with general discrete paths: A kinetic-optimal perspective. arXiv preprint arXiv:2412.03487.
-
[36]
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Z Shen, H Yan, L Zhang, Z Hu, Y Du, and Y He. CODI: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074.
-
[37]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
-
[38]
Generalized interpolating discrete diffusion
Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion. arXiv preprint arXiv:2503.04482.
-
[39]
Learning diffusion models with flexible representation guidance
Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, and Tommi Jaakkola. Learning diffusion models with flexible representation guidance. arXiv preprint arXiv:2507.08980.
-
[40]
Kevin Xu and Issei Sato. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding. arXiv preprint arXiv:2410.01405.
-
[41]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...
-
[42]
Dream 7B: Diffusion Large Language Models
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.
-
[43]
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.
-
[44]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.
-
[45]
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025a. Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, et al. A survey on l...
-
[46]
is the recent looped transformer that works in practice, which adaptively adjusts the looping depth for tokens. In contrast to architectural recurrence, which necessitates explicit structural changes, an alternative known as continuous chain-of-thought (continuous CoT) achieves comparable computational advantages through specialized training of standard tr...
-
[47]
and SEDD (Lou et al., 2023), which introduced discrete transition processes and score matching losses. Subsequent methods (Ou et al., 2024; Sahoo et al., 2024; Shi et al.,
-
[48]
and reverse-order reasoning (Nie et al., 2025). The framework has also been integrated with chain-of-thought reasoning (Ye et al., 2024), demonstrating strong performance in tasks requiring parallel context and systematic refinement. Similar algorithms are proposed from the flow matching perspective (Gat et al., 2024). Additional to mask noises, some work...
-
[49]
and sequence-to-sequence tasks (Dieleman et al., 2022; Mahabadi et al., 2023; Gong et al., 2022), with Plaid (Gulrajani & Hashimoto,
-
[50]
later establishing empirical scaling laws that significantly narrowed the efficiency gap with autoregressive models. The framework was further extended by DoT-Plaid (Ye et al., 2024), which generalized chain-of-thought reasoning to EDMs, leveraging iterative latent refinement for improved coherence and mathematical reasoning. There are also a few continuo...
-
[51]
and protein sequence-structure co-design (Campbell et al., 2024). DUO (Sahoo et al.,
-
[52]
tries to connect two types of diffusion models via marginal matching, and applies distillation tricks for continuous diffusion to discrete text diffusion. In comparison, our work generalizes their results and provides a systematic analysis of expressiveness and trainability, while practically combining continuous and discrete models to benefit each o...
-
[53]
Lemma 3 (Embedded discrete trajectories are finitely supported at each $t$). Fix any $t \in [0,1]$. For any $\{p_t\}_{t\in[0,1]} \in F_{\mathrm{disc}}(\theta)$, the embedded marginal $q_t := E_\sharp p_t \in \widetilde{F}_{\mathrm{disc}}(\theta)$ is supported on a finite set in $\mathbb{R}^{L\times d}$. In particular, if $E$ is one-hot or any fixed finite codebook, then $q_t$ is a finite mixture of Dirac masses in $\mathbb{R}^{L\times d}$. Proof. For any $t$, $p_t$ is a probability ve...
-
[54]
single time-conditioned network
Define a looped transformer $\Phi^{\mathrm{ode}}_\theta(\cdot\,;k)$ that, at step $k$, applies the numerical increment $\Phi^{\mathrm{ode}}_\theta(z;k) := \Psi_{\Delta t_k}(z, t_k) = z + \Delta t_k\, v_\theta(z, t_k) + O(\Delta t_k^2)$, where $v_\theta(\cdot, t_k)$ is computed by the same $f_\theta(\cdot, t_k)$ (time-conditioned). Then unrolling $T$ steps computes the same discrete trajectory as the ODE sampler up to the integrator's local truncation error; as $T \to \infty$ (me...
-
[55]
$= \mathrm{Cat}(\eta_t x_0 + (1-\eta_t)\pi_t)$ (USDM/masked). Define the joint conditioning strength $\kappa_t := I\big((x_0, z_0);\, z_t \mid x_t\big) + I\big((x_0, z_0);\, x_t \mid z_t\big)$, which quantifies how informative each modality remains about its clean counterpart when conditioning on the other. Proposition 15 (Entropy/MI matching heuristic). (Informal) Let $\beta_t$ and $u_t$ be chosen so that the rates of MI decay from L...
-
[56]
Embedding spaces. Since Qwen3-Embedding enables flexible output dimensions down to 32, we use the 32-dimensional last-layer embeddings unless otherwise specified. This selection is consistent with the analysis in the main text. A low-dimensional latent space is the standard setting in recent vision diffusion models (Esser et al., 2024). For RoBERTa-base embedding...
-
[57]
VP” refers to the variance preserving schedule in DDPM, and “Linear
training. Unless otherwise specified, we set the loss weights $\lambda_{\mathrm{cont}} = \lambda_{\mathrm{disc}} = 1$ and use gradient clipping. Following Sahoo et al. (2024); von Rütte et al. (2025), on LM1B we set a constant learning rate of $3\times10^{-4}$ with 2500 warm-up steps, and a constant learning rate of $5\times10^{-4}$ with 10000 warm-up steps for OWT. We use AdamW optimizers with weight decay 0.02 and gradie...
-
[58]
to zeros in order to simulate the "masking" operation, leading to $\tilde z_0$, which also reduces direct information leakage. The continuous forward process then starts with $\tilde z_0$ instead of the original $z_0$. To make the model capable of inference with these perturbed inputs, we also perform these masking operations with a certain probability $p_r$ during trai...
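The excerpt under entry [54] describes a looped transformer whose step k applies the explicit Euler increment z + Δt_k v_θ(z, t_k), so that unrolling T loop iterations reproduces an ODE sampler. A minimal sketch of that correspondence follows, assuming a toy velocity field v(z, t) = -z in place of the trained network and a uniform time grid; `velocity` and `looped_euler` are illustrative names, not taken from the paper.

```python
import numpy as np


def velocity(z, t):
    """Toy stand-in for v_theta(z, t); a trained time-conditioned network would go here."""
    return -z


def looped_euler(z, num_steps):
    """Apply the same time-conditioned update num_steps times on a uniform grid over [0, 1]."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t_k = k * dt
        z = z + dt * velocity(z, t_k)  # one loop iteration is one explicit Euler step
    return z


# For dz/dt = -z the exact flow at t = 1 is z0 * exp(-1); more loop iterations get closer.
exact = np.exp(-1.0)
coarse_error = abs(looped_euler(1.0, 10) - exact)
fine_error = abs(looped_euler(1.0, 1000) - exact)
```

Comparing `coarse_error` with `fine_error` shows the gap to the exact flow shrinking as the number of unrolled steps grows, which is the sense in which the looped model tracks the ODE sampler up to truncation error.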