pith. machine review for the scientific record.

arxiv: 2605.06548 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI · cs.CV

Recognition: unknown

Continuous Latent Diffusion Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords latent diffusion · non-autoregressive generation · continuous latent space · hierarchical language modeling · diffusion transformer · text variational autoencoder · semantic prior · scaling behavior

The pith

A hierarchical latent diffusion model separates global semantic organization from local text realization in continuous space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cola DLM to generate text by first mapping sequences to continuous latents via a Text VAE, then diffusing a global semantic prior with a block-causal DiT, and finally decoding words conditionally. This frames generation as latent prior transport rather than direct token recovery, creating a non-autoregressive path that compresses semantics independently of word order. A reader would care because the approach decouples high-level meaning from surface form, potentially allowing more flexible generation and scaling that tracks actual output quality better than likelihood scores alone.

Core claim

From a unified Markov-path perspective, Cola DLM's diffusion performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This yields a flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and extends naturally to other continuous modalities. Experiments on 8 benchmarks with matched ~2B-parameter baselines and scaling to 2000 EFLOPs identify an effective configuration and confirm strong scaling behavior for text generation.

What carries the argument

A hierarchical decomposition: a Text VAE provides a stable text-to-latent mapping, a block-causal DiT performs diffusion-based global semantic prior transport in continuous space, and conditional decoding produces the final text output.
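Below is a minimal, hypothetical sketch of that three-stage path using toy PyTorch modules. The module names, dimensions, and the DDPM-style sampling loop are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the three-stage path: Text VAE encode -> latent
# diffusion prior -> conditional decode. All names and shapes are assumptions.
import torch
import torch.nn as nn


class ToyTextVAE(nn.Module):
    """Stage 1: map token ids to continuous latents and back."""

    def __init__(self, vocab: int = 1000, d_model: int = 128, d_latent: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.to_stats = nn.Linear(d_model, 2 * d_latent)  # mean, log-variance
        self.from_latent = nn.Linear(d_latent, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.to_stats(self.embed(tokens)).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def decode_logits(self, z: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.from_latent(z))


def sample_prior(denoiser, shape, steps: int = 20) -> torch.Tensor:
    """Stage 2: crude ancestral sampling of the latent prior.

    `denoiser` stands in for the block-causal DiT; it maps a noisy latent and
    a timestep to a cleaner latent estimate.
    """
    z = torch.randn(shape)
    for t in reversed(range(steps)):
        t_frac = torch.full((shape[0],), t / steps)
        z = denoiser(z, t_frac)
    return z


if __name__ == "__main__":
    vae = ToyTextVAE()

    def dummy_dit(z, t):                      # placeholder for the DiT prior
        return 0.9 * z

    z = sample_prior(dummy_dit, shape=(2, 16, 32))  # 2 sequences, 16 latent slots
    logits = vae.decode_logits(z)                   # Stage 3: conditional decoding
    print(logits.shape)                             # torch.Size([2, 16, 1000])
```

The structural point the sketch illustrates is that stage 2 operates entirely on continuous latents; token identities only appear in stage 3.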

If this is right

  • Text generation gains a non-autoregressive inductive bias that organizes semantics globally before realizing local tokens.
  • Semantic compression and prior fitting occur directly in continuous space rather than through token likelihood.
  • Generation quality and scaling curves become stronger indicators of model capability than likelihood alone.
  • The same latent diffusion structure extends without modification to joint modeling of text with other continuous data types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of global semantics from token realization could reduce error accumulation in long sequences by enforcing high-level coherence first.
  • A shared continuous latent space might allow direct mixing of text generation with image or audio synthesis under one diffusion process.
  • Evaluation focus may shift toward measuring output coherence and scaling efficiency rather than perplexity on next-token prediction.

Load-bearing premise

A stable and invertible mapping from discrete text to continuous latent space exists so that block-causal diffusion can reliably carry global semantics to support high-quality conditional word generation.

What would settle it

Scaling curves showing that Cola DLM generation quality plateaus or lags behind matched autoregressive baselines past 2000 EFLOPs, or that the Text VAE mapping becomes unstable and non-invertible on diverse or long texts.
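One concrete way to probe the stability/invertibility premise is a round-trip reconstruction check bucketed by sequence length. The sketch below assumes generic `encode`/`decode` interfaces; it is not the paper's evaluation code.

```python
# Hypothetical round-trip probe: encode held-out texts, decode them back, and
# track how token-level reconstruction accuracy degrades with length.
from typing import Callable, Dict, List


def roundtrip_accuracy(
    texts: List[List[int]],                  # tokenized held-out texts
    encode: Callable[[List[int]], object],   # text -> latent (assumed interface)
    decode: Callable[[object], List[int]],   # latent -> text (assumed interface)
) -> Dict[str, float]:
    """Bucket token-level reconstruction accuracy by sequence length."""
    buckets: Dict[str, List[float]] = {"short (<128)": [], "long (>=128)": []}
    for tokens in texts:
        recon = decode(encode(tokens))
        n = min(len(tokens), len(recon))
        acc = sum(a == b for a, b in zip(tokens[:n], recon[:n])) / max(len(tokens), 1)
        key = "short (<128)" if len(tokens) < 128 else "long (>=128)"
        buckets[key].append(acc)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}


if __name__ == "__main__":
    # With an identity "VAE" both buckets sit at 1.0; a collapsing or unstable
    # mapping would show the long bucket falling well below the short one.
    data = [[1, 2, 3] * 10, [4, 5] * 100]
    print(roundtrip_accuracy(data, encode=lambda t: list(t), decode=lambda z: list(z)))
```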

original abstract

Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into a Text VAE for learning a stable text-to-latent mapping, a block-causal DiT for modeling a global semantic prior in continuous latent space, and conditional decoding for text generation. From a Markov-path view, the diffusion performs latent prior transport rather than token-level recovery. Experiments span 4 research questions and 8 benchmarks with strictly matched ~2B-parameter autoregressive and LLaDA baselines, plus scaling curves to ~2000 EFLOPs, claiming strong scaling behavior and establishing hierarchical continuous latent prior modeling as a principled non-autoregressive alternative to token-level language modeling.

Significance. If the empirical claims hold with full supporting data, the work would be significant for offering a continuous-space inductive bias that separates global semantics from local realization, with potential advantages in scaling and multimodal unification. The use of matched baselines and large-scale EFLOP curves provides a concrete basis for comparing generation quality and scaling behavior against likelihood-based AR models.

major comments (2)
  1. [Abstract] The central claim that Cola DLM establishes hierarchical continuous latent prior modeling as a principled alternative rests on reported performance across 8 benchmarks and scaling to 2000 EFLOPs, yet the abstract supplies no numerical results, ablation tables, or error bars. This renders the support for outperformance over matched ~2B baselines unverifiable from the provided summary.
  2. [Methods] Text VAE component: The load-bearing assumption of a 'stable text-to-latent mapping' that supports faithful semantic representations for the subsequent DiT prior is not accompanied by reconstruction fidelity metrics (e.g., BLEU, perplexity on held-out text), posterior-collapse diagnostics, or KL-annealing curves. Without these, downstream generation quality and scaling curves could reflect VAE compression artifacts rather than the benefits of block-causal diffusion in continuous space.
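As an illustration of the kind of diagnostic point 2 asks for, the sketch below computes the per-dimension KL of a diagonal Gaussian posterior against the standard normal prior and counts active latent units; the shapes, data, and activity threshold are assumptions, not values from the paper.

```python
# Hedged sketch of a posterior-collapse diagnostic for a Gaussian Text VAE.
import torch


def per_dim_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)) for each latent dimension, averaged over the batch.

    mu, logvar: shape (batch, d_latent).
    """
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)  # (batch, d_latent)
    return kl.mean(dim=0)                                  # (d_latent,)


if __name__ == "__main__":
    torch.manual_seed(0)
    mu = torch.randn(256, 64) * 0.5       # stand-in posterior means
    logvar = torch.zeros(256, 64)         # stand-in posterior log-variances
    kl = per_dim_kl(mu, logvar)
    active = (kl > 0.01).sum().item()     # dimensions still carrying information
    print(f"mean KL/dim = {kl.mean():.3f}, active units = {active}/64")
```

A collapsed posterior shows near-zero KL on most dimensions; tracking this curve alongside the KL-annealing schedule is what would separate VAE artifacts from genuine gains of the latent prior.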
minor comments (2)
  1. Clarify the precise definition of block-causality in the DiT architecture and how it interacts with the diffusion noise schedule; an explicit equation or diagram would aid reproducibility (one plausible masking scheme is sketched after this list).
  2. Provide exact parameter counts, training token budgets, and optimizer settings for all baselines to ensure the 'strictly matched' comparison is fully transparent.
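For minor point 1, one plausible reading of block-causality, offered as a hedged sketch rather than the paper's definition, is bidirectional attention within a fixed-size block and causal attention across blocks:

```python
# Illustrative block-causal mask: positions attend freely within their block
# and causally across blocks. Block size and layout are assumptions.
import torch


def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True = attention allowed."""
    idx = torch.arange(seq_len)
    block_id = idx // block_size
    # Position i may attend to position j iff j's block is not in i's future.
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)


if __name__ == "__main__":
    print(block_causal_mask(6, 2).int())
    # tensor([[1, 1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0, 0],
    #         [1, 1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 1, 1],
    #         [1, 1, 1, 1, 1, 1]])
```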

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the Text VAE validation. We have revised the manuscript to directly address both points by adding concrete numerical support and diagnostic metrics, which we believe strengthens the verifiability of our claims without altering the core contributions.

point-by-point responses
  1. Referee: [Abstract] The central claim that Cola DLM establishes hierarchical continuous latent prior modeling as a principled alternative rests on reported performance across 8 benchmarks and scaling to 2000 EFLOPs, yet the abstract supplies no numerical results, ablation tables, or error bars. This renders the support for outperformance over matched ~2B baselines unverifiable from the provided summary.

    Authors: We agree that the abstract would benefit from explicit quantitative anchors to make the central claims immediately verifiable. In the revised manuscript we have inserted concise numerical highlights drawn from the main results (e.g., average gains over the matched ~2B AR and LLaDA baselines across the eight benchmarks, together with the observed scaling trend to ~2000 EFLOPs). Detailed ablation tables, error bars, and per-benchmark breakdowns remain in the body and appendix, as space constraints preclude their inclusion in the abstract itself. These additions render the support for outperformance directly readable from the abstract while preserving its brevity. revision: yes

  2. Referee: [Methods] Text VAE component: The load-bearing assumption of a 'stable text-to-latent mapping' that supports faithful semantic representations for the subsequent DiT prior is not accompanied by reconstruction fidelity metrics (e.g., BLEU, perplexity on held-out text), posterior-collapse diagnostics, or KL-annealing curves. Without these, downstream generation quality and scaling curves could reflect VAE compression artifacts rather than the benefits of block-causal diffusion in continuous space.

    Authors: We acknowledge that the original submission did not foreground explicit reconstruction and stability diagnostics for the Text VAE in the main text. We have added a dedicated paragraph in Section 3.1 together with a new appendix subsection that reports (i) BLEU and perplexity on held-out text, (ii) posterior-collapse diagnostics via KL-divergence statistics and histograms, and (iii) the KL-annealing schedule and corresponding curves. These metrics confirm faithful reconstruction without collapse. We further include a controlled ablation that isolates the VAE contribution from the block-causal DiT prior, showing that the reported scaling behavior and benchmark gains are not explained by VAE compression artifacts alone. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external empirical comparisons

full rationale

The paper presents Cola DLM as a hierarchical design (Text VAE for mapping, block-causal DiT for prior, conditional decoder) justified by experiments across 8 benchmarks, matched baselines, and scaling curves up to 2000 EFLOPs. No derivation chain reduces a claimed result to a fitted parameter or self-citation by construction; the Markov-path perspective is interpretive framing rather than a mathematical reduction. The central claim of a principled alternative is supported by generation quality and scaling behavior versus autoregressive and LLaDA baselines, which are independent of internal fits. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the existence of a stable invertible latent mapping and on the ability of block-causal diffusion to organize global semantics; both are domain assumptions rather than derived results.

free parameters (2)
  • latent dimensionality
    Controls the degree of semantic compression in the Text VAE and must be selected to support both reconstruction and prior modeling.
  • diffusion noise schedule and number of steps
    Determines how the latent prior is learned and sampled; chosen to balance quality and compute (an illustrative schedule is sketched after this ledger).
axioms (2)
  • domain assumption A Text VAE can learn a stable, sufficiently invertible mapping from discrete text to continuous latent codes.
    Invoked in the first stage and required for the subsequent conditional decoding to succeed.
  • domain assumption Block-causal attention applied to latent codes can capture and transport global semantic structure.
    Central to the claim that the diffusion process performs prior transport rather than token-level recovery.
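To make the second free parameter concrete, the sketch below evaluates a standard cosine alpha-bar schedule (Nichol & Dhariwal, 2021) at different step counts; the paper's actual schedule and step budget are not specified in the abstract.

```python
# Illustrative noise schedule for the latent diffusion prior: cosine alpha-bar
# over a configurable number of steps. Values shown are generic, not the paper's.
import math


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> list:
    """Cumulative signal-retention alpha_bar(t) for t = 0..num_steps."""
    def f(t: float) -> float:
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0.0) for t in range(num_steps + 1)]


if __name__ == "__main__":
    for steps in (10, 50):                 # fewer steps -> coarser prior transport
        ab = cosine_alpha_bar(steps)
        print(steps, [round(x, 3) for x in (ab[0], ab[len(ab) // 2], ab[-1])])
```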

pith-pipeline@v0.9.0 · 5585 in / 1591 out tokens · 84718 ms · 2026-05-08T09:58:14.276942+00:00 · methodology

discussion (0)

