arxiv: 2605.09820 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

Bian Sun , Kevin Zhai , Mubarak Shah , Zhenyi Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords diffusion language modelsflexible-length generationBayesian inferencestructured decodingtraining-free methoddynamic block expansioncoherent variable-length output

0 comments

The pith

A training-free Bayesian framework enables diffusion language models to generate variable-length text by jointly inferring lengths, blocks, and schedules during decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that removes the need to specify output length before decoding begins in diffusion language models. It treats generation as a problem of inferring sequence structure on the fly using Bayesian updates that combine local uncertainty estimates with signals about block organization. This produces both the right length and the division of the output into coherent blocks without retraining the underlying model. A sympathetic reader would care because prior flexible-length approaches either required expensive retraining or relied on narrow local confidence checks that often broke global coherence. If successful, the approach makes diffusion models more usable in open-ended generation tasks where length cannot be predetermined.

Core claim

The central claim is that flexible-length generation in diffusion language models can be cast as a dynamic structural inference problem solved through Bayesian methods. At each window expansion step the framework integrates local uncertainty with structural signals in a single mechanism to compute the expansion length, the block boundaries, and the decoding schedule, thereby supporting both flexible block expansion and block organization while preserving coherence across the full output.

What carries the argument

Dystruct, a training-free Bayesian structured decoding framework that jointly infers expansion length, block boundaries, and decoding schedule by unifying local uncertainty signals with structural information at each expansion step.

If this is right

Generation quality and flexibility improve over both fixed-length diffusion models and prior flexible-length methods across multiple benchmarks.
The model can dynamically expand and organize blocks while keeping overall coherence without post-hoc tuning.
No retraining or architectural changes to the base diffusion language model are required.
The same Bayesian update step determines length, boundaries, and schedule in one pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar Bayesian structural inference could be tested on other non-autoregressive generation settings where length uncertainty also appears.
The method may help address coherence issues in long-form parallel decoding tasks that current local-signal approaches struggle with.
Applying the framework to specialized domains such as code or dialogue could reveal whether the structural signals capture domain-specific patterns automatically.

Load-bearing premise

Local uncertainty signals combined with structural signals through a unified Bayesian mechanism are sufficient to infer global sequence structure and maintain coherence in variable-length outputs without any model retraining.

What would settle it

If side-by-side experiments on standard benchmarks show that Dystruct outputs receive lower quality scores or exhibit more coherence failures than fixed-length diffusion baselines once length is left free to vary, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.09820 by Bian Sun, Kevin Zhai, Mubarak Shah, Zhenyi Wang.

**Figure 1.** Figure 1: Overview of DyStruct. The framework performs flexible-length decoding by iteratively appending masked windows and executing structural inference. (a) Window Expansion: The next window size adaptively scales based on the mean instability (h¯) of previously decoded tokens. (b) CRP-Style Partitioning: A short temporary pass extracts token-level instability scores (hj ), which a CRP-style prior uses to partiti… view at source ↗

**Figure 2.** Figure 2: Inference efficiency comparison. DyStruct achieves the lowest inference time across different backbone models on the GSM8K dataset. Time is reported in seconds per iteration (s/it) [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: DyStruct Resolves Boundary Fragmentation via Edge-Welding. Independent block decoding produces structurally incompatible boundaries. The predictive entropy spike triggers localized boundary repair to recover context-grounded syntax. (Red: incoherent variables; Green: localized repair.) window into two segments. The Gibbs schedule prioritizes the low-instability setup (Block 1). This order establishes a con… view at source ↗

**Figure 4.** Figure 4: DyStruct Isolates Logical Transitions via Partitioning. The framework splits the unanchored window to isolate segments with high instability scores. Prioritizing Block 1 provides stable conditioning before refining the logical evaluation in Block 2. (Blue: low-instability segment; Red: high-instability deduction.) When the generated sequence contains causal dependencies, the scheduler resolves terminal anc… view at source ↗

**Figure 5.** Figure 5: DyStruct Multi-Block Scheduling via Stable Anchors. The scheduler prioritizes both terminal anchor blocks (1 and 3) to establish a constrained context for the high-instability inferential resolution in Block 2. (Blue: stable anchors; Red: high-instability inference.) 6 Conclusion This paper presents a principled Bayesian framework for flexible-length diffusion language models (DLMs). We formulate flexible-… view at source ↗

read the original abstract

Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive models, primarily due to their ability to enable parallel decoding. Despite this advantage, most existing DLMs rely on a fixed generation length specified prior to decoding, which restricts their flexibility in real-world applications. While a few recent works attempt to support flexible-length generation, they typically suffer from notable limitations: some require costly retraining to accommodate variable-length outputs, while others depend solely on local confidence signals during decoding. Such local criteria fail to capture the evolving structure of the sequence, often resulting in suboptimal generation quality. In this paper, we propose a training-free, Bayesian structured decoding framework that formulates flexible-length generation as a dynamic structural inference problem. Our approach formulates flexible-length generation as a dynamic structural inference problem, jointly computing the expansion length, the block boundaries, and the decoding schedule. At each window expansion step, the method integrates local uncertainty with structural signals via a unified mechanism that supports dynamic structured generation, including both flexible block expansion and block organization, while maintaining coherence. Extensive experiments across multiple benchmarks demonstrate that our approach significantly improves generation quality and flexibility over existing fixed-length and flexible-length baselines. These results highlight the advantage of Bayesian structured decoding for diffusion language model, providing a principled and efficient solution for structured text generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dystruct, a training-free Bayesian structured decoding framework for diffusion language models. It formulates flexible-length generation as a dynamic structural inference problem that jointly computes expansion length, block boundaries, and decoding schedule by integrating local uncertainty signals with structural signals at each window expansion step, with the goal of maintaining coherence across variable-length outputs. Extensive experiments on multiple benchmarks are claimed to show quality and flexibility gains over fixed-length and flexible-length baselines.

Significance. If the empirical results hold under rigorous validation, the work would be significant because it offers a principled, training-free mechanism to overcome the fixed-length restriction that limits most diffusion language models, potentially improving their practicality for real-world applications without the cost of retraining. The unified Bayesian treatment of local uncertainty and global structure is a conceptually clean contribution that could inform future non-autoregressive decoding methods.

major comments (2)

[Experiments section] The central empirical claim (abstract and Experiments section) rests on reported quality and flexibility gains, yet the manuscript provides no error bars, statistical significance tests, number of random seeds, or ablation studies isolating the contribution of the Bayesian structural inference versus simpler local-confidence baselines. This makes it impossible to determine whether the gains are robust or attributable to the proposed mechanism.
[Method section] The method description (Method section) states that local uncertainty and structural signals are combined via a 'unified Bayesian mechanism' to jointly infer expansion length, block boundaries, and schedule, but no explicit update equations, prior definitions, or likelihood formulations are supplied. Without these, the claim of a 'principled' inference procedure cannot be verified or reproduced.

minor comments (1)

[Abstract] The abstract contains a redundant sentence repeating the phrase 'formulates flexible-length generation as a dynamic structural inference problem.'

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [Experiments section] The central empirical claim (abstract and Experiments section) rests on reported quality and flexibility gains, yet the manuscript provides no error bars, statistical significance tests, number of random seeds, or ablation studies isolating the contribution of the Bayesian structural inference versus simpler local-confidence baselines. This makes it impossible to determine whether the gains are robust or attributable to the proposed mechanism.

Authors: We acknowledge the validity of this observation. The current version reports aggregate performance metrics without accompanying statistical details or targeted ablations. In the revised manuscript, we will rerun all experiments with multiple random seeds, include error bars and standard deviations, conduct statistical significance tests (e.g., paired t-tests), and add an ablation study that isolates the Bayesian structural inference component against a local-confidence-only baseline. These additions will allow readers to assess the robustness and specific contribution of the proposed mechanism. revision: yes
Referee: [Method section] The method description (Method section) states that local uncertainty and structural signals are combined via a 'unified Bayesian mechanism' to jointly infer expansion length, block boundaries, and schedule, but no explicit update equations, prior definitions, or likelihood formulations are supplied. Without these, the claim of a 'principled' inference procedure cannot be verified or reproduced.

Authors: We agree that the Method section requires greater mathematical precision to substantiate the claim of a principled Bayesian procedure. In the revision, we will expand the description to include the explicit posterior update equations, the prior distribution over dynamic structural configurations (expansion lengths and block boundaries), and the likelihood model that incorporates local uncertainty signals at each window step. This will render the inference process fully specified and reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the proposed framework

full rationale

The paper introduces a training-free Bayesian structured decoding method that formulates flexible-length generation as a dynamic structural inference problem, integrating local uncertainty signals with structural signals to jointly determine expansion length, block boundaries, and decoding schedule. No equations, fitted parameters, or self-referential definitions appear in the provided abstract or description that would reduce the central claim to its own inputs by construction. The approach is presented as relying on external uncertainty and structural signals rather than internal fitting or prior self-citations that bear the load of the uniqueness or correctness of the inference mechanism. Experiments on benchmarks are invoked as independent validation, making the derivation self-contained without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only abstract available; ledger populated from stated high-level assumptions. No explicit free parameters or invented physical entities are named.

axioms (1)

domain assumption Local uncertainty and structural signals can be integrated into a unified Bayesian mechanism that infers global sequence structure
Invoked as the core of the dynamic structured generation process.

invented entities (1)

Dynamic structural inference problem no independent evidence
purpose: To model joint inference of length, boundaries, and schedule for flexible generation
Formulated explicitly as the central modeling choice; no external falsifiable handle provided in abstract.

pith-pipeline@v0.9.0 · 5535 in / 1188 out tokens · 52186 ms · 2026-05-12T02:00:54.442590+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We model the prior over these latent variables as a structured factorization: p(Z(t)) = p(Lt) p(P(t)|Lt,α) p(τ(t)|P(t)). ... CRP prior over block partitions ... p(bg=1|mg,αg(t)) = αg(t)/(mg+αg(t))
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Posterior over block partitions ... MAP inference arg max log p(P(t)|O(t),Lt,αg(t))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott...

work page 1901
[2]

Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

work page 2024
[3]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[4]

Kakade, Timothy Ngotiaoco, Sitan Chen, and Michael Samuel Albergo

Jaeyeon Kim, Lee Cheuk Kit, Carles Domingo-Enrich, Yilun Du, Sham M. Kakade, Timothy Ngotiaoco, Sitan Chen, and Michael Samuel Albergo. Any-order flexible length masked diffusion. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[5]

Beyond masks: Efficient, flexible diffusion language models via deletion- insertion processes

Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, and Jiacheng Sun. Beyond masks: Efficient, flexible diffusion language models via deletion- insertion processes. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[6]

Beyond fixed: Training-free variable-length denoising for diffusion large language models

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Beyond fixed: Training-free variable-length denoising for diffusion large language models. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[7]

The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies.Journal of the ACM (JACM), 57(2):1–30, 2010

David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies.Journal of the ACM (JACM), 57(2):1–30, 2010

work page 2010
[8]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[9]

Argmax flows and multinomial diffusion: Learning categorical distributions.Advances in neural information processing systems, 34:12454–12465, 2021

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions.Advances in neural information processing systems, 34:12454–12465, 2021

work page 2021
[10]

Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022

Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in neural information processing systems, 35:4328–4343, 2022

work page 2022
[11]

Latent diffusion energy-based model for interpretable text modelling

Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, and Ying Nian Wu. Latent diffusion energy-based model for interpretable text modelling. InInternational Conference on Machine Learning, pages 25702–25720. PMLR, 2022

work page 2022
[12]

Step-unrolled denoising autoencoders for text generation

Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. Step-unrolled denoising autoencoders for text generation. InInternational Conference on Learning Representations, 2022

work page 2022
[13]

DiffusER: Diffusion via edit-based reconstruction

Machel Reid, Vincent Josua Hellendoorn, and Graham Neubig. DiffusER: Diffusion via edit-based reconstruction. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[14]

Likelihood-based diffusion language models

Ishaan Gulrajani and Tatsunori Hashimoto. Likelihood-based diffusion language models. InThirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[15]

Diffusionbert: Improving generative masked language models with diffusion models

Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuan-Jing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 4521–4534, 2023

work page 2023
[16]

Diffuseq: Sequence to sequence text generation with diffusion models

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. InThe Eleventh International Conference on Learning Representations, 2023. 15 Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

work page 2023
[17]

Latent diffusion for language generation.Advances in Neural Information Processing Systems, 36:56998–57025, 2023

Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation.Advances in Neural Information Processing Systems, 36:56998–57025, 2023

work page 2023
[18]

Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385, 2024

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385, 2024

work page 2024
[19]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InForty-first International Conference on Machine Learning, 2024

work page 2024
[20]

Xing, and Zhiting Hu

Guangyi Liu, Yu Wang, Zeyu Feng, Qiyu Wu, Liping Tang, Yuan Gao, Zhen Li, Shuguang Cui, Julian McAuley, Zichao Yang, Eric P. Xing, and Zhiting Hu. Unified generation, reconstruction, and representation: Generalized diffusion with adaptive latent encoding-decoding. InForty-first International Conference on Machine Learning, 2024

work page 2024
[21]

Simplified and generalized masked diffusion for discrete data

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[22]

Scaling up masked diffusion models on text

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[23]

Discrete copula diffusion

Anji Liu, Oliver Broadrick, Mathias Niepert, and Guy Van den Broeck. Discrete copula diffusion. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[24]

Beyond autoregression: Discrete diffusion for complex reasoning and planning

Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[25]

Energy-based diffusion language models for text generation

Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, and Arash Vahdat. Energy-based diffusion language models for text generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[26]

Scaling diffusion language models via adaptation from autoregressive models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, and Lingpeng Kong. Scaling diffusion language models via adaptation from autoregressive models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[27]

Beyond autoregression: Fast LLMs via self-distillation through time

Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLMs via self-distillation through time. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[28]

Generalized interpolating discrete diffusion

Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion. InForty-second International Conference on Machine Learning, 2025

work page 2025
[29]

Think while you generate: Discrete diffusion with planned denoising

Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stark, Yilun Xu, Tommi Jaakkola, and Rafael Gomez-Bombarelli. Think while you generate: Discrete diffusion with planned denoising. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[30]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Subham Sekhar Sahoo, Aaron Gokaslan, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Justin T Chiu, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[31]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[32]

The diffusion duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and V olodymyr Kuleshov. The diffusion duality. InForty-second International Conference on Machine Learning, 2025

work page 2025
[33]

Susskind, and Navdeep Jaitly

Ruixiang ZHANG, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua M. Susskind, and Navdeep Jaitly. Target concrete score matching: A holistic framework for discrete diffusion. InForty-second International Conference on Machine Learning, 2025

work page 2025
[34]

Kakade, and Sitan Chen

Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. InForty-second International Conference on Machine Learning, 2025

work page 2025
[35]

Anchored diffusion language model

Litu Rout, Constantine Caramanis, and Sanjay Shakkottai. Anchored diffusion language model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[36]

Fast and fluent diffusion language models via convolutional decoding and rejective fine-tuning

Yeongbin Seo, Dongha Lee, Jaehyung Kim, and Jinyoung Yeo. Fast and fluent diffusion language models via convolutional decoding and rejective fine-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 16 Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

work page 2025
[37]

Self-speculative decoding accelerates lossless inference in any-order and any-subset autoregressive models

Gabe Guo and Stefano Ermon. Self-speculative decoding accelerates lossless inference in any-order and any-subset autoregressive models. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[38]

Hierarchy decoding: A training-free parallel decoding strategy for diffusion large language models

Xiaojing Qi, Lun Du, Xinyuan Zhang, Lanning Wei, Tao Jin, and Da Zheng. Hierarchy decoding: A training-free parallel decoding strategy for diffusion large language models. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[39]

Adablock-dLLM: Semantic-aware diffusion LLM inference via adaptive block size

Guanxi Lu, Hao Mark Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, and Hongxiang Fan. Adablock-dLLM: Semantic-aware diffusion LLM inference via adaptive block size. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[40]

When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models

Danny Wang, Ruihong Qiu, and Zi Huang. When to commit? towards variable-size self-contained blocks for discrete diffusion language models.arXiv preprint arXiv:2604.23994, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

Dream 7b: Diffusion large language models, 2025

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models, 2025

work page 2025
[42]

A framework for few-shot language model evaluation, 12 2023

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

work page 2023
[43]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

work page 2021
[44]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021
[45]

Program synthesis with large language models, 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021

work page 2021
[46]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, et al. Evaluating large language...

work page 2021
[47]

Challenging BIG-bench tasks and whether chain- of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain- of-thought can solve them. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: AC...

work page 2023