Masked Diffusion Decoding as $x$-Prediction Flow

Akash Kumar; Cecilia De La Parra; Lianlei Shan; Shubham Rai; Weitian Wang

arxiv: 2606.29066 · v1 · pith:DD3ZAPAJnew · submitted 2026-06-27 · 💻 cs.CL

Pith reviewed 2026-06-30 09:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords masked diffusion language models · x-prediction flow · continuous decoding · asynchronous update · reinforcement learning policy · HumanEval benchmark · LLaDA model · diffusion decoding efficiency

The pith

Reinterpreting mask prediction as x-prediction induces a continuous flow in embedding space that lets tokens accumulate partial, revisable progress during masked diffusion decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard masked diffusion decoders force an all-or-nothing choice at each step, committing a position to one token or leaving it masked and discarding intermediate predictive signals. The paper shows that treating the mask predictor as an x-predictor produces a continuous flow through input embeddings, so each token can build up fractional belief across steps while staying open to revision. This flow is paired with token-wise asynchronous updates driven by per-position confidence and with a lightweight policy network trained by reinforcement learning to respect the uneven constraints typical of language. When the resulting decoder is applied to a pretrained model, it retains nearly all baseline quality while using far fewer steps. A reader would care because the change directly attacks the budget inefficiency that arises when diffusion models must generate under tight step limits.

Core claim

By reinterpreting mask prediction as clean-state (x) prediction, the standard binary unmasking process of masked diffusion language models can be replaced by a continuous flow in input embedding space. In this flow, each token position accumulates partial progress across diffusion steps and remains revisable rather than locked into an early irrevocable commitment. The global synchronous schedule is replaced by a confidence-based asynchronous update that respects position-specific contextual constraints, and a lightweight policy network trained via reinforcement learning selects which positions to advance. Applied to the pretrained LLaDA model, the resulting continuous decoder reaches 97 perc

What carries the argument

The x-prediction flow that converts each mask-prediction step into a continuous update of the clean-state embedding, allowing partial token representations to accumulate and be revised.
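One way to picture this update is the minimal sketch below. It assumes a linear path between the mask embedding and the clean embedding, $z_t = (1-t)\,e_{\text{mask}} + t\,x$, so that an x-prediction $\hat{x}$ implies the velocity $(\hat{x} - z_t)/(1-t)$. The paper's actual parameterization is not visible in this summary, and the toy embedding table, `expected_embedding`, and `flow_step` are hypothetical names introduced only for illustration.

```python
# Toy sketch of an x-prediction flow step in embedding space.
# Assumption (not from the paper): the path is linear, z_t = (1 - t) * e_mask + t * x.

def expected_embedding(probs, table):
    """x-prediction: probability-weighted average of token embeddings."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]

def flow_step(z, x_hat, t, dt):
    """Move z from progress t to t + dt along the velocity (x_hat - z) / (1 - t)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    dt = min(dt, 1.0 - t)          # never overshoot the clean state
    scale = dt / (1.0 - t)
    return [zi + scale * (xi - zi) for zi, xi in zip(z, x_hat)], t + dt

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_mask = [0.0, 0.0]

x_hat = expected_embedding([0.5, 0.5, 0.0], table)   # model is torn between tokens 0 and 1
z, t = flow_step(e_mask, x_hat, t=0.0, dt=0.5)       # partial progress, still revisable
x_hat = expected_embedding([1.0, 0.0, 0.0], table)   # a later step revises toward token 0
z, t = flow_step(z, x_hat, t=t, dt=0.5)              # final step lands exactly on x_hat
```

In this toy run the position first moves halfway toward a blend of two tokens, then the revised prediction pulls it fully onto token 0. That is the "partial, revisable progress" behaviour in miniature, under the linear-path assumption stated above.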

If this is right

Tokens receive updates asynchronously according to their individual confidence levels rather than a fixed global schedule.
A reinforcement-learned policy network can guide which positions advance at each step without requiring changes to the underlying pretrained model.
Generation quality is preserved under substantially reduced step counts by avoiding premature irrevocable token commitments.
The continuous representation in embedding space supplies richer intermediate signals than binary mask-or-unmask decisions.
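The asynchronous schedule in the first point can be sketched as follows. This is only an illustration: it assumes confidence is the top-1 probability and that each position advances its own progress by `base_dt * confidence`. Neither rule is confirmed by the summary, and `confidence`, `async_advance`, and `base_dt` are hypothetical names.

```python
# Toy sketch of a confidence-based asynchronous schedule.
# Assumption (not from the paper): confidence is the top-1 probability and
# each position advances its own progress by base_dt * confidence.

def confidence(probs):
    """Per-position confidence, taken here as the largest predicted probability."""
    return max(probs)

def async_advance(progress, probs_per_pos, base_dt):
    """Advance each position's progress independently, capped at 1.0."""
    return [min(1.0, t + base_dt * confidence(p))
            for t, p in zip(progress, probs_per_pos)]

# Three positions: one tightly constrained by context, two less so.
probs_per_pos = [
    [0.90, 0.05, 0.05],   # confident: moves fast
    [0.50, 0.30, 0.20],   # moderately confident
    [0.34, 0.33, 0.33],   # nearly uniform: barely moves
]
progress = [0.0, 0.0, 0.0]
for _ in range(2):
    progress = async_advance(progress, probs_per_pos, base_dt=0.5)
```

A globally synchronous schedule would instead give all three positions the same progress value after every step, regardless of how constrained each one is.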

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same embedding-space flow might be applied to other discrete diffusion models outside language to improve step efficiency.
Reduced decoding budgets could lower inference latency and energy cost for large-scale text generation without retraining the base model.
The revisable partial beliefs could be combined with external signals such as retrieval or constraint satisfaction during the diffusion process.

Load-bearing premise

Partial progress accumulated in embedding space via x-prediction flow accurately represents intermediate beliefs and can be revised without introducing compounding errors that the final discrete sampling cannot recover from.

What would settle it

Applying the continuous decoder to LLaDA on HumanEval and measuring whether performance stays at or above 97 percent of the discrete baseline when the step budget is reduced to 25 percent would directly test the central efficiency claim.
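The test described above reduces to two ratios, which the following sketch makes explicit. It assumes the budget is counted in decoding steps and that quality is a single scalar score. The function names and every number in it are placeholders for illustration, not results from the paper.

```python
# Check of the "97% of baseline quality at 25% of the budget" criterion.
# All numbers below are placeholders for illustration, not results from the paper.

def retained_fraction(continuous_score, baseline_score):
    """Quality of the continuous decoder relative to the discrete baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return continuous_score / baseline_score

def budget_fraction(steps_used, baseline_steps):
    """Share of the baseline's decoding steps that the continuous decoder used."""
    if baseline_steps <= 0:
        raise ValueError("baseline_steps must be positive")
    return steps_used / baseline_steps

def claim_holds(continuous_score, baseline_score, steps_used, baseline_steps,
                min_retained=0.97, max_budget=0.25):
    """True when quality stays at or above min_retained within max_budget."""
    return (retained_fraction(continuous_score, baseline_score) >= min_retained
            and budget_fraction(steps_used, baseline_steps) <= max_budget)

# Placeholder scores (e.g. a pass rate in percent) and step counts.
ok = claim_holds(continuous_score=39.0, baseline_score=40.0,
                 steps_used=64, baseline_steps=256)
too_slow = claim_holds(continuous_score=39.0, baseline_score=40.0,
                       steps_used=128, baseline_steps=256)
```

With these placeholder inputs the first call satisfies both conditions (0.975 retained at a 0.25 budget), while the second fails only because it uses half of the baseline steps.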

Figures

Figures reproduced from arXiv: 2606.29066 by Akash Kumar, Cecilia De La Parra, Lianlei Shan, Shubham Rai, Weitian Wang.

**Figure 2.** Training losses during x-prediction alignment. The MSE curve tracks masked and unmasked positions, while the CE curve compares the aligned model against the pretrained LLaDA reference at masked positions.

Prompt filtering. Rather than training on the full MBPP training split, we first run the pretrained LLaDA-8B-Instruct on every training problem and keep only the 164 problems that it can already solve. The…

Original abstract

Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens, but their standard decoder reduces each step to a binary action: a position is either committed to a single token or left fully masked, with no representation of partial belief in between. This all-or-nothing regime discards rich predictive information and forces premature, irrevocable commitments, leading to poor performance under a limited decoding budget. In this paper, we reinterpret mask prediction as clean-state prediction ($x$-prediction) and show that it can be used to induce a continuous flow in input embedding space. Building on this view, we propose a continuous decoding framework for MDLMs where tokens can accumulate partial progress at each diffusion step and remain revisable. To match the uneven contextual constraints across positions in language, we replace the globally synchronous schedule in image diffusion with a confidence-based asynchronous update in which the diffusion progress is token-wise accumulated. Additionally, we introduce a lightweight policy network and formulate its training as a reinforcement learning problem. Applied to pretrained LLaDA, our continuous decoder reaches 97% of its performance on the HumanEval dataset with 25% of decoding budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper reframes masked diffusion decoding as continuous x-prediction flow in embeddings with async updates and RL timing, but the efficiency claim has no visible experimental backing yet.

The paper's main contribution is reinterpreting mask prediction in MDLMs as clean-state x-prediction to create a continuous flow in embedding space. This lets tokens accumulate partial progress across steps instead of binary commit-or-mask decisions, paired with confidence-based asynchronous scheduling and a lightweight RL policy to control commitment timing. Applied to LLaDA, it claims to hit 97% of baseline performance on HumanEval using only 25% of the decoding budget.

What stands out is the shift from synchronous global steps to token-wise accumulation and the use of RL for the policy. That addresses a real practical issue in diffusion LMs where compute budgets are tight and premature commitments hurt quality. The framing as x-prediction flow is a clean way to justify keeping states revisable.

The soft spots are in the evidence. The abstract states the performance result without error bars, ablations on the flow itself, or checks on whether embedding-space interpolation stays semantically valid. The base model was trained only on discrete masked prediction, so linear or policy-driven accumulation in embeddings has no direct training signal; drift could compound before the final discrete step. The stress-test note on revisable partial beliefs is worth taking seriously until the paper shows the trajectory actually corrects errors rather than locking them in.

This is for researchers already working on masked diffusion or efficient inference for language models. It deserves peer review because the idea is a coherent extension of existing MDLM work and the efficiency angle matters for deployment, even if the current write-up needs experiments to back the central assumption.

Referee Report

3 major / 2 minor

Summary. The manuscript reinterprets mask prediction in masked diffusion language models as clean-state (x) prediction to induce a continuous flow in input embedding space. It proposes a continuous decoder allowing tokens to accumulate partial progress across diffusion steps, using a confidence-based asynchronous (token-wise) update schedule in place of global synchrony, plus a lightweight policy network trained via reinforcement learning. Applied to the pretrained LLaDA model, the continuous decoder is reported to reach 97% of baseline performance on HumanEval while using only 25% of the decoding budget.

Significance. If the central assumption holds—that embedding-space accumulation via x-prediction produces revisable intermediate states whose errors remain correctable by final discrete sampling—the result would demonstrate a practical route to substantially lower inference cost for diffusion-based text generation under tight budgets. The work supplies a concrete empirical outcome on a held-out coding benchmark together with an explicit RL formulation for the policy, both of which are strengths.

major comments (3)

[Abstract / experimental results] Abstract and experimental section: the headline claim that the continuous decoder reaches 97% of LLaDA performance on HumanEval with 25% budget is presented without error bars, number of runs, ablation isolating the continuous-flow component from the asynchronous schedule or RL policy, or any direct measurement of whether intermediate embedding states remain semantically valid. This leaves the load-bearing performance result unsupported by the visible evidence.
[§3] §3 (reinterpretation as x-prediction flow): the claim that mask-to-clean prediction induces a continuous, revisable flow in embedding space rests on the untested assumption that linear or policy-driven interpolation between discrete embeddings produces intermediate states that accurately reflect partial beliefs. Because the base LLaDA model was trained exclusively on discrete masked-token objectives, no training signal guarantees semantic validity of these interpolations; accumulated drift under a 25% budget could therefore be irrecoverable by the final discrete sampling step.
[Policy network / RL formulation] Policy-network section: the RL objective is defined downstream of the embedding trajectory, so it can at best mitigate rather than prevent compounding interpolation errors. No analysis is supplied showing that the learned policy actually keeps trajectories within the region where final discrete recovery succeeds.
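The first major comment asks for error bars and run counts. A minimal sketch of that reporting, assuming one scalar score per independent run, is given below. The helper name `mean_and_stderr` and the per-run scores are placeholders, not data from the paper.

```python
import math
import statistics

def mean_and_stderr(scores):
    """Mean and standard error of the mean over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))

# Placeholder per-seed scores, for illustration only.
run_scores = [38.0, 40.0, 39.0, 39.0]
mean_score, stderr = mean_and_stderr(run_scores)
summary = f"{mean_score:.1f} +/- {stderr:.1f} (n={len(run_scores)})"
```

The same helper could be applied to each ablation arm (flow only, flow plus asynchronous schedule, full method) so that the arms are compared with their spreads rather than as single numbers.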

minor comments (2)

[Methods] Notation: the distinction between the original mask-prediction head and the reinterpreted x-prediction head should be made explicit with an equation or diagram early in the methods section.
[Introduction / Related work] The manuscript should include a short related-work paragraph contrasting the proposed asynchronous schedule with prior continuous or flow-based decoding methods in diffusion language models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate where the manuscript will be revised.

Referee: [Abstract / experimental results] Abstract and experimental section: the headline claim that the continuous decoder reaches 97% of LLaDA performance on HumanEval with 25% budget is presented without error bars, number of runs, ablation isolating the continuous-flow component from the asynchronous schedule or RL policy, or any direct measurement of whether intermediate embedding states remain semantically valid. This leaves the load-bearing performance result unsupported by the visible evidence.

Authors: We agree that error bars, explicit reporting of run counts, and component ablations would strengthen the empirical claims. In revision we will add these elements to the experimental section, including multiple-run statistics and ablations that isolate the continuous-flow, asynchronous schedule, and RL policy contributions. Direct measurement of intermediate embedding validity is not currently quantified; we will add a discussion of this gap together with any available proxy observations from the existing runs.

revision: yes

Referee: [§3] §3 (reinterpretation as x-prediction flow): the claim that mask-to-clean prediction induces a continuous, revisable flow in embedding space rests on the untested assumption that linear or policy-driven interpolation between discrete embeddings produces intermediate states that accurately reflect partial beliefs. Because the base LLaDA model was trained exclusively on discrete masked-token objectives, no training signal guarantees semantic validity of these interpolations; accumulated drift under a 25% budget could therefore be irrecoverable by the final discrete sampling step.

Authors: The x-prediction reinterpretation follows from the mathematical structure of the diffusion process itself. While the base model was trained on discrete objectives, the empirical performance under reduced budget provides indirect support that the induced flow remains useful. We will revise §3 to state the assumption explicitly, discuss the risk of irrecoverable drift, and note that the final discrete sampling step is intended to correct residual errors.

revision: partial

Referee: [Policy network / RL formulation] Policy-network section: the RL objective is defined downstream of the embedding trajectory, so it can at best mitigate rather than prevent compounding interpolation errors. No analysis is supplied showing that the learned policy actually keeps trajectories within the region where final discrete recovery succeeds.

Authors: The RL objective optimizes the policy for final-task reward, thereby selecting update decisions that empirically lead to successful recovery. We will add trajectory-level analysis in the revision (e.g., confidence evolution and comparison against non-RL schedules) to demonstrate that the learned policy favors recoverable paths.

revision: yes
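To make "optimizes the policy for final-task reward" concrete, here is a minimal sketch that uses plain REINFORCE as a stand-in. The summary does not say which RL algorithm, features, or reward the paper uses. The two-parameter logistic policy over a confidence feature, the `toy_reward` function, and all hyperparameters are assumptions introduced for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def advance_prob(theta, conf):
    """Probability that the policy advances a position with the given confidence."""
    return sigmoid(theta[0] * conf + theta[1])

def reinforce_update(theta, episode, reward, baseline, lr):
    """One REINFORCE step: theta += lr * (reward - baseline) * grad log pi."""
    g0 = g1 = 0.0
    for conf, action in episode:
        p = advance_prob(theta, conf)
        # Gradient of log Bernoulli(action; p) with respect to the logit is (action - p).
        g0 += (action - p) * conf
        g1 += (action - p)
    adv = reward - baseline
    return [theta[0] + lr * adv * g0, theta[1] + lr * adv * g1]

def toy_reward(episode):
    """Hypothetical stand-in for task reward: fraction of 'sensible' decisions,
    where sensible means advancing exactly the positions with confidence > 0.5."""
    good = sum(1 for conf, action in episode if action == (1 if conf > 0.5 else 0))
    return good / len(episode)

rng = random.Random(0)        # fixed seed so the sketch is repeatable
theta = [0.0, 0.0]
baseline = 0.5
for _ in range(500):
    confs = [rng.random() for _ in range(8)]
    episode = [(c, 1 if rng.random() < advance_prob(theta, c) else 0) for c in confs]
    r = toy_reward(episode)
    theta = reinforce_update(theta, episode, r, baseline, lr=0.1)
    baseline = 0.9 * baseline + 0.1 * r   # running-average baseline to reduce variance
```

The reward here is computed only at the end of an episode, which mirrors the referee's point: such an objective can favor decisions that lead to good outcomes, but it does not by itself show that trajectories stay in a recoverable region.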

Circularity Check

0 steps flagged

No circularity: empirical benchmark result with independent content

The paper's central claim is an empirical performance ratio (97% of baseline on HumanEval at 25% budget) obtained by applying a continuous decoder to a pretrained LLaDA model. No equations, fitted parameters, or self-citations are presented that reduce any prediction or uniqueness claim to the input data or prior author work by construction. The reinterpretation of mask prediction as x-prediction is introduced as a modeling choice whose validity is tested downstream on held-out code generation, not presupposed. The RL policy is trained on the same task objective, not on a circular fit. This is the common case of a self-contained applied result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be extracted or verified.

pith-pipeline@v0.9.1-grok · 5740 in / 1010 out tokens · 28297 ms · 2026-06-30T09:24:20.674622+00:00 · methodology

