Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Pith reviewed 2026-05-20 19:54 UTC · model grok-4.3
The pith
Autoregressive models can be converted into competitive diffusion language models through continual pre-training at scales up to 7B parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By identifying links between AR and diffusion modeling objectives, the authors introduce a continual pre-training procedure that converts AR models ranging from 127M to 7B parameters into diffusion models called DiffuGPT and DiffuLLaMA. After training on fewer than 200B tokens, these adapted models outperform earlier diffusion language models and remain competitive with their AR origins on standard benchmarks.
What carries the argument
Continual pre-training that transfers AR models to diffusion objectives by aligning their respective loss formulations and generation processes.
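To make the idea of aligned loss formulations concrete, here is a minimal sketch, not the paper's implementation. It assumes an absorbing-state (masked) discrete diffusion objective with a 1/t reweighting, which is a common formulation in the masked diffusion literature but is not spelled out in this summary. The function names `ar_loss` and `masked_diffusion_loss` and the toy probabilities are hypothetical.

```python
import math

def ar_loss(log_probs):
    """Autoregressive NLL: every position is scored once, left to right.

    log_probs[i] is assumed to be log p(x_i | x_<i) from some model.
    """
    return -sum(log_probs)

def masked_diffusion_loss(log_probs, masked, t):
    """One-sample masked-diffusion loss at noise level t in (0, 1].

    log_probs[i] is assumed to be log p(x_i | partially masked sequence);
    masked[i] is True where the token was replaced by a mask token.
    Only masked positions are scored, reweighted by 1/t (an assumed,
    commonly used weighting, not confirmed by the summary above).
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    return -sum(lp for lp, m in zip(log_probs, masked) if m) / t

# Toy numbers: four tokens, each predicted with probability 0.5.
lp = [math.log(0.5)] * 4
full_mask = [True, True, True, True]
half_mask = [True, False, True, False]

print(ar_loss(lp))
print(masked_diffusion_loss(lp, full_mask, 1.0))
print(masked_diffusion_loss(lp, half_mask, 0.5))
```

Both losses are sums of per-token cross-entropies; under these assumptions they differ only in which positions are scored and how the sum is weighted, which is the kind of structural overlap that would make continual pre-training from an AR checkpoint plausible.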
If this is right
- Diffusion language models become feasible at the same parameter counts where autoregressive models are currently dominant.
- Practitioners can reuse existing AR checkpoints to obtain models that support infilling and other non-left-to-right generation without reordering prompts.
- The performance gap between diffusion and autoregressive paradigms narrows when both start from the same pretrained base.
- Training compute for new diffusion models can be reduced to a small fraction of what would be required from random initialization.
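The infilling point in the list above can be illustrated with a small sketch of how the two kinds of model might receive the same fill-in-the-middle task. Everything here is illustrative: the mask token string, the sentinel strings, and both helper functions are hypothetical and are not taken from the paper or its released code.

```python
MASK = "[MASK]"  # hypothetical mask token name

def diffusion_infill_input(prefix, suffix, n_middle):
    """Keep natural order; the gap is a run of mask tokens to be denoised."""
    return prefix + [MASK] * n_middle + suffix

def ar_fim_input(prefix, suffix):
    """Reordered prompt for a left-to-right model (fill-in-the-middle style).

    The sentinel strings are hypothetical placeholders; the model would
    then generate the middle after the last sentinel.
    """
    return ["<PRE>"] + prefix + ["<SUF>"] + suffix + ["<MID>"]

prefix = ["def", "add", "(", "a", ",", "b", ")", ":"]
suffix = ["return", "result"]

print(diffusion_infill_input(prefix, suffix, 3))
print(ar_fim_input(prefix, suffix))
```

In the first layout the suffix stays where it belongs in the text, so no prompt reordering is needed. In the second, the suffix has to be moved ahead of the generation point because a left-to-right model can only condition on what precedes it.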
Where Pith is reading between the lines
- This adaptation route could be applied to other non-autoregressive paradigms by first aligning their objectives to those of mature AR models.
- The result suggests that the main remaining differences between AR and diffusion models lie in sampling efficiency and controllable generation rather than in fundamental capacity.
- Future scaling studies could test whether the same conversion works when starting from instruction-tuned or multimodal AR bases.
Load-bearing premise
The assumption that objective connections between autoregressive and diffusion training allow adaptation to preserve competitive performance without major degradation at any scale.
What would settle it
A controlled experiment in which the adapted diffusion models show large, consistent drops in perplexity or benchmark scores relative to their AR starting points after the described continual pre-training.
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (127M-355M-7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions (https://github.com/HKUNLP/DiffuLLaMA).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes adapting autoregressive language models (GPT-2 and LLaMA families, ranging from 127M to 7B parameters) into diffusion language models (DiffuGPT and DiffuLLaMA) via a continual pre-training approach that leverages connections between AR and diffusion objectives. Using less than 200B tokens, the resulting models are evaluated on language modeling, reasoning, and commonsense benchmarks, where they outperform prior DLMs and remain competitive with their AR counterparts while supporting capabilities such as infilling without reordering and instruction following. The authors release the model suite and code.
Significance. If the adaptation results hold at scale, the work is significant for providing a practical route to large-scale diffusion language models by repurposing existing AR checkpoints, thereby addressing training challenges for DLMs. The systematic evaluation across multiple scales, the public release of 127M–7B models, and the demonstration of non-autoregressive generation features constitute concrete strengths that could accelerate research on alternatives to pure autoregressive text modeling.
major comments (3)
- [§4 (Experimental Results)] The claim that DiffuLLaMA-7B remains competitive with the original LLaMA without major degradation rests on benchmark scores, but the manuscript provides no direct comparison of marginal likelihood or validation perplexity between the adapted diffusion model and the frozen AR baseline on the same held-out distribution, nor any ablation on token budget sufficiency below 200B tokens.
- [§3 (Adaptation Method)] While objective connections are used to justify continual pre-training, the paper omits intermediate checkpoint analysis or explicit measurement of how well the diffusion objective aligns with the original AR likelihood during adaptation, leaving open whether reported downstream competitiveness at 7B scale reflects true closure of the objective gap or evaluation masking.
- [Benchmark tables] Benchmark tables (e.g., language modeling and reasoning results): Outperformance over earlier DLMs and competitiveness claims lack reported error bars, precise data splits, and full hyperparameter details, which are load-bearing for assessing reliability of the scaling conclusions across model sizes.
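The first major comment asks for a likelihood or perplexity comparison. For a masked diffusion model such a number is usually a Monte Carlo estimate of a bound rather than an exact quantity. The sketch below shows the shape of such an estimator under stated assumptions: an absorbing-state process with independent masking at rate t, a 1/t weighting, and a uniform stand-in denoiser so that the true value is known in closed form. The function names and the `t_min` cutoff (an illustrative variance-control choice) are hypothetical and not taken from the paper.

```python
import math
import random

def uniform_log_prob(vocab_size):
    """Stand-in denoiser: assigns probability 1/V to every token."""
    return -math.log(vocab_size)

def mc_nll_bound(seq_len, vocab_size, n_samples, rng, t_min=0.05):
    """Monte Carlo estimate of a masked-diffusion NLL bound (in nats).

    Each sample draws a noise level t, masks tokens independently with
    probability t, and sums -log p over masked positions weighted by 1/t.
    """
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(t_min, 1.0)
        n_masked = sum(rng.random() < t for _ in range(seq_len))
        total += -n_masked * uniform_log_prob(vocab_size) / t
    return total / n_samples

rng = random.Random(0)
estimate = mc_nll_bound(seq_len=8, vocab_size=50, n_samples=20000, rng=rng)
exact = 8 * math.log(50)  # NLL of the uniform model, known in closed form
print(round(estimate, 2), round(exact, 2))
```

Dividing the estimate by the sequence length and exponentiating gives a perplexity-style number. With a real denoiser the estimate is noisy and only bounds the likelihood, which is one reason it is not directly interchangeable with an AR model's exact negative log-likelihood.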
minor comments (2)
- [Abstract and §3] The abstract and method sections could more explicitly state the exact token counts and training steps used for each model size (127M, 355M, 7B) to improve reproducibility.
- [Figures and §2] Figure captions and notation for the diffusion process would benefit from clearer cross-references to the AR objective equations to highlight the claimed connections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment point-by-point below, providing clarifications and indicating the revisions planned for the next manuscript version.
- Referee: [§4 (Experimental Results)] The claim that DiffuLLaMA-7B remains competitive with the original LLaMA without major degradation rests on benchmark scores, but the manuscript provides no direct comparison of marginal likelihood or validation perplexity between the adapted diffusion model and the frozen AR baseline on the same held-out distribution, nor any ablation on token budget sufficiency below 200B tokens.
  Authors: We agree that a direct comparison of marginal likelihood or validation perplexity on a shared held-out set would provide additional evidence for the competitiveness claim. Exact marginal likelihood computation for diffusion models requires Monte Carlo approximations that are not directly equivalent to the AR negative log-likelihood, which complicates head-to-head reporting; our evaluation therefore relies on the standard suite of downstream benchmarks used throughout the diffusion LM literature. Regarding token budget, the 200B figure was chosen after smaller-scale pilot runs, but we did not include an explicit ablation in the main text. In the revision we will add a brief discussion of this limitation together with any available scaling curves from our internal experiments.
  Revision: partial
- Referee: [§3 (Adaptation Method)] While objective connections are used to justify continual pre-training, the paper omits intermediate checkpoint analysis or explicit measurement of how well the diffusion objective aligns with the original AR likelihood during adaptation, leaving open whether reported downstream competitiveness at 7B scale reflects true closure of the objective gap or evaluation masking.
  Authors: Section 3 derives the formal connection between the AR and diffusion objectives to motivate the continual pre-training procedure. While intermediate checkpoint diagnostics were not reported in the initial submission, we can extract and include the diffusion training loss trajectory alongside the original AR loss evaluated on the same adaptation data. This addition will make the degree of objective alignment explicit and help rule out evaluation masking at the 7B scale.
  Revision: yes
- Referee: [Benchmark tables] Benchmark tables (e.g., language modeling and reasoning results): Outperformance over earlier DLMs and competitiveness claims lack reported error bars, precise data splits, and full hyperparameter details, which are load-bearing for assessing reliability of the scaling conclusions across model sizes.
  Authors: We acknowledge that error bars, exact evaluation splits, and complete hyperparameter specifications are necessary for assessing the reliability of the scaling trends. In the revised manuscript we will (i) report standard deviations or confidence intervals for all main benchmark numbers where multiple runs are available, (ii) specify the precise train/validation/test splits and any decontamination steps, and (iii) add a dedicated appendix table listing all training hyperparameters for each model size.
  Revision: yes
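As an illustration of what error bars on a benchmark score could look like, the sketch below computes a percentile bootstrap confidence interval for accuracy. This is a swapped-in technique, not the authors' stated plan: it resamples examples within a single evaluation run, whereas the rebuttal refers to variation across multiple runs. The function name `bootstrap_ci` and the outcome data are hypothetical.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    `correct` is a list of 0/1 outcomes, one per benchmark example.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the examples with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical outcomes: 70 correct answers out of 100 examples.
outcomes = [1] * 70 + [0] * 30
low, high = bootstrap_ci(outcomes)
print(f"accuracy = {sum(outcomes) / len(outcomes):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

An interval of this kind captures sampling noise from the finite test set only. It says nothing about variance from training seeds or decoding randomness, which would require the repeated runs the authors mention.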
Circularity Check
Empirical adaptation and benchmark results are self-contained with no reduction to fitted inputs or self-citations
The paper's derivation chain consists of demonstrating objective connections between autoregressive and diffusion modeling, followed by a continual pre-training procedure to adapt existing AR models (GPT2, LLaMA) into DiffuGPT and DiffuLLaMA. Performance claims rest on training runs using <200B tokens and direct evaluation against external benchmarks for language modeling, reasoning, and commonsense tasks. No equations or steps reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the present work. The central competitiveness result is an empirical outcome measured against independent baselines rather than a renaming or self-referential definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Connections between autoregressive and diffusion objectives permit effective continual pre-training without substantial performance loss.
Forward citations
Cited by 21 Pith papers
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Dynamic Chunking for Diffusion Language Models
  DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
- Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
  The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.
- Discrete Langevin-Inspired Posterior Sampling
  ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
- Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
  Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
- Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
  FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
- BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
  BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
  LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
  CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
  Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
- Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
  Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
- Coupling Models for One-Step Discrete Generation
  Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
- Measuring Temporal Linguistic Emergence in Diffusion Language Models
  In diffusion language models, coarse linguistic labels stabilize earlier than exact token identity, uncertainty tracks correctness, and mid-trajectory states are most sensitive to perturbations.
- Differences in Text Generated by Diffusion and Autoregressive Language Models
  DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
- AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
  AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
  Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.
- Diffusion Language Models Know the Answer Before Decoding
  DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
  TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.
- Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
  AHD uses real-time stability monitoring with dynamic anchors to allow early cross-block decoding of converged tokens, cutting steps by up to 80% and raising performance on benchmarks like BBH.
- Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation
  Static checking rewards and moderate AST-based hints improve diffusion RL performance for code generation, with effectiveness varying by task difficulty across HumanEval, MBPP, and LiveCodeBench.