Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Chenxin An; Hao Peng; Jiacheng Ye; Jiawei Han; Lingpeng Kong; Lin Zheng; Mukai Li; Peilin Zhao; Shansan Gong; Shivam Agarwal

arxiv: 2410.17891 · v3 · pith:GX4VJYB6new · submitted 2024-10-23 · 💻 cs.CL

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, Lingpeng Kong

Pith reviewed 2026-05-20 19:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords: diffusion language models, autoregressive adaptation, continual pre-training, text generation, model scaling, infilling, in-context learning

The pith

Autoregressive models can be converted into competitive diffusion language models through continual pre-training at scales up to 7B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that connections between autoregressive and diffusion objectives support a straightforward adaptation process to turn existing AR models into diffusion ones. This method uses far less training data than building from scratch and produces models that match AR performance on language, reasoning, and commonsense tasks while adding diffusion-specific capabilities like infilling. A sympathetic reader would see this as a practical route to scaling diffusion language models without discarding the large body of pretrained AR checkpoints. If the adaptation holds across scales, it reduces the barrier to exploring non-autoregressive generation at the sizes that matter for real applications.

Core claim

By identifying links between AR and diffusion modeling objectives, the authors introduce a continual pre-training procedure that converts AR models ranging from 127M to 7B parameters into diffusion models called DiffuGPT and DiffuLLaMA; these adapted models outperform earlier diffusion language models and remain competitive with their AR origins on standard benchmarks after training on fewer than 200B tokens.

What carries the argument

Continual pre-training that transfers AR models to diffusion objectives by aligning their respective loss formulations and generation processes.
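The summary above does not spell out the diffusion objective, so the following is only a minimal sketch under stated assumptions: it uses the common absorbing-state (masked) formulation, in which each token is replaced by a mask with probability t and the cross-entropy on masked positions is reweighted by 1/t. The names `forward_mask`, `masked_diffusion_loss`, and `uniform_denoiser` are hypothetical, the denoiser is a toy stand-in for an adapted AR network, and any architectural changes the paper makes during adaptation are not shown.

```python
import math
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Absorbing forward process (assumed): mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_diffusion_loss(tokens, predict_probs, t, rng):
    """Single-sample estimate of a 1/t-weighted cross-entropy over masked positions."""
    noisy = forward_mask(tokens, t, rng)
    probs = predict_probs(noisy)  # one {token: probability} dict per position
    nll = 0.0
    for clean, seen, dist in zip(tokens, noisy, probs):
        if seen == MASK:  # only masked positions contribute to the loss
            nll -= math.log(dist[clean])
    return nll / t, noisy

def uniform_denoiser(vocab):
    """Toy stand-in for the network: predicts a uniform distribution everywhere."""
    def predict(noisy):
        return [{tok: 1.0 / len(vocab) for tok in vocab} for _ in noisy]
    return predict

vocab = ["the", "cat", "sat", "down"]
rng = random.Random(0)
loss, noisy = masked_diffusion_loss(
    ["the", "cat", "sat"], uniform_denoiser(vocab), t=0.5, rng=rng
)
```

With a uniform denoiser, the loss reduces to the number of masked positions times log(vocabulary size), divided by t, which makes the reweighting easy to check by hand.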

If this is right

Diffusion language models become feasible at the same parameter counts where autoregressive models are currently dominant.
Practitioners can reuse existing AR checkpoints to obtain models that support infilling and other non-left-to-right generation without reordering prompts.
The performance gap between diffusion and autoregressive paradigms narrows when both start from the same pretrained base.
Training compute for new diffusion models can be reduced to a small fraction of what would be required from random initialization.
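To make the infilling point in the list above concrete, here is a minimal sketch of how a masked diffusion model could fill a gap while the prompt stays in its natural order. Everything in it is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in for a real model, its confidence rule is invented, and the confidence-first unmasking schedule is one common choice rather than the procedure described in the paper.

```python
MASK = "[MASK]"

def toy_denoiser(seq):
    """Hypothetical stand-in for a diffusion LM.

    Returns {position: (token, confidence)} for every masked position.
    A real model would predict tokens from context on both sides.
    """
    guesses = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            left = seq[i - 1] if i > 0 else None
            right = seq[i + 1] if i + 1 < len(seq) else None
            # Invented rule: more visible neighbours means higher confidence.
            conf = sum(n not in (None, MASK) for n in (left, right)) / 2
            guesses[i] = (f"<fill:{i}>", conf)
    return guesses

def infill(prefix, suffix, n_missing, denoise, per_step=1):
    """Place masks between prefix and suffix, then unmask the most confident positions first."""
    seq = list(prefix) + [MASK] * n_missing + list(suffix)
    while MASK in seq:
        guesses = denoise(seq)
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = guesses[i][0]
    return seq

result = infill(["The", "cat"], ["the", "mat", "."], n_missing=3, denoise=toy_denoiser)
```

The prefix and suffix are passed in reading order and come back untouched; only the masked span in the middle is rewritten, which is the property the list item refers to.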

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This adaptation route could be applied to other non-autoregressive paradigms by first aligning their objectives to those of mature AR models.
The result suggests that the main remaining differences between AR and diffusion models lie in sampling efficiency and controllable generation rather than in fundamental capacity.
Future scaling studies could test whether the same conversion works when starting from instruction-tuned or multimodal AR bases.

Load-bearing premise

The assumption that objective connections between autoregressive and diffusion training allow adaptation to preserve competitive performance without major degradation at any scale.

What would settle it

A controlled experiment in which the adapted diffusion models show large, consistent drops in perplexity or benchmark scores relative to their AR starting points after the described continual pre-training.
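One way to operationalize "large, consistent drops" is sketched below. The 10% threshold, the function names, and the scores are all placeholders chosen for illustration; none of them come from the paper, and the check assumes higher-is-better metrics (a perplexity comparison would need the sign flipped).

```python
def relative_drops(ar_scores, adapted_scores):
    """Relative change per benchmark; negative means the adapted model is worse."""
    return {name: (adapted_scores[name] - ar) / ar for name, ar in ar_scores.items()}

def large_consistent_drop(ar_scores, adapted_scores, threshold=0.10):
    """True only if every benchmark falls by at least `threshold` (an assumed cutoff)."""
    drops = relative_drops(ar_scores, adapted_scores)
    return all(change <= -threshold for change in drops.values())

# Placeholder numbers, NOT results from the paper.
ar = {"task_a": 50.0, "task_b": 80.0}
adapted_close = {"task_a": 49.0, "task_b": 78.0}
adapted_far = {"task_a": 40.0, "task_b": 60.0}

close_flag = large_consistent_drop(ar, adapted_close)
far_flag = large_consistent_drop(ar, adapted_far)
```

Requiring the drop on every benchmark, rather than on average, matches the word "consistent" in the criterion above; a single regressed task would not trigger it.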

Original abstract

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (127M-355M-7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions (https://github.com/HKUNLP/DiffuLLaMA).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

You can adapt AR models up to 7B into diffusion versions with under 200B tokens and stay competitive, but tighter checks on likelihood and adaptation ablations would make the scaling claim more convincing.

The main thing to know is that this paper gives a workable way to convert open autoregressive models like GPT-2 and LLaMA into diffusion models at scales up to 7B parameters. They use a continual pre-training step based on objective connections between the two paradigms and train with less than 200B tokens, then report results that beat earlier diffusion language models while staying close to the original AR performance on language modeling, reasoning, and commonsense tasks. The released models also show practical diffusion advantages like infilling without prompt reordering and solid in-context learning.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes adapting autoregressive language models (GPT-2 and LLaMA families, ranging from 127M to 7B parameters) into diffusion language models (DiffuGPT and DiffuLLaMA) via a continual pre-training approach that leverages connections between AR and diffusion objectives. Using less than 200B tokens, the resulting models are evaluated on language modeling, reasoning, and commonsense benchmarks, where they outperform prior DLMs and remain competitive with their AR counterparts while supporting capabilities such as infilling without reordering and instruction following. The authors release the model suite and code.

Significance. If the adaptation results hold at scale, the work is significant for providing a practical route to large-scale diffusion language models by repurposing existing AR checkpoints, thereby addressing training challenges for DLMs. The systematic evaluation across multiple scales, the public release of 127M–7B models, and the demonstration of non-autoregressive generation features constitute concrete strengths that could accelerate research on alternatives to pure autoregressive text modeling.

major comments (3)

[§4 (Experimental Results)] The claim that DiffuLLaMA-7B remains competitive with the original LLaMA without major degradation rests on benchmark scores, but the manuscript provides no direct comparison of marginal likelihood or validation perplexity between the adapted diffusion model and the frozen AR baseline on the same held-out distribution, nor any ablation on token budget sufficiency below 200B tokens.
[§3 (Adaptation Method)] While objective connections are used to justify continual pre-training, the paper omits intermediate checkpoint analysis or explicit measurement of how well the diffusion objective aligns with the original AR likelihood during adaptation, leaving open whether reported downstream competitiveness at 7B scale reflects true closure of the objective gap or evaluation masking.
[Benchmark tables, e.g., language modeling and reasoning results] Outperformance over earlier DLMs and competitiveness claims lack reported error bars, precise data splits, and full hyperparameter details, which are load-bearing for assessing reliability of the scaling conclusions across model sizes.

minor comments (2)

[Abstract and §3] The abstract and method sections could more explicitly state the exact token counts and training steps used for each model size (127M, 355M, 7B) to improve reproducibility.
[Figures and §2] Figure captions and notation for the diffusion process would benefit from clearer cross-references to the AR objective equations to highlight the claimed connections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment point-by-point below, providing clarifications and indicating the revisions planned for the next manuscript version.

Referee: [§4 (Experimental Results)] The claim that DiffuLLaMA-7B remains competitive with the original LLaMA without major degradation rests on benchmark scores, but the manuscript provides no direct comparison of marginal likelihood or validation perplexity between the adapted diffusion model and the frozen AR baseline on the same held-out distribution, nor any ablation on token budget sufficiency below 200B tokens.

Authors: We agree that a direct comparison of marginal likelihood or validation perplexity on a shared held-out set would provide additional evidence for the competitiveness claim. Exact marginal likelihood computation for diffusion models requires Monte Carlo approximations that are not directly equivalent to the AR negative log-likelihood, which complicates head-to-head reporting; our evaluation therefore relies on the standard suite of downstream benchmarks used throughout the diffusion LM literature. Regarding token budget, the 200B figure was chosen after smaller-scale pilot runs, but we did not include an explicit ablation in the main text. In the revision we will add a brief discussion of this limitation together with any available scaling curves from our internal experiments.
Revision: partial

Referee: [§3 (Adaptation Method)] While objective connections are used to justify continual pre-training, the paper omits intermediate checkpoint analysis or explicit measurement of how well the diffusion objective aligns with the original AR likelihood during adaptation, leaving open whether reported downstream competitiveness at 7B scale reflects true closure of the objective gap or evaluation masking.

Authors: Section 3 derives the formal connection between the AR and diffusion objectives to motivate the continual pre-training procedure. While intermediate checkpoint diagnostics were not reported in the initial submission, we can extract and include the diffusion training loss trajectory alongside the original AR loss evaluated on the same adaptation data. This addition will make the degree of objective alignment explicit and help rule out evaluation masking at the 7B scale.
Revision: yes

Referee: [Benchmark tables, e.g., language modeling and reasoning results] Outperformance over earlier DLMs and competitiveness claims lack reported error bars, precise data splits, and full hyperparameter details, which are load-bearing for assessing reliability of the scaling conclusions across model sizes.

Authors: We acknowledge that error bars, exact evaluation splits, and complete hyperparameter specifications are necessary for assessing the reliability of the scaling trends. In the revised manuscript we will (i) report standard deviations or confidence intervals for all main benchmark numbers where multiple runs are available, (ii) specify the precise train/validation/test splits and any decontamination steps, and (iii) add a dedicated appendix table listing all training hyperparameters for each model size.
Revision: yes
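
As an illustration of the error bars promised in the last response, the sketch below computes a mean, a sample standard deviation, and a normal-approximation 95% interval over repeated runs. The function name `summarize_runs` and the scores are hypothetical placeholders, not numbers from the paper, and with only a handful of seeds a Student's t interval would be wider than the z = 1.96 approximation used here.

```python
import math
import statistics

def summarize_runs(scores, z=1.96):
    """Return (mean, sample std, CI half-width) for benchmark scores from repeated runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = z * std / math.sqrt(len(scores))
    return mean, std, half_width

# Placeholder scores from three hypothetical seeds, NOT numbers from the paper.
mean, std, half_width = summarize_runs([61.0, 63.0, 62.0])
```

A table cell would then read as mean ± half-width, with the number of runs stated alongside so readers can judge how much weight the interval can bear.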

Circularity Check

0 steps flagged

Empirical adaptation and benchmark results are self-contained with no reduction to fitted inputs or self-citations

The paper's derivation chain consists of demonstrating objective connections between autoregressive and diffusion modeling, followed by a continual pre-training procedure to adapt existing AR models (GPT2, LLaMA) into DiffuGPT and DiffuLLaMA. Performance claims rest on training runs using <200B tokens and direct evaluation against external benchmarks for language modeling, reasoning, and commonsense tasks. No equations or steps reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the present work. The central competitiveness result is an empirical outcome measured against independent baselines rather than a renaming or self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on empirical adaptation success rather than new theoretical constructs; it invokes standard machine learning assumptions about optimization and objective transfer.

axioms (1)

Domain assumption: Connections between autoregressive and diffusion objectives permit effective continual pre-training without substantial performance loss.
Invoked to justify the adaptation approach from AR models to DLMs.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Dynamic Chunking for Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 7.0

The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.
Discrete Langevin-Inspired Posterior Sampling
cs.LG 2026-05 unverdicted novelty 7.0

ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
cs.CL 2026-05 unverdicted novelty 7.0

FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
cs.CV 2026-04 unverdicted novelty 7.0

BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
cs.CL 2026-04 unverdicted novelty 7.0

LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
cs.CL 2025-05 conditional novelty 7.0

Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
Coupling Models for One-Step Discrete Generation
cs.LG 2026-05 unverdicted novelty 6.0

Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
Continuous Latent Diffusion Language Model
cs.CL 2026-05 unverdicted novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
Measuring Temporal Linguistic Emergence in Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 6.0

In diffusion language models, coarse linguistic labels stabilize earlier than exact token identity, uncertainty tracks correctness, and mid-trajectory states are most sensitive to perturbations.
Differences in Text Generated by Diffusion and Autoregressive Language Models
cs.CL 2026-04 unverdicted novelty 6.0

DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
cs.RO 2025-11 unverdicted novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
cs.AI 2025-10 unverdicted novelty 6.0

Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.
Diffusion Language Models Know the Answer Before Decoding
cs.CL 2025-08 conditional novelty 6.0

DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
cs.CL 2026-05 unverdicted novelty 5.0

TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.
Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
cs.CL 2026-04 unverdicted novelty 5.0

AHD uses real-time stability monitoring with dynamic anchors to allow early cross-block decoding of converged tokens, cutting steps by up to 80% and raising performance on benchmarks like BBH.
Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation
cs.SE 2026-05 unverdicted novelty 4.0

Static checking rewards and moderate AST-based hints improve diffusion RL performance for code generation, with effectiveness varying by task difficulty across HumanEval, MBPP, and LiveCodeBench.

Reference graph

Works this paper leans on

176 extracted references · 176 canonical work pages · cited by 20 Pith papers · 14 internal anchors

[1]

Improved Denoising Diffusion Probabilistic Models , volume =

Alexander Quinn Nichol and Prafulla Dhariwal , booktitle =. Improved Denoising Diffusion Probabilistic Models , volume =

work page
[2]

Denoising Diffusion Probabilistic Models , year =

Jonathan Ho and Ajay Jain and Pieter Abbeel , booktitle =. Denoising Diffusion Probabilistic Models , year =

work page
[3]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Likelihood-Based Diffusion Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[4]

The Twelfth International Conference on Learning Representations , year=

Large Language Models Cannot Self-Correct Reasoning Yet , author=. The Twelfth International Conference on Learning Representations , year=

work page
[5]

Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng , booktitle =

work page
[6]

Diffusion-LM Improves Controllable Text Generation , year =

Li, Xiang Lisa and Thickstun, John and Gulrajani, Ishaan and Liang, Percy and Hashimoto, Tatsunori B , booktitle =. Diffusion-LM Improves Controllable Text Generation , year =

work page
[7]

and Eisner, Jason , booktitle =

Lin, Chu-Cheng and Jaech, Aaron and Li, Xin and Gormley, Matthew R. and Eisner, Jason , booktitle =. Limitations of Autoregressive Models and Their Alternatives , year =

work page
[8]

Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution , year =

Lou, Aaron and Meng, Chenlin and Ermon, Stefano , booktitle =. Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution , year =

work page
[9]

Forty-first International Conference on Machine Learning, ICML , year=

The Pitfalls of Next-Token Prediction , author=. Forty-first International Conference on Machine Learning, ICML , year=

work page
[10]

Gpt-4 technical report , volume =

OpenAI , journal =. Gpt-4 technical report , volume =

work page
[11]

ArXiv preprint , title =

Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth. ArXiv preprint , title =

work page
[13]

Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert. Language Models are Few-Shot Learners , year =. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Inform...

work page 2020
[15]

FlashAttention: Fast and Memory-Efficient Exact Attention with

Tri Dao and Daniel Y Fu and Stefano Ermon and Atri Rudra and Christopher Re , booktitle=. FlashAttention: Fast and Memory-Efficient Exact Attention with

work page
[17]

International Conference on Learning Representations , year=

Score-Based Generative Modeling through Stochastic Differential Equations , author=. International Conference on Learning Representations , year=

work page
[18]

A Reparameterized Discrete Diffusion Model for Text Generation , year =

Lin Zheng and Jianbo Yuan and Lei Yu and Lingpeng Kong , booktitle =. A Reparameterized Discrete Diffusion Model for Text Generation , year =

work page
[19]

International Conference on Learning Representations , year=

Autoregressive Diffusion Models , author=. International Conference on Learning Representations , year=

work page
[20]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , volume =

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and ichter, brian and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny , booktitle =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , volume =

work page
[22]

Susskind and Navdeep Jaitly , booktitle =

Yizhe Zhang and Jiatao Gu and Zhuofeng Wu and Shuangfei Zhai and Joshua M. Susskind and Navdeep Jaitly , booktitle =

work page
[24]

Johnson and Jonathan Ho and Daniel Tarlow and Rianne van den Berg , booktitle =

Jacob Austin and Daniel D. Johnson and Jonathan Ho and Daniel Tarlow and Rianne van den Berg , booktitle =. Structured Denoising Diffusion Models in Discrete State-Spaces , year =

work page
[28]

Generative Modeling by Estimating Gradients of the Data Distribution , year =

Yang Song and Stefano Ermon , booktitle =. Generative Modeling by Estimating Gradients of the Data Distribution , year =

work page
[29]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics , volume =

Jascha Sohl. Deep Unsupervised Learning using Nonequilibrium Thermodynamics , volume =. Proc. of ICML , editor =

work page
[30]

Variational diffusion models , volume =

Kingma, Diederik and Salimans, Tim and Poole, Ben and Ho, Jonathan , journal =. Variational diffusion models , volume =

work page
[32]

Attention is All you Need , volume =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , volume =

work page
[34]

International Conference on Machine Learning, ICML , year=

CLLMs: Consistency Large Language Models , author=. International Conference on Machine Learning, ICML , year=

work page
[38]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

work page
[40]

2024 , eprint=

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , author=. 2024 , eprint=

work page 2024
[41]

Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R and Hestness, Joel and Dey, Nolan , title =

work page
[42]

Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions , year =

Emiel Hoogeboom and Didrik Nielsen and Priyank Jaini and Patrick Forr. Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions , year =. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual , editor =

work page 2021
[43]

The Curious Case of Neural Text Degeneration , author=

work page
[44]

Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

work page
[46]

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page
[47]

and Zettlemoyer, Luke , title =

Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke , title =. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics , month =. 2017 , address =

work page 2017
[50]

ArXiv , year=

Training Verifiers to Solve Math Word Problems , author=. ArXiv , year=

work page
[51]

Text summarization branches out , year=

Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , year=

work page
[54]

2024 , eprint=

OLMES: A Standard for Language Model Evaluations , author=. 2024 , eprint=

work page 2024
[56]

International Conference on Learning Representations , year =

Amortizing intractable inference in large language models , author =. International Conference on Learning Representations , year =

work page
[57]

ArXiv , year=

TravelPlanner: A Benchmark for Real-World Planning with Language Agents , author=. ArXiv , year=

work page
[58]

Diffusion Models Beat

Prafulla Dhariwal and Alexander Quinn Nichol , booktitle=. Diffusion Models Beat

work page
[59]

Proceedings of the 38th International Conference on Machine Learning , pages =

Zero-Shot Text-to-Image Generation , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

work page 2021
[60]

and Sifre, Laurent , title =

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

work page 2024
[61]

Dao, Tri , booktitle=. Flash

work page
[62]

Transactions on Machine Learning Research , issn=

StarCoder: may the source be with you! , author=. Transactions on Machine Learning Research , issn=. 2023 , note=

work page 2023
[63]

Language Models are Unsupervised Multitask Learners , author=

work page
[64]

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence , articleno =

Li, Yifan and Zhou, Kun and Zhao, Wayne Xin and Wen, Ji-Rong , title =. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence , articleno =. 2023 , isbn =

work page 2023
[66]

Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning , volume =

Ye, Jiasheng and Zheng, Zaixiang and Bao, Yu and Qian, Lihua and Gu, Quanquan , journal =. Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning , volume =

work page
[67]

OpenWebText Corpus , author=

work page
[68]

2024 , eprint=

TinyLlama: An Open-Source Small Language Model , author=. 2024 , eprint=

work page 2024
[72]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo

work page
[75]

The Eleventh International Conference on Learning Representations , year=

Continual Pre-training of Language Models , author=. The Eleventh International Conference on Learning Representations , year=

work page
[77]

Yukang Chen and Shengju Qian and Haotian Tang and Xin Lai and Zhijian Liu and Song Han and Jiaya Jia , booktitle=. LongLo

work page
[78]

The Twelfth International Conference on Learning Representations , year=

Lemur: Harmonizing Natural Language and Code for Language Agents , author=. The Twelfth International Conference on Learning Representations , year=

work page
[79]

Transactions on Machine Learning Research , issn=

Emergent Abilities of Large Language Models , author=. Transactions on Machine Learning Research , issn=

work page
[80]

2022 , eprint=

A Survey for In-context Learning , author=. 2022 , eprint=

work page 2022
[81]

Hierarchical text-conditional image generation with clip latents , volume =

Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark , journal =. Hierarchical text-conditional image generation with clip latents , volume =

work page
[82]

Denoising Diffusion Implicit Models , year =

Jiaming Song and Chenlin Meng and Stefano Ermon , booktitle =. Denoising Diffusion Implicit Models , year =

[83]

Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, and Weizhu Chen. In Proceedings of the 40th International Conference on Machine Learning, 2023.

[84]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.

[85]

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. Non-Autoregressive Neural Machine Translation.

[86]

Do Language Models Plan Ahead for Future Tokens? In First Conference on Language Modeling.

[88]

Diffusion for World Modeling: Visual Details Matter in Atari, 2024.

[89]

Better & Faster Large Language Models via Multi-token Prediction. In Forty-first International Conference on Machine Learning.

[90]

ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

[91]

Decoupled Weight Decay Regularization. In International Conference on Learning Representations.

[92]

Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.

[94]

Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.

[96]

Self-Infilling Code Generation. In Forty-first International Conference on Machine Learning.

[98]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked Generative Image Transformer.

[99]

Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence.

[100]

Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning, 2024.

[101]

Stream of Search (SoS): Learning to Search in Language. In First Conference on Language Modeling.

[102]

Efficient Training of Language Models to Fill in the Middle, 2022.

[103]

Efficient Continual Pre-training by Mitigating the Stability Gap, 2024.

[104]

Simple and Effective Masked Diffusion Language Models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[105]

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Pr...

[106]

Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Forty-first International Conference on Machine Learning, ICML, 2024.

[107]

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022a.

[108]

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle, 2022b.

[109]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

[110]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

[111]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11305-11315, 2022.
