arxiv: 2207.14255 · v1 · pith:42FNTVNGnew · submitted 2022-07-28 · 💻 cs.CL

Efficient Training of Language Models to Fill in the Middle

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen

Pith reviewed 2026-05-18 00:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords fill-in-the-middle, infilling, autoregressive language models, data transformation, left-to-right generation, perplexity evaluation, sampling evaluation, text infilling

The pith

A simple data transformation lets autoregressive language models learn to fill in the middle without losing left-to-right generation ability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

By moving a span of text from the middle of a document to its end during training, autoregressive language models acquire the ability to infill that missing section. Experiments across model scales show that applying this change to a large portion of the data leaves standard left-to-right perplexity and sampling quality unchanged. Ablations test different frequencies for the transformation, different ways to structure it, and different methods for choosing the span. The authors recommend applying the technique by default and release both a trained model and new benchmarks.

Core claim

Autoregressive language models can learn to infill text after a straightforward transformation to the dataset that moves a span of text from the middle of a document to its end. Training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. The usefulness, simplicity, and efficiency of fill-in-the-middle training support the recommendation that future autoregressive language models be trained with this approach by default, guided by ablations on transformation frequency, structure, and span selection.

What carries the argument

The fill-in-the-middle data transformation, which moves a selected contiguous span from the middle of a document to its end, so that the model learns to generate the span conditioned on both the prefix and the suffix.
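The middle-to-end move can be sketched in a few lines. This is a minimal character-level illustration, not the authors' implementation: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>` and the function names are placeholders chosen for this example, and a real pipeline would presumably operate on tokens with dedicated special tokens.

```python
# Hypothetical sentinel markers used only for this illustration.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(document: str, start: int, end: int) -> str:
    """Move document[start:end] to the end, marking each piece with a sentinel."""
    if not 0 <= start <= end <= len(document):
        raise ValueError("span must satisfy 0 <= start <= end <= len(document)")
    prefix = document[:start]
    middle = document[start:end]
    suffix = document[end:]
    # Ordering: prefix, then suffix, then the relocated middle span.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_inverse(transformed: str) -> str:
    """Rebuild the original document (assumes it contains no sentinel strings)."""
    if not transformed.startswith(PRE):
        raise ValueError("input does not start with the prefix sentinel")
    body = transformed[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix
```

Because the transformed string is still an ordinary sequence, a left-to-right model trained on it sees the prefix and suffix before it has to predict the middle, which is the property the claim depends on.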

If this is right

Future autoregressive models can gain infilling ability through data preprocessing alone while keeping their original generative performance.
Ablations identify practical defaults for how often to apply the transformation and how to select the span.
Standard left-to-right use cases remain fully supported because perplexity and sampling results stay comparable.
Released infilling benchmarks provide a common way to measure progress on this new capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applications that need to complete text given both preceding and following context become feasible with one model.
The same preprocessing step may improve robustness when real-world inputs contain gaps or edits.
The approach could extend naturally to other sequential domains such as code or structured documents.
Direct scaling comparisons between fill-in-the-middle and standard training at larger sizes would clarify any differences in efficiency.
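As a rough illustration of the first point above, infilling at inference time could look like the following sketch. The sentinel strings and the `generate` callable are hypothetical stand-ins for whatever special tokens and model interface a given system exposes; nothing here reflects a specific API.

```python
from typing import Callable

# Hypothetical sentinel markers; a real model would define its own.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Lay out the known context so the model continues with the missing middle."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


def infill(prefix: str, suffix: str, generate: Callable[[str], str]) -> str:
    """Ask `generate` for the middle and splice it between prefix and suffix."""
    middle = generate(build_infill_prompt(prefix, suffix))
    return prefix + middle + suffix
```

Here `generate` is any function that maps a prompt string to a completion string, so the same left-to-right sampling loop used for ordinary generation can be reused unchanged.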

Load-bearing premise

The chosen perplexity and sampling evaluations plus the ablations on transformation frequency and span selection are enough to show no harm to left-to-right generation and to recommend the method as a default.

What would settle it

A controlled comparison in which a fill-in-the-middle model shows clearly higher perplexity or lower sampling quality on standard left-to-right tasks than a baseline model trained without the transformation at the same scale.
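The comparison described above reduces to computing left-to-right perplexity for both models on the same held-out tokens. The sketch below assumes each model can supply per-token natural-log probabilities; the helper names are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap(baseline_ppl: float, fim_ppl: float) -> float:
    """Fractional increase of the FIM model's perplexity over the baseline."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (fim_ppl - baseline_ppl) / baseline_ppl
```

A gap near zero at matched scale and data would be consistent with the "no harm" claim, while a clearly positive gap would count against it. What threshold counts as "clearly" is a judgment this sketch does not settle.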

Original abstract

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript shows that autoregressive language models can acquire fill-in-the-middle (FIM) capability through a simple dataset transformation that relocates a middle span to the document end. Extensive training runs and ablations across scales demonstrate that a large fraction of such transformed data preserves left-to-right generative performance, as quantified by perplexity on held-out text and sampling-based checks. The authors supply best-practice defaults for transformation frequency, span selection, and structure, and recommend training future autoregressive models with FIM by default; they release their strongest FIM model and infilling benchmarks.

Significance. If the empirical results hold, the work is significant because it supplies a low-overhead route for standard autoregressive models to support both left-to-right generation and infilling without architectural changes or separate training stages. The scale of the experiments, the systematic ablations on transformation frequency and span selection, and the public release of the model together with benchmarks constitute concrete strengths that increase the result's immediate utility and reproducibility.

major comments (1)

§4 (Experiments): the central claim that left-to-right (L2R) capability is preserved rests on perplexity and sampling metrics; however, the manuscript does not report whether these metrics were computed under identical prompting conditions for FIM-trained versus baseline models, leaving open the possibility that apparent parity masks a distribution shift when the model is later used in mixed L2R/FIM settings.

minor comments (3)

§3.1: the definition of the FIM transformation frequency should explicitly state whether the fraction applies per-document or per-token, as this affects reproducibility of the reported ablations.
Table 1: the perplexity numbers for the largest scale would benefit from an additional column showing the delta relative to the non-FIM baseline to make the 'no harm' statement immediately quantifiable.
Figure 4: axis labels and legend entries use inconsistent abbreviations for the span-selection variants; standardizing them would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We address the single major comment below by clarifying the evaluation protocol and updating the manuscript accordingly.

Point-by-point responses

Referee: §4 (Experiments): the central claim that left-to-right (L2R) capability is preserved rests on perplexity and sampling metrics; however, the manuscript does not report whether these metrics were computed under identical prompting conditions for FIM-trained versus baseline models, leaving open the possibility that apparent parity masks a distribution shift when the model is later used in mixed L2R/FIM settings.

Authors: We appreciate the referee's careful attention to this methodological detail. The perplexity metrics were computed on held-out documents using standard left-to-right autoregressive factorization for both the baseline and FIM-trained models, with identical tokenization, no FIM-specific control tokens or prefixes, and the same document-level prompting. Sampling evaluations likewise employed identical generation hyperparameters, temperature, and prompt formats across model variants. To eliminate any ambiguity regarding potential distribution shifts in mixed L2R/FIM usage, we have added an explicit paragraph in §4 describing the shared evaluation protocol and confirming that all reported L2R results reflect this consistent setup. This revision directly addresses the concern while preserving the original empirical claims.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation

Full rationale

The paper reports direct experimental results from training autoregressive models on datasets with a middle-to-end span transformation and measuring left-to-right perplexity plus sampling quality on held-out text across model scales. No equations, derivations, or fitted parameters are presented whose outputs are defined by the inputs; the central claim rests on external benchmarks rather than self-definition or self-citation chains. This is the expected non-finding for an empirical methods paper.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard autoregressive language-model training assumptions plus the specific data-augmentation procedure; no new entities or ad-hoc axioms are introduced beyond the usual modeling choices.

free parameters (2)

FIM transformation frequency: the fraction of training data that receives the middle-to-end move is a tunable hyperparameter explored via ablation.
Infill span selection method: how the middle span is chosen (random, prefix-suffix balanced, etc.) is ablated and affects the final recommendation.
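The two free parameters can be made concrete with a small sketch. It assumes the frequency is applied per document and that the span is chosen by drawing two cut points uniformly at random; both are assumptions made for illustration, and the source does not confirm either as the paper's recommended setting. The sentinel strings and the names `fim_rate`, `sample_span`, and `maybe_fim` are likewise hypothetical.

```python
import random
from typing import Tuple

# Hypothetical sentinel markers used only for this illustration.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def sample_span(length: int, rng: random.Random) -> Tuple[int, int]:
    """Draw two cut points uniformly from 0..length and return them in order."""
    a = rng.randint(0, length)  # randint is inclusive at both ends
    b = rng.randint(0, length)
    return (a, b) if a <= b else (b, a)


def maybe_fim(document: str, fim_rate: float, rng: random.Random) -> str:
    """With probability fim_rate, move a random span to the end; else keep as is."""
    if not 0.0 <= fim_rate <= 1.0:
        raise ValueError("fim_rate must lie in [0, 1]")
    if rng.random() >= fim_rate:
        return document  # left untouched: ordinary left-to-right example
    start, end = sample_span(len(document), rng)
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```

Passing an explicit `random.Random` instance keeps the preprocessing reproducible. Switching to a per-token rate or a different span sampler would change only `maybe_fim` and `sample_span`, respectively.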

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Paper passage: "we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales"

IndisputableMonolith.Foundation.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL · 2025-02 · unverdicted · novelty 8.0
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation
cs.GR · 2026-05 · unverdicted · novelty 7.0
MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

Evaluating Non-English Developer Support in Machine Learning for Software Engineering
cs.SE · 2026-05 · unverdicted · novelty 7.0
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

"Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs
cs.CR · 2026-02 · conditional · novelty 7.0
NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.

InCoder: A Generative Model for Code Infilling and Synthesis
cs.SE · 2022-04 · unverdicted · novelty 7.0
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
cs.SE · 2026-05 · unverdicted · novelty 6.0
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

RuC: HDL-Agnostic Rule Completion Benchmark Generation
cs.AR · 2026-04 · unverdicted · novelty 6.0
RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-...

CPT: Controllable and Editable Design Variations with Language Models
cs.LG · 2026-04 · unverdicted · novelty 6.0
CPT is a fine-tuned language model that uses Creative Markup Language representations of professional designs to generate controllable, stylistically coherent, and fully editable design variations.

Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production
cs.CL · 2025-11 · unverdicted · novelty 6.0
A new past-conditioned future predictability measure best explains phonetic reduction and substitution error identity in naturalistic speech, subsuming backward predictability.

Mercury: Ultra-Fast Language Models Based on Diffusion
cs.CL · 2025-06 · unverdicted · novelty 6.0
Mercury Coder diffusion LLMs achieve throughputs of 1109 and 737 tokens per second on H100 GPUs, up to 10x faster than frontier models with comparable quality.

Qwen2.5-1M Technical Report
cs.CL · 2025-01 · accept · novelty 6.0
Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.

StarCoder 2 and The Stack v2: The Next Generation
cs.SE · 2024-02 · accept · novelty 6.0
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

The Falcon Series of Open Language Models
cs.CL · 2023-11 · conditional · novelty 6.0
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Textbooks Are All You Need
cs.CL · 2023-06 · unverdicted · novelty 6.0
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
cs.CR · 2025-10 · unverdicted · novelty 5.0
SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.

Smart Paste: Automatically Fixing Copy/Paste for Google Developers
cs.SE · 2025-10 · unverdicted · novelty 5.0
Smart Paste applies deep learning to predict and suggest post-paste code edits in Google's IDE, achieving 45% acceptance and contributing over 1% of all code written company-wide after deployment.

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE · 2024-01 · unverdicted · novelty 5.0
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

StarCoder: may the source be with you!
cs.CL · 2023-05 · accept · novelty 5.0
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Continuous diffusion for categorical data
cs.CL · 2022-11 · unverdicted · novelty 5.0
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.

Qwen2.5-Coder Technical Report
cs.CL · 2024-09 · unverdicted · novelty 4.0
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
