pith. machine review for the scientific record.

arxiv: 2207.14255 · v1 · pith:42FNTVNGnew · submitted 2022-07-28 · 💻 cs.CL

Efficient Training of Language Models to Fill in the Middle

Pith reviewed 2026-05-18 00:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords fill-in-the-middle · infilling · autoregressive language models · data transformation · left-to-right generation · perplexity evaluation · sampling evaluation · text infilling

The pith

A simple data transformation lets autoregressive language models learn to fill in the middle without losing left-to-right generation ability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

By moving a span of text from the middle of a document to its end during training, autoregressive language models acquire the ability to infill that missing section. Experiments across model scales show that applying this change to a large portion of the data leaves standard left-to-right perplexity and sampling quality unchanged. Ablations test different frequencies for the transformation, different ways to structure it, and different methods for choosing the span. The authors recommend applying the technique by default and release both a trained model and new benchmarks.

Core claim

Autoregressive language models can learn to infill text after a straightforward transformation to the dataset that moves a span of text from the middle of a document to its end. Training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. The usefulness, simplicity, and efficiency of fill-in-the-middle training support the recommendation that future autoregressive language models be trained with this approach by default, guided by ablations on transformation frequency, structure, and span selection.

What carries the argument

The fill-in-the-middle data transformation, which moves a selected contiguous span from the middle of a document to the end so that the model learns to predict that span conditioned on both the prefix and the suffix.
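
The transformation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the function names `fim_transform` and `fim_prompt` are hypothetical placeholders, and the sketch operates on characters rather than tokens to keep it readable.

```python
# Hypothetical sentinel strings; the review above does not name the real markers.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(document: str, start: int, end: int) -> str:
    """Move document[start:end] to the end, marking prefix, suffix and middle."""
    if not 0 <= start <= end <= len(document):
        raise ValueError("span must lie inside the document")
    prefix = document[:start]
    middle = document[start:end]
    suffix = document[end:]
    # Training order: prefix, then suffix, then the span that was moved.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_prompt(prefix: str, suffix: str) -> str:
    """Same layout with the middle left out, for use as an infilling prompt."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


if __name__ == "__main__":
    print(fim_transform("abcdef", 2, 4))
    print(fim_prompt("ab", "ef"))
```

Because the moved span sits at the end of the sequence, an ordinary left-to-right model that continues from the output of `fim_prompt` is, in this sketch, generating the missing middle with both sides of the context already in view.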

If this is right

  • Future autoregressive models can gain infilling ability through data preprocessing alone while keeping their original generative performance.
  • Ablations identify practical defaults for how often to apply the transformation and how to select the span.
  • Standard left-to-right use cases remain fully supported because perplexity and sampling results stay comparable.
  • Released infilling benchmarks provide a common way to measure progress on this new capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications that need to complete text given both preceding and following context become feasible with one model.
  • The same preprocessing step may improve robustness when real-world inputs contain gaps or edits.
  • The approach could extend naturally to other sequential domains such as code or structured documents.
  • Direct scaling comparisons between fill-in-the-middle and standard training at larger sizes would clarify any differences in efficiency.

Load-bearing premise

The chosen perplexity and sampling evaluations plus the ablations on transformation frequency and span selection are enough to show no harm to left-to-right generation and to recommend the method as a default.

What would settle it

A controlled comparison in which a fill-in-the-middle model shows clearly higher perplexity or lower sampling quality on standard left-to-right tasks than a baseline model trained without the transformation at the same scale.

Original abstract

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript shows that autoregressive language models can acquire fill-in-the-middle (FIM) capability through a simple dataset transformation that relocates a middle span to the document end. Extensive training runs and ablations across scales demonstrate that a large fraction of such transformed data preserves left-to-right generative performance, as quantified by perplexity on held-out text and sampling-based checks. The authors supply best-practice defaults for transformation frequency, span selection, and structure, and recommend training future autoregressive models with FIM by default; they release their strongest FIM model and infilling benchmarks.

Significance. If the empirical results hold, the work is significant because it supplies a low-overhead route for standard autoregressive models to support both left-to-right generation and infilling without architectural changes or separate training stages. The scale of the experiments, the systematic ablations on transformation frequency and span selection, and the public release of the model together with benchmarks constitute concrete strengths that increase the result's immediate utility and reproducibility.

major comments (1)
  1. §4 (Experiments): the central claim that left-to-right (L2R) capability is preserved rests on perplexity and sampling metrics; however, the manuscript does not report whether these metrics were computed under identical prompting conditions for FIM-trained versus baseline models, leaving open the possibility that apparent parity masks a distribution shift when the model is later used in mixed L2R/FIM settings.
minor comments (3)
  1. §3.1: the definition of the FIM transformation frequency should explicitly state whether the fraction applies per-document or per-token, as this affects reproducibility of the reported ablations.
  2. Table 1: the perplexity numbers for the largest scale would benefit from an additional column showing the delta relative to the non-FIM baseline to make the 'no harm' statement immediately quantifiable.
  3. Figure 4: axis labels and legend entries use inconsistent abbreviations for the span-selection variants; standardizing them would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We address the single major comment below by clarifying the evaluation protocol and updating the manuscript accordingly.

Point-by-point responses
  1. Referee: §4 (Experiments): the central claim that L2R capability is preserved rests on perplexity and sampling metrics; however, the manuscript does not report whether these metrics were computed under identical prompting conditions for FIM-trained versus baseline models, leaving open the possibility that apparent parity masks a distribution shift when the model is later used in mixed L2R/FIM settings.

    Authors: We appreciate the referee's careful attention to this methodological detail. The perplexity metrics were computed on held-out documents using standard left-to-right autoregressive factorization for both the baseline and FIM-trained models, with identical tokenization, no FIM-specific control tokens or prefixes, and the same document-level prompting. Sampling evaluations likewise employed identical generation hyperparameters, temperature, and prompt formats across model variants. To eliminate any ambiguity regarding potential distribution shifts in mixed L2R/FIM usage, we have added an explicit paragraph in §4 describing the shared evaluation protocol and confirming that all reported L2R results reflect this consistent setup. This revision directly addresses the concern while preserving the original empirical claims.

    revision: yes
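
The shared protocol described in the rebuttal amounts to scoring the same held-out tokens with both models under the usual left-to-right factorization. A minimal sketch of that arithmetic follows, assuming per-token natural-log probabilities are already available from each model; the function names and the toy numbers are illustrative only and are not results from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_delta(fim_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity change of the FIM model versus the baseline."""
    return (fim_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Toy log-probabilities for the same held-out tokens (not real data).
    baseline = [math.log(0.25)] * 4
    fim = [math.log(0.25)] * 4
    print(perplexity(baseline), relative_delta(perplexity(fim), perplexity(baseline)))
```

Reporting `relative_delta` next to the raw perplexities is one way to make a "no harm" statement directly quantifiable, provided both models score exactly the same token sequence.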

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation

Full rationale

The paper reports direct experimental results from training autoregressive models on datasets with a middle-to-end span transformation and measuring left-to-right perplexity plus sampling quality on held-out text across model scales. No equations, derivations, or fitted parameters are presented whose outputs are defined by the inputs; the central claim rests on external benchmarks rather than self-definition or self-citation chains. This is the expected non-finding for an empirical methods paper.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard autoregressive language-model training assumptions plus the specific data-augmentation procedure; no new entities or ad-hoc axioms are introduced beyond the usual modeling choices.

free parameters (2)
  • FIM transformation frequency
    The fraction of training data that receives the middle-to-end move is a tunable hyperparameter explored via ablation.
  • infill span selection method
    How the middle span is chosen (random, prefix-suffix balanced, etc.) is ablated and affects the final recommendation.
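
How the two free parameters interact can be sketched as follows. This is a hedged illustration rather than the paper's procedure: `fim_rate`, `choose_span`, `maybe_fim`, and the sentinel strings are hypothetical names, the transformation is applied per document (whether the paper's fraction is per document or per token is exactly what the referee's minor comment 1 asks), and uniform random cut points are only one of the possible span-selection rules.

```python
import random
from typing import Tuple


def choose_span(length: int, rng: random.Random) -> Tuple[int, int]:
    """Pick two cut points uniformly at random and return them in order."""
    a = rng.randint(0, length)  # randint includes both endpoints
    b = rng.randint(0, length)
    return min(a, b), max(a, b)


def maybe_fim(document: str, fim_rate: float, rng: random.Random) -> str:
    """With probability fim_rate, move a random span to the end of the document."""
    if rng.random() >= fim_rate:
        return document  # left untouched: ordinary left-to-right example
    start, end = choose_span(len(document), rng)
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    # Hypothetical sentinel strings marking the three pieces.
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle


if __name__ == "__main__":
    rng = random.Random(0)
    for doc in ["hello world", "fill in the middle"]:
        print(maybe_fim(doc, fim_rate=0.5, rng=rng))
```

Setting `fim_rate` to 0.0 in this sketch reproduces plain left-to-right data, and 1.0 transforms every document, which makes the frequency ablation a matter of sweeping one number.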



Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  2. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  3. Evaluating Non-English Developer Support in Machine Learning for Software Engineering

    cs.SE 2026-05 unverdicted novelty 7.0

    Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

  4. "Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs

    cs.CR 2026-02 conditional novelty 7.0

    NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.

  5. InCoder: A Generative Model for Code Infilling and Synthesis

    cs.SE 2022-04 unverdicted novelty 7.0

    InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

  6. SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

  7. RuC: HDL-Agnostic Rule Completion Benchmark Generation

    cs.AR 2026-04 unverdicted novelty 6.0

    RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-...

  8. CPT: Controllable and Editable Design Variations with Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    CPT is a fine-tuned language model that uses Creative Markup Language representations of professional designs to generate controllable, stylistically coherent, and fully editable design variations.

  9. Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production

    cs.CL 2025-11 unverdicted novelty 6.0

    A new past-conditioned future predictability measure best explains phonetic reduction and substitution error identity in naturalistic speech, subsuming backward predictability.

  10. Mercury: Ultra-Fast Language Models Based on Diffusion

    cs.CL 2025-06 unverdicted novelty 6.0

    Mercury Coder diffusion LLMs achieve throughputs of 1109 and 737 tokens per second on H100 GPUs, up to 10x faster than frontier models with comparable quality.

  11. Qwen2.5-1M Technical Report

    cs.CL 2025-01 accept novelty 6.0

    Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.

  12. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  13. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  14. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  15. SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models

    cs.CR 2025-10 unverdicted novelty 5.0

    SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.

  16. Smart Paste: Automatically Fixing Copy/Paste for Google Developers

    cs.SE 2025-10 unverdicted novelty 5.0

    Smart Paste applies deep learning to predict and suggest post-paste code edits in Google's IDE, achieving 45% acceptance and contributing over 1% of all code written company-wide after deployment.

  17. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    cs.SE 2024-01 unverdicted novelty 5.0

    DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

  18. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  19. Continuous diffusion for categorical data

    cs.CL 2022-11 unverdicted novelty 5.0

    The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.

  20. Qwen2.5-Coder Technical Report

    cs.CL 2024-09 unverdicted novelty 4.0

    Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
