Efficient Training of Language Models to Fill in the Middle
Pith reviewed 2026-05-18 00:36 UTC · model grok-4.3
The pith
A simple data transformation lets autoregressive language models learn to fill in the middle without losing left-to-right generation ability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autoregressive language models can learn to infill text after a straightforward transformation of the dataset that moves a span of text from the middle of a document to its end. Training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. The usefulness, simplicity, and efficiency of fill-in-the-middle training support the recommendation that future autoregressive language models be trained with this approach by default, using the settings identified by ablations on transformation frequency, transformation structure, and span selection.
What carries the argument
The fill-in-the-middle data transformation, which moves a selected contiguous span from the middle of a document to its end so that the model sees both the prefix and the suffix before it has to predict the span.
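As a minimal sketch of this transformation (not the paper's implementation), the Python below cuts a document at two random character positions and re-emits it as prefix, suffix, then middle. The sentinel strings `<PRE>`, `<SUF>`, `<MID>`, `<EOT>` and the function names `fim_transform` and `fim_inverse` are placeholders assumed for illustration; a real pipeline would work on tokens and reserve dedicated special tokens in its own tokenizer.

```python
import random

# Hypothetical sentinel strings (an assumption for this sketch).
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"


def fim_transform(doc: str, rng: random.Random) -> str:
    """Move a randomly chosen contiguous span of `doc` to the end."""
    # Two sorted cut points split the document into prefix / middle / suffix.
    i, j = sorted(rng.randint(0, len(doc)) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Prefix and suffix come first, so an ordinary left-to-right model
    # has seen both by the time it predicts the middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"


def fim_inverse(example: str) -> str:
    """Recover the original document from a transformed example.

    Assumes the document itself never contains the sentinel strings.
    """
    body = example[len(PRE):-len(EOT)]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix
```

Because the transformation only reorders text, `fim_inverse(fim_transform(doc, rng))` returns `doc` unchanged, which is a convenient sanity check for a preprocessing pipeline.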
If this is right
- Future autoregressive models can gain infilling ability through data preprocessing alone while keeping their original generative performance.
- Ablations identify practical defaults for how often to apply the transformation and how to select the span.
- Standard left-to-right use cases remain fully supported because perplexity and sampling results stay comparable.
- Released infilling benchmarks provide a common way to measure progress on this new capability.
Where Pith is reading between the lines
- Applications that need to complete text given both preceding and following context become feasible with one model.
- The same preprocessing step may improve robustness when real-world inputs contain gaps or edits.
- The approach could extend naturally to other sequential domains such as code or structured documents.
- Direct scaling comparisons between fill-in-the-middle and standard training at larger sizes would clarify any differences in efficiency.
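To illustrate the first point above, the sketch below shows how a single model trained this way might be prompted to complete text given both preceding and following context. The sentinel strings and the helper names `build_infill_prompt` and `splice_completion` are hypothetical, and the model call itself is left out; the exact prompt format depends on how a given model was trained.

```python
# Hypothetical sentinel strings (an assumption for this sketch).
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"


def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Lay out the known context so the model continues with the missing middle."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


def splice_completion(prefix: str, suffix: str, completion: str) -> str:
    """Insert the generated middle between prefix and suffix.

    Anything the model emits after the end-of-infill sentinel is discarded.
    """
    middle = completion.split(EOT, 1)[0]
    return prefix + middle + suffix
```

In use, the string from `build_infill_prompt` would be sent to the model as an ordinary left-to-right prompt, and the raw completion passed to `splice_completion`.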
Load-bearing premise
The chosen perplexity and sampling evaluations plus the ablations on transformation frequency and span selection are enough to show no harm to left-to-right generation and to recommend the method as a default.
What would settle it
A controlled comparison in which a fill-in-the-middle model shows clearly higher perplexity or lower sampling quality on standard left-to-right tasks than a baseline model trained without the transformation at the same scale.
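Such a comparison comes down to computing perplexity for both models on the same held-out tokens and inspecting the gap. The sketch below assumes per-token natural-log probabilities have already been obtained from each model; the function names `perplexity` and `relative_gap` are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap(baseline_ppl: float, fim_ppl: float) -> float:
    """Fractional perplexity change of the FIM model relative to the baseline.

    A positive value means the FIM-trained model is worse on this text.
    """
    return (fim_ppl - baseline_ppl) / baseline_ppl
```

How large a gap counts as "clearly higher" is a judgment call that the comparison itself would need to fix in advance, for example relative to run-to-run variation.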
Original abstract
We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript shows that autoregressive language models can acquire fill-in-the-middle (FIM) capability through a simple dataset transformation that relocates a middle span to the end of the document. Extensive training runs and ablations across scales demonstrate that training on a large fraction of such transformed data preserves left-to-right generative performance, as quantified by perplexity on held-out text and sampling-based checks. The authors supply best-practice defaults for transformation frequency, span selection, and structure, and recommend training future autoregressive models with FIM by default; they release their strongest FIM model and infilling benchmarks.
Significance. If the empirical results hold, the work is significant because it supplies a low-overhead route for standard autoregressive models to support both left-to-right generation and infilling without architectural changes or separate training stages. The scale of the experiments, the systematic ablations on transformation frequency and span selection, and the public release of the model together with benchmarks constitute concrete strengths that increase the result's immediate utility and reproducibility.
major comments (1)
- §4 (Experiments): the central claim that L2R capability is preserved rests on perplexity and sampling metrics; however, the manuscript does not report whether these metrics were computed under identical prompting conditions for FIM-trained versus baseline models, leaving open the possibility that apparent parity masks a distribution shift when the model is later used in mixed L2R/FIM settings.
minor comments (3)
- §3.1: the definition of the FIM transformation frequency should explicitly state whether the fraction applies per-document or per-token, as this affects reproducibility of the reported ablations.
- Table 1: the perplexity numbers for the largest scale would benefit from an additional column showing the delta relative to the non-FIM baseline to make the 'no harm' statement immediately quantifiable.
- Figure 4: axis labels and legend entries use inconsistent abbreviations for the span-selection variants; standardizing them would improve readability.
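The per-document versus per-token question raised in the §3.1 comment can be made concrete with a small sketch. The code below implements the per-document reading, in which each document is independently transformed with probability `fim_rate`. The function name `apply_fim_per_document` and the `transform` callable are assumptions for illustration; which reading the paper uses is exactly what the comment asks the authors to state.

```python
import random
from typing import Callable, Iterable, Iterator


def apply_fim_per_document(
    docs: Iterable[str],
    fim_rate: float,
    transform: Callable[[str], str],
    rng: random.Random,
) -> Iterator[str]:
    """Transform each document independently with probability `fim_rate`."""
    for doc in docs:
        # rng.random() lies in [0, 1), so a rate of 0.0 never transforms
        # and a rate of 1.0 always does.
        yield transform(doc) if rng.random() < fim_rate else doc
```

Under this reading, the fraction of transformed tokens in a finite sample need not equal `fim_rate` when document lengths vary, which is why the distinction matters for reproducing the ablations.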
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation for minor revision. We address the single major comment below by clarifying the evaluation protocol and updating the manuscript accordingly.
Point-by-point responses
-
Referee: §4 (Experiments): the central claim that L2R capability is preserved rests on perplexity and sampling metrics; however, the manuscript does not report whether these metrics were computed under identical prompting conditions for FIM-trained versus baseline models, leaving open the possibility that apparent parity masks a distribution shift when the model is later used in mixed L2R/FIM settings.
Authors: We appreciate the referee's careful attention to this methodological detail. The perplexity metrics were computed on held-out documents using standard left-to-right autoregressive factorization for both the baseline and FIM-trained models, with identical tokenization, no FIM-specific control tokens or prefixes, and the same document-level prompting. Sampling evaluations likewise employed identical generation hyperparameters, temperature, and prompt formats across model variants. To eliminate any ambiguity regarding potential distribution shifts in mixed L2R/FIM usage, we have added an explicit paragraph in §4 describing the shared evaluation protocol and confirming that all reported L2R results reflect this consistent setup. This revision directly addresses the concern while preserving the original empirical claims.
Revision: yes
Circularity Check
No circularity: empirical training and held-out evaluation
Full rationale
The paper reports direct experimental results from training autoregressive models on datasets with a middle-to-end span transformation and measuring left-to-right perplexity plus sampling quality on held-out text across model scales. No equations, derivations, or fitted parameters are presented whose outputs are defined by the inputs; the central claim rests on external benchmarks rather than self-definition or self-citation chains. This is the expected non-finding for an empirical methods paper.
Axiom & Free-Parameter Ledger
free parameters (2)
- FIM transformation frequency
- infill span selection method
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales"
- IndisputableMonolith.Foundation.Cost.FunctionalEquation washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation
MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...
-
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
-
"Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs
NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.
-
InCoder: A Generative Model for Code Infilling and Synthesis
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...
-
SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
-
RuC: HDL-Agnostic Rule Completion Benchmark Generation
RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-...
-
CPT: Controllable and Editable Design Variations with Language Models
CPT is a fine-tuned language model that uses Creative Markup Language representations of professional designs to generate controllable, stylistically coherent, and fully editable design variations.
-
Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production
A new past-conditioned future predictability measure best explains phonetic reduction and substitution error identity in naturalistic speech, subsuming backward predictability.
-
Mercury: Ultra-Fast Language Models Based on Diffusion
Mercury Coder diffusion LLMs achieve throughputs of 1109 and 737 tokens per second on H100 GPUs, up to 10x faster than frontier models with comparable quality.
-
Qwen2.5-1M Technical Report
Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.
-
Smart Paste: Automatically Fixing Copy/Paste for Google Developers
Smart Paste applies deep learning to predict and suggest post-paste code edits in Google's IDE, achieving 45% acceptance and contributing over 1% of all code written company-wide after deployment.
-
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Continuous diffusion for categorical data
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
-
Qwen2.5-Coder Technical Report
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming similarly sized and some larger models.