pith. machine review for the scientific record

arxiv: 2101.00190 · v1 · submitted 2021-01-01 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li, Percy Liang

Pith reviewed 2026-05-11 16:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords prefix-tuning · continuous prompts · parameter-efficient tuning · natural language generation · table-to-text generation · summarization · GPT-2 · BART

The pith

Prefix-tuning matches full fine-tuning on natural language generation by optimizing a small continuous prefix while freezing all language model parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes prefix-tuning as a lightweight method for adapting large pretrained models to generation tasks such as table-to-text and summarization. Instead of updating every parameter during fine-tuning, the approach freezes the model and optimizes only a short task-specific vector called the prefix. Subsequent tokens attend to this prefix exactly as they would to real input tokens. Experiments show that tuning just 0.1 percent of the total parameters produces performance comparable to full fine-tuning when plenty of data is available. The method also yields stronger results than fine-tuning when data is scarce and when the test examples cover topics absent from training.

Core claim

Prefix-tuning keeps the parameters of a pretrained language model frozen and instead learns a small continuous task-specific vector, the prefix, that is prepended to the input. Because later tokens attend to the prefix as virtual tokens, the prefix can steer generation for a downstream task. Applied to GPT-2 on table-to-text generation and to BART on summarization, optimizing the prefix alone (0.1 percent of parameters) reaches performance levels comparable to full fine-tuning in the full-data regime, exceeds fine-tuning in low-data regimes, and extrapolates better to topics unseen during training.
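
To make the mechanism concrete, the following is a minimal sketch of the idea on a single frozen attention layer in PyTorch. The dimensions, prefix length, and single-layer setup are illustrative assumptions; the paper itself learns per-layer prefix activations (reparameterized through an MLP during training) rather than one input-level vector.

```python
# Minimal sketch of prefix-tuning on one frozen attention layer (PyTorch).
# All sizes here are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

d_model, n_heads, prefix_len = 64, 4, 5

# "Pretrained" attention whose weights stay frozen.
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
for p in attn.parameters():
    p.requires_grad = False

# The only trainable parameters: a continuous prefix of virtual tokens.
prefix = nn.Parameter(0.02 * torch.randn(1, prefix_len, d_model))

def prefixed_attention(x):
    # x: (batch, seq, d_model). Queries come from real tokens only; keys and
    # values include the prepended prefix, so every token can attend to it.
    pre = prefix.expand(x.size(0), -1, -1)
    kv = torch.cat([pre, x], dim=1)
    out, _ = attn(query=x, key=kv, value=kv)
    return out

x = torch.randn(2, 10, d_model)
prefixed_attention(x).pow(2).mean().backward()
print(prefix.grad is not None)           # True: the prefix receives gradients
print(attn.in_proj_weight.grad is None)  # True: the frozen weights do not
```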

What carries the argument

The prefix, a short continuous task-specific vector that is prepended so later tokens can attend to it as virtual tokens, allowing task adaptation without changing any model weights.

If this is right

  • Only the small prefix must be stored per task rather than a full copy of the model weights.
  • Task adaptation remains effective even when training data is limited.
  • Generation quality holds up better on topics outside the training distribution.
  • The same frozen backbone can support many tasks by swapping only the prefix.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may scale to other generation tasks and model families without retraining the backbone each time.
  • Much of the knowledge needed for a task can be expressed through attention patterns to a learned prefix rather than weight changes.
  • Combining prefix-tuning with other storage-reduction techniques could further lower the cost of maintaining many specialized models.

Load-bearing premise

The frozen model's attention mechanism can be steered sufficiently well by the learned prefix to control output quality and generalization without any updates to the core parameters.

What would settle it

A controlled experiment on table-to-text or summarization where the best prefix-tuned model still underperforms the best fine-tuned model by a clear margin on automatic metrics or on human judgments of unseen topics.

Original abstract

Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes prefix-tuning as a lightweight alternative to fine-tuning for natural language generation. It freezes all parameters of a pretrained model (GPT-2 for table-to-text, BART for summarization) and instead optimizes only a small continuous task-specific prefix (0.1% of parameters) that is prepended to the input and attended to as virtual tokens. The central empirical claims are that this yields performance comparable to full fine-tuning in the full-data regime, superior performance in low-data regimes, and better extrapolation to examples with topics unseen during training.

Significance. If the reported performance patterns hold under rigorous controls, the work would be a meaningful contribution to parameter-efficient adaptation of large language models. It reduces per-task storage to a tiny prefix rather than a full model copy and appears to improve robustness in low-resource and out-of-distribution settings. The approach is simple, draws directly on the prompting literature, and is evaluated on two concrete generation tasks with standard metrics.

major comments (2)
  1. [§4] §4 (Experiments), low-data tables: the outperformance over fine-tuning is reported without stating the precise training-set sizes, the sampling procedure for the low-data subsets, or error bars across multiple random seeds. Because low-data regimes are central to the strongest claim, these details are required to assess whether the gains are reliable or could be explained by optimization variance.
  2. [§4.3] §4.3 or extrapolation subsection: the claim that prefix-tuning extrapolates better to unseen topics lacks an explicit definition of topic partitioning, a description of how held-out topics are constructed, and any analysis showing that the learned prefix modulates attention patterns rather than memorizing surface n-grams from the training distribution. This directly bears on whether the frozen attention mechanism can be steered as assumed.
minor comments (3)
  1. [§3] §3 (Method): the reparameterization of the prefix via an MLP is introduced but the precise initialization distribution, the choice of prefix length, and any ablation on these hyperparameters are not shown; a short table or paragraph would clarify reproducibility.
  2. [Abstract] Abstract and §1: the 0.1% parameter figure should be tied to the concrete model sizes (GPT-2 and BART variants) used in the experiments rather than left as a general statement.
  3. [Figure 1] Figure 1 or method diagram: the visualization of how the prefix is inserted into the attention computation would benefit from an explicit equation showing the modified key/value projections (a sketch of one possible form is given below).
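
For reference, one plausible form of the equation requested in the last comment, written in our notation rather than the paper's (P_K, P_V denote the learned prefix keys and values for a layer, W_Q, W_K, W_V the frozen projections, and d_k the head dimension):

```latex
% One plausible rendering; our notation, not necessarily the paper's.
\[
  \operatorname{head}(x_i) \;=\;
  \operatorname{softmax}\!\left(
    \frac{q_i\,[\,P_K ;\, K\,]^{\top}}{\sqrt{d_k}}
  \right)\,[\,P_V ;\, V\,],
  \qquad
  q_i = x_i W_Q,\quad K = X W_K,\quad V = X W_V,
\]
where $P_K, P_V \in \mathbb{R}^{|P| \times d_k}$ are the trainable prefix keys and
values for the layer, $|P|$ is the prefix length, and $W_Q, W_K, W_V$ stay frozen.
```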

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments on our work. We address each major point below and will update the manuscript accordingly to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments), low-data tables: the outperformance over fine-tuning is reported without stating the precise training-set sizes, the sampling procedure for the low-data subsets, or error bars across multiple random seeds. Because low-data regimes are central to the strongest claim, these details are required to assess whether the gains are reliable or could be explained by optimization variance.

    Authors: We agree these experimental details are essential for assessing reliability. In the revised version, we will explicitly report the precise training-set sizes (e.g., 100, 500, and 1000 examples for table-to-text; corresponding percentages for summarization), describe the sampling procedure (random subset selection with fixed random seeds for reproducibility, without stratification unless noted), and add error bars computed across at least three independent random seeds for both prefix-tuning and fine-tuning runs. These additions will directly address concerns about optimization variance. revision: yes

  2. Referee: [§4.3] §4.3 or extrapolation subsection: the claim that prefix-tuning extrapolates better to unseen topics lacks an explicit definition of topic partitioning, a description of how held-out topics are constructed, and any analysis showing that the learned prefix modulates attention patterns rather than memorizing surface n-grams from the training distribution. This directly bears on whether the frozen attention mechanism can be steered as assumed.

    Authors: We will add the requested clarifications. The revised manuscript will include: (1) an explicit definition of topic partitioning (using the dataset's inherent topic labels or k-means clustering on input features); (2) details on held-out topic construction (selecting topics with zero overlap in training data, ensuring no shared entities or keywords); and (3) supporting analysis, such as attention visualizations comparing prefix attention weights on unseen vs. seen topics, to demonstrate modulation of the frozen model rather than surface memorization. While an exhaustive mechanistic study exceeds the current scope, these additions will substantiate the extrapolation claim (a minimal sketch of one possible topic partition appears below). revision: partial
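
As a concrete illustration of the clustering-based partition mentioned in the response, here is a minimal, hypothetical sketch; the TF-IDF features, the number of clusters, and the held-out cluster IDs are assumptions for illustration, not the procedure used in the paper.

```python
# Hypothetical sketch of held-out topic construction via k-means clustering.
# TF-IDF features, 10 clusters, and the held-out cluster IDs are illustrative
# assumptions, not the paper's actual partitioning procedure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def split_by_topic(texts, n_topics=10, held_out=(0, 1), seed=0):
    feats = TfidfVectorizer(max_features=5000).fit_transform(texts)
    topics = KMeans(n_clusters=n_topics, random_state=seed, n_init=10).fit_predict(feats)
    train = [t for t, c in zip(texts, topics) if c not in held_out]
    unseen = [t for t, c in zip(texts, topics) if c in held_out]
    return train, unseen  # zero topic overlap between the two splits
```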

Circularity Check

0 steps flagged

No circularity: empirical results from prefix optimization are independent of fitted inputs

Full rationale

The paper introduces prefix-tuning as an optimization procedure that freezes the pretrained LM parameters and tunes only a small continuous prefix vector. Reported outcomes (comparable full-data performance, superior low-data results, and better unseen-topic extrapolation) are presented as direct empirical measurements from applying this procedure to GPT-2 on table-to-text and BART on summarization. No equations, derivations, or self-citations reduce these performance numbers to quantities defined by the prefix itself or to prior author work that would make the central claim tautological. The evaluation also stays grounded in external benchmarks: success is measured by standard generation metrics on held-out data rather than by construction from the optimization inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that a small learned prefix can steer a frozen model effectively, with the prefix vectors themselves serving as the primary fitted elements.

free parameters (2)
  • prefix length
    Chosen hyperparameter determining how many virtual tokens are prepended and optimized.
  • prefix vectors
    Continuous values optimized per task to condition the frozen model (see the reparameterization sketch after this ledger).
axioms (1)
  • domain assumption: Pretrained language model parameters remain fixed and sufficient when augmented with task-specific continuous prefixes.
    Invoked to justify freezing the model while claiming comparable or superior task performance.
invented entities (1)
  • continuous prefix (no independent evidence)
    purpose: Task-specific virtual tokens that subsequent tokens attend to during generation.
    Core new construct introduced to enable parameter-efficient adaptation.
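
The prefix vectors listed as fitted elements are, during training, typically produced by reparameterizing a smaller matrix through an MLP (the construction flagged in the referee's first minor comment). A minimal, hypothetical sketch of that construction follows; the layer count, widths, and two-layer MLP are assumptions, not the paper's reported configuration.

```python
# Sketch of the prefix reparameterization idea: a small embedding plus an MLP
# produces per-layer prefix activations; only these parameters are trained.
# Sizes and the two-layer MLP are assumptions, not the paper's exact settings.
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    def __init__(self, prefix_len=10, n_layers=12, d_model=64, d_hidden=128):
        super().__init__()
        self.ids = torch.arange(prefix_len)
        self.embed = nn.Embedding(prefix_len, d_model)
        # Maps each virtual-token embedding to one key and one value per layer.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, 2 * n_layers * d_model),
        )
        self.n_layers, self.d_model = n_layers, d_model

    def forward(self):
        flat = self.mlp(self.embed(self.ids))
        # (prefix_len, n_layers, key/value, d_model)
        return flat.view(len(self.ids), self.n_layers, 2, self.d_model)

print(PrefixEncoder()().shape)  # torch.Size([10, 12, 2, 64])
```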

pith-pipeline@v0.9.0 · 5433 in / 1265 out tokens · 57188 ms · 2026-05-11T16:51:28.480360+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.

  2. Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    ToGRL learns high-quality graph structures from raw heterogeneous graphs via a two-stage topology extraction process and prompt tuning, outperforming prior methods on five datasets.

  3. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  4. Steering Language Models With Activation Engineering

    cs.CL 2023-08 unverdicted novelty 7.0

    Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

  5. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  6. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  7. LoRA: Low-Rank Adaptation of Large Language Models

    cs.CL 2021-06 accept novelty 7.0

    Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.

  8. Combining pre-trained models via localized model averaging

    stat.ME 2026-05 unverdicted novelty 6.0

    Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

  9. XPERT: Expert Knowledge Transfer for Effective Training of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.

  10. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.

  11. OLLM: Options-based Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    OLLM models next-token generation as a latent-indexed set of options, enabling up to 70% math reasoning correctness versus 51% baselines and structure-based alignment via a compact latent policy.

  12. ConforNets: Latents-Based Conformational Control in OpenFold3

    q-bio.BM 2026-04 unverdicted novelty 6.0

    ConforNets use channel-wise affine transforms on pre-Pairformer pair latents in OpenFold3 to achieve state-of-the-art unsupervised generation of alternate protein states and supervised conformational transfer across families.

  13. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  14. Fed3D: Federated 3D Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    Fed3D is a federated 3D object detection system using local-global class-aware loss for heterogeneity and prompt modules for low-bandwidth communication, claiming better performance than prior methods on limited local data.

  15. BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning

    cs.LG 2026-04 unverdicted novelty 6.0

    BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.

  16. Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

  17. GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

    cs.DB 2026-04 unverdicted novelty 6.0

    GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.

  18. Visual prompting reimagined: The power of the Activation Prompts

    cs.CV 2026-04 unverdicted novelty 6.0

    Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.

  19. LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering

    cs.CR 2026-04 unverdicted novelty 6.0

    LLM4CodeRE adapts LLMs with multi-adapter and seq2seq fine-tuning for accurate assembly-to-source decompilation and reverse translation in code reverse engineering.

  20. CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CoLA introduces a dual-path low-rank adaptation method that adds cross-modal learning to LoRA, delivering small gains over standard LoRA on visual grounding and audio-visual benchmarks while preserving parameter efficiency.

  21. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  22. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  23. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  24. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  25. HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    HEDP uses energy regularization inspired by Helmholtz free energy plus hybrid energy-distance weighting in prompts to improve domain selection and achieve a 2.57% accuracy gain on benchmarks like CORe50 while mitigati...

  26. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...

  27. Deep Reprogramming Distillation for Medical Foundation Models

    cs.CV 2026-05 unverdicted novelty 5.0

    DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...

  28. AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

    cs.LG 2026-05 unverdicted novelty 5.0

    AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.

  29. ChipLingo: A Systematic Training Framework for Large Language Models in EDA

    cs.LG 2026-04 unverdicted novelty 5.0

    ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.

  30. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  31. AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    AeroRAG improves fine-grained aerial visual question answering by converting images to scene graphs and using retrieval-augmented generation to create compact LLM prompts.

  32. LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    LDEPrompt introduces layer-importance guided dual expandable prompt pools to achieve state-of-the-art class-incremental learning by enabling adaptive layer selection and dynamic prompt management.

  33. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  34. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  35. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  36. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.
