pith. machine review for the scientific record.

arxiv: 2104.08691 · v2 · submitted 2021-04-18 · 💻 cs.CL

Recognition: 2 theorem links


The Power of Scale for Parameter-Efficient Prompt Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 16:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords prompt tuning · soft prompts · parameter-efficient learning · language model scaling · model tuning · T5 · domain transfer · frozen models

The pith

As models grow to billions of parameters, learning a small set of soft prompts matches the performance of tuning all model weights while keeping the base model frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines prompt tuning, a method that learns continuous soft prompts through backpropagation to condition a frozen language model for specific tasks. Experiments on T5 models show that this approach grows more competitive with full model tuning as scale increases, eventually closing the performance gap at large sizes. This enables reuse of one shared model across many tasks without updating its weights. The technique also yields greater robustness under domain shifts than full tuning and simplifies earlier prefix-tuning methods while beating GPT-3 few-shot baselines.

Core claim

Prompt tuning closes the gap with model tuning at large scales: as T5 models exceed billions of parameters, learned soft prompts achieve performance comparable to tuning all model weights, while remaining far more parameter-efficient and enabling the same frozen model to serve multiple downstream tasks.

What carries the argument

Soft prompts: a small set of continuous, trainable vectors optimized via gradient descent to condition the input of a frozen language model.
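
To make the mechanism concrete, the sketch below shows prompt tuning against a frozen T5 checkpoint in PyTorch. It is an illustration, not the paper's implementation (the authors work in JAX/Flax on T5); the prompt length, learning rate, and random initialization are placeholder choices, and the paper also studies initializing prompts from vocabulary or class-label embeddings.

```python
# Minimal sketch of prompt tuning on a frozen T5 checkpoint (illustrative only;
# the paper's implementation uses JAX/Flax). PROMPT_LEN, the learning rate, and
# the random initialization are hypothetical choices, not values from the paper.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

PROMPT_LEN = 20                                   # soft-prompt length (a hyperparameter)

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.requires_grad_(False)                       # freeze every weight of the base model

embed = model.get_input_embeddings()              # frozen token-embedding table
soft_prompt = torch.nn.Parameter(                 # the only trainable parameters
    torch.randn(PROMPT_LEN, embed.embedding_dim) * 0.5)
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)

def train_step(input_text: str, target_text: str) -> float:
    enc = tokenizer(input_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    tok_embeds = embed(enc.input_ids)                                  # (1, T, d)
    # Prepend the learned prompt to the embedded input; the model itself is untouched.
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)
    prompt_mask = torch.ones(1, PROMPT_LEN, dtype=enc.attention_mask.dtype)
    attention_mask = torch.cat([prompt_mask, enc.attention_mask], dim=1)
    loss = model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=labels).loss
    loss.backward()                               # gradients reach only soft_prompt
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Only the PROMPT_LEN × d_model prompt matrix receives gradients, so the per-task artifact is on the order of tens of thousands of parameters against billions in the frozen model.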

If this is right

  • One frozen model can be reused for many tasks by storing only the small prompt parameters instead of separate full copies (see the sketch after this list).
  • Serving costs drop because the large model weights need to be loaded only once and shared across applications.
  • Domain-transfer robustness improves relative to full model tuning.
  • The approach simplifies prefix tuning while matching its results on the evaluated settings.
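
A rough sketch of what that reuse looks like at serving time, assuming (as an illustration, not the paper's system) that trained prompts are stored as small tensors and prepended to embedded inputs per request; dimensions, task names, and file paths are hypothetical.

```python
# Sketch: one resident frozen model, many tiny per-task prompt files.
# Dimensions, task names, and paths below are hypothetical.
import torch

D_MODEL, PROMPT_LEN = 512, 20

# Stand-ins for prompts already trained per task; each is ~PROMPT_LEN * D_MODEL floats,
# versus billions of parameters for a full fine-tuned model copy.
for task in ("sentiment", "entailment"):
    torch.save(torch.randn(PROMPT_LEN, D_MODEL), f"{task}_prompt.pt")

def with_task_prompt(task: str, token_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend the task's stored soft prompt to a batch of embedded inputs."""
    prompt = torch.load(f"{task}_prompt.pt")        # tiny per-task artifact
    batch = token_embeds.size(0)
    return torch.cat([prompt.unsqueeze(0).expand(batch, -1, -1), token_embeds], dim=1)

toy_inputs = torch.randn(4, 16, D_MODEL)            # pretend pre-embedded batch
print(with_task_prompt("sentiment", toy_inputs).shape)   # torch.Size([4, 36, 512])
```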

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deployment pipelines for very large models can shift toward storing and swapping small prompts rather than full fine-tuned weights.
  • If the trend continues, parameter-efficient adaptation may become the default route for applying foundation models to new tasks.
  • The method invites direct comparisons on non-T5 architectures to test whether the scale advantage is architecture-specific.

Load-bearing premise

The scaling trend observed on T5 models and the tested tasks will hold for other model families, architectures, and task distributions.

What would settle it

Prompt tuning failing to match full model tuning performance on a new family of models larger than a few billion parameters using the same training procedure.

read the original abstract

In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces prompt tuning, a parameter-efficient adaptation method that learns a small number of continuous 'soft prompt' embeddings while keeping the underlying language model (T5) frozen. It reports that on GLUE and SuperGLUE tasks, prompt tuning's performance gap to full model tuning shrinks with scale; at 11B parameters the two methods become competitive, and prompt tuning also outperforms GPT-3 few-shot learning while showing improved robustness under domain shift.

Significance. If the reported scaling trend holds, the result is significant: it demonstrates that a single frozen model can be reused across many tasks via tiny per-task prompt parameters, substantially lowering storage and serving costs for large LMs. The systematic size ablations on T5 (60M–11B) and direct comparisons to prefix tuning constitute a clear empirical contribution.

major comments (1)
  1. [§4.2 and Table 1] The central claim that prompt tuning 'matches' model tuning at 11B parameters rests on point estimates; no standard deviations across random seeds or statistical significance tests are reported, which weakens the assertion that the gap has closed rather than narrowed within noise.
minor comments (3)
  1. [§3.1] The definition of the soft prompt as a sequence of length k is clear, but the initialization scheme (random vs. vocabulary tokens), and whether it is held constant across all model sizes, should be stated explicitly in the main text rather than only in the appendix.
  2. [Figure 3] Axis labels and legend entries are too small for print; the scaling curves would be easier to read if the x-axis were log-scaled with explicit parameter counts annotated.
  3. [§5] The discussion of domain-transfer robustness would benefit from a brief statement of how the source and target domains were selected and whether the improvement is consistent across all transfer pairs or driven by a subset.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our work and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [§4.2 and Table 1] The central claim that prompt tuning 'matches' model tuning at 11B parameters rests on point estimates; no standard deviations across random seeds or statistical significance tests are reported, which weakens the assertion that the gap has closed rather than narrowed within noise.

    Authors: We agree that reporting variability across random seeds and including statistical significance tests would make the central claim more robust. Our original experiments used single runs for the 11B models owing to the substantial computational cost of training and evaluating models at this scale. In the revised manuscript, we will rerun the 11B-scale experiments with multiple random seeds, report mean performance and standard deviations in Table 1 and §4.2, and add a brief discussion of statistical significance for the key comparisons. This will allow readers to assess whether the performance gap has closed within the observed variance. revision: yes
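
For concreteness, a minimal sketch of the seed-level reporting the referee asks for; the scores below are placeholders, not numbers from the paper or the rebuttal, and the paired t-test is just one reasonable choice of significance test.

```python
# Sketch of mean/std reporting and a paired test across random seeds.
# The scores are placeholder values, not results from the paper.
import statistics
from scipy import stats

prompt_tuning = [89.1, 88.7, 89.4]        # hypothetical per-seed scores
model_tuning  = [89.3, 89.0, 89.5]

for name, scores in (("prompt tuning", prompt_tuning), ("model tuning", model_tuning)):
    print(f"{name}: mean={statistics.mean(scores):.2f} sd={statistics.stdev(scores):.2f}")

# Paired t-test over matched seeds: one way to ask whether the remaining gap
# is distinguishable from seed-to-seed noise.
t_stat, p_value = stats.ttest_rel(prompt_tuning, model_tuning)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
```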

Circularity Check

0 steps flagged

No significant circularity: empirical scaling observations only

full rationale

The paper reports direct experimental comparisons of prompt tuning versus model tuning across T5 model sizes (60M to 11B parameters) on standard NLP tasks. No equations, fitted parameters, or predictions are defined in terms of the target metrics; performance gaps are measured on held-out test sets using fixed training protocols. The central claim is an observed trend, not a derivation. Self-citations are absent from load-bearing steps, and the prefix-tuning comparison cites external work (Li & Liang 2021) without circular reduction. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on standard transfer-learning assumptions for language models plus the empirical scaling observation. Beyond the soft prompt vectors themselves and the length chosen for them, no load-bearing free parameters or invented entities are required.

free parameters (1)
  • soft prompt length
    The number of tokens in the learned prompt is a hyperparameter selected for each experiment.
axioms (1)
  • domain assumption: A frozen pre-trained language model encodes sufficient general knowledge that task-specific behavior can be elicited by conditioning on a small learned input prefix.
    Invoked throughout the method description and scaling experiments.
invented entities (1)
  • soft prompt · no independent evidence
    purpose: Continuous vector prefix that conditions the frozen model for a downstream task.
    The central new mechanism introduced and optimized via backpropagation.

pith-pipeline@v0.9.0 · 5503 in / 1389 out tokens · 80082 ms · 2026-05-11T16:27:24.983361+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Finetuned Language Models Are Zero-Shot Learners

    cs.CL 2021-09 accept novelty 8.0

    Instruction tuning a 137B language model on over 60 NLP tasks described by instructions substantially boosts zero-shot performance on unseen tasks, outperforming larger GPT-3 models.

  2. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  3. CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

    q-bio.GN 2026-04 unverdicted novelty 7.0

    CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...

  4. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  5. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  6. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  7. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  8. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  9. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  10. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  11. LoRA: Low-Rank Adaptation of Large Language Models

    cs.CL 2021-06 accept novelty 7.0

    Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.

  12. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  13. Combining pre-trained models via localized model averaging

    stat.ME 2026-05 unverdicted novelty 6.0

    Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

  14. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  15. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  16. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.

  17. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  18. ConforNets: Latents-Based Conformational Control in OpenFold3

    q-bio.BM 2026-04 unverdicted novelty 6.0

    ConforNets use channel-wise affine transforms on pre-Pairformer pair latents in OpenFold3 to achieve state-of-the-art unsupervised generation of alternate protein states and supervised conformational transfer across families.

  19. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  20. RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical predic...

  21. Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

  22. Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

    cs.LG 2026-04 unverdicted novelty 6.0

    LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.

  23. CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CoLA introduces a dual-path low-rank adaptation method that adds cross-modal learning to LoRA, delivering small gains over standard LoRA on visual grounding and audio-visual benchmarks while preserving parameter efficiency.

  24. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  25. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  26. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  27. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    cs.CL 2022-05 unverdicted novelty 6.0

    MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.

  28. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  29. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  30. HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    HEDP uses energy regularization inspired by Helmholtz free energy plus hybrid energy-distance weighting in prompts to improve domain selection and achieve a 2.57% accuracy gain on benchmarks like CORe50 while mitigati...

  31. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...

  32. Deep Reprogramming Distillation for Medical Foundation Models

    cs.CV 2026-05 unverdicted novelty 5.0

    DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...

  33. AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

    cs.LG 2026-05 unverdicted novelty 5.0

    AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.

  34. SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning

    cs.DC 2026-04 unverdicted novelty 5.0

    SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.

  35. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  36. RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments

    cs.LG 2026-04 unverdicted novelty 5.0

    RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.

  37. HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.

  38. SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

    cs.CL 2026-04 unverdicted novelty 5.0

    SCHK-HTC uses sibling contrastive learning plus hierarchical prompt tuning to improve discrimination between confusable sibling classes in few-shot hierarchical text classification.

  39. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  40. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  41. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

Reference graph

Works this paper leans on

294 extracted references · 294 canonical work pages · cited by 40 Pith papers · 2 internal anchors

  1. [1]

    Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6--4. Venice

  2. [2]

    Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC

  3. [3]

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. http://github.com/google/jax JAX: composable transformations of Python+NumPy programs

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  5. [5]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  6. [6]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177--190. Springer

  7. [7]

    Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23

  8. [8]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

  9. [9]

    William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

  10. [10]

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. https://doi.org/10.18653/v1/N19-1246 DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  11. [11]

    Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP

  12. [12]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9. Association for Computational Linguistics

  13. [14]

    L. K. Hansen and P. Salamon. 1990. https://doi.org/10.1109/34.58871 Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993--1001

  14. [15]

    Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. http://github.com/google/flax Flax: A neural network library and ecosystem for JAX

  15. [16]

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. http://proceedings.mlr.press/v97/houlsby19a.html Parameter-efficient transfer learning for NLP . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine L...

  16. [17]

    Jeremy Howard and Sebastian Ruder. 2018. https://doi.org/10.18653/v1/P18-1031 Universal language model fine-tuning for text classification . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia. Association for Computational Linguistics

  17. [18]

    Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs First Quora dataset release: Question pairs

  18. [19]

    Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. https://doi.org/10.1162/tacl_a_00324 How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423--438

  19. [20]

    A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. https://doi.org/10.1109/CVPR.2017.571 Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376--5384

  20. [21]

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL)

  21. [22]

    Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. https://doi.org/10.18653/v1/P19-1478 A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837--4842, Florence, Italy. Association for Computational L...

  22. [24]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. https://doi.org/10.18653/v1/D17-1082 RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark. Association for Computational Linguistics

  23. [25]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf Simple and scalable predictive uncertainty estimation using deep ensembles . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  24. [26]

    Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning

  25. [27]

    Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/K17-1034 Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333--342, Vancouver, Canada. Association for Computational Linguistics

  26. [28]

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. https://doi.org/10.18653/v1/2020.acl-main.703 BART : Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension . In Proceedings of the 58th Annual Meeting of the Associat...

  27. [30]

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. http://arxiv.org/abs/2103.10385 GPT understands, too . CoRR, abs/2103.10385

  28. [31]

    Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. http://arxiv.org/abs/2012.09543 Few-shot sequence learning with transformers . CoRR, abs/2012.09543

  29. [32]

    Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807--814, Madison, WI, USA. Omnipress

  30. [33]

    Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. https://doi.org/10.18653/v1/N18-1202 Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Lon...

  31. [34]

    Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.617 MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online. Association for Comput...

  32. [35]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. http://arxiv.org/abs/1808.09121 WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121

  33. [37]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training

  34. [38]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI Blog

  35. [39]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67

  36. [41]

    Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. https://proceedings.neurips.cc/paper/2017/file/e7b24b112a44fdd9ee93bdf998c6ca0e-Paper.pdf Learning multiple visual domains with residual adapters . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  37. [42]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series

  38. [43]

    Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. https://doi.org/10.18653/v1/P18-1156 DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683--1693, Melbourne, ...

  39. [44]

    Timo Schick and Hinrich Schütze. 2021. https://aclanthology.org/2021.eacl-main.20 Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255--269, Online. Association for Computational...

  40. [45]

    Noam Shazeer. 2020. http://arxiv.org/abs/2002.05202 GLU variants improve transformer . CoRR, abs/2002.05202

  41. [46]

    Noam Shazeer and Mitchell Stern. 2018. http://proceedings.mlr.press/v80/shazeer18a.html Adafactor: Adaptive learning rates with sublinear memory cost . In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596--4604. PMLR

  42. [47]

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.346 AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222--4235, On...

  43. [48]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998--6008

  44. [49]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing System...

  45. [50]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR

  46. [52]

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. http://arxiv.org/abs/1810.12885 ReCoRD : Bridging the gap between human and machine commonsense reading comprehension . CoRR, abs/1810.12885

  47. [53]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33

  48. [54]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67

  49. [55]

    Prefix-tuning: Optimizing continuous prompts for generation

    Li, Xiang Lisa and Liang, Percy. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.353

  50. [56]

    WARP: Word-level Adversarial Reprogramming

    Hambardzumyan, Karen and Khachatrian, Hrant and May, Jonathan. WARP: Word-level Adversarial Reprogramming. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.381

  51. [57]

    Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543

  52. [58]

    Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs

  53. [59]

    William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

  54. [60]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR

  55. [61]

    Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP

  56. [62]

    Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. 2020. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. CoRR, abs/2012.13255

  57. [63]

    Andrew K. Lampinen and James L. McClelland. 2020. Transforming task representations to perform novel tasks. Proceedings of the National Academy of Sciences

  58. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998--6008

  59. [65]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32

  60. [66]

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL)

  61. [67]

    Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23

  62. [68]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177--190. Springer

  63. [69]

    Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment

  64. [70]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9

  65. [71]

    Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC

  66. [72]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series

  67. [73]

    Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning

  68. [74]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121

  69. [75]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. CoRR, abs/1805.12471

  70. [76]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

  71. [77]

    SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055

  72. [78]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

    Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018

  73. [79]

    SQuAD : 100,000+ questions for machine comprehension of text

    Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1264

  74. [80]

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385

  75. [81]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog

  76. [82]

    2013 IEEE International Conference on Acoustics, Speech and Signal Processing , title=

    A. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing , title=. 2013 , volume=

  77. [83]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Kudo, Taku and Richardson, John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018. doi:10.18653/v1/D18-2012

  78. [84]

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885

  79. [85]

    A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376--5384

  80. [86]

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax

Showing first 80 references.