pith. machine review for the scientific record.

arxiv: 2104.08691 · v2 · submitted 2021-04-18 · 💻 cs.CL

Recognition: 2 theorem links


The Power of Scale for Parameter-Efficient Prompt Tuning

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 16:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords prompt tuning · soft prompts · parameter-efficient learning · language model scaling · model tuning · T5 · domain transfer · frozen models

The pith

As models grow to billions of parameters, learning a small set of soft prompts matches the performance of tuning all model weights while keeping the base model frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines prompt tuning, a method that learns continuous soft prompts through backpropagation to condition a frozen language model for specific tasks. Experiments on T5 models show that this approach grows more competitive with full model tuning as scale increases, eventually closing the performance gap at large sizes. This enables reuse of one shared model across many tasks without updating its weights. The technique also yields greater robustness under domain shifts than full tuning and simplifies earlier prefix-tuning methods while beating GPT-3 few-shot baselines.

Core claim

Prompt tuning closes the gap with model tuning at large scales: as T5 models exceed billions of parameters, learned soft prompts achieve performance comparable to tuning all model weights, while remaining far more parameter-efficient and enabling the same frozen model to serve multiple downstream tasks.

What carries the argument

Soft prompts: a small set of continuous, trainable vectors optimized via gradient descent to condition the input of a frozen language model.
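
To make the mechanism concrete, the sketch below shows prompt tuning against a frozen T5 checkpoint in PyTorch. It is an illustration, not the paper's implementation (the authors work in JAX/Flax on T5); the prompt length, learning rate, and random initialization are placeholder choices, and the paper also studies initializing prompts from vocabulary or class-label embeddings.

```python
# Minimal sketch of prompt tuning on a frozen T5 checkpoint (illustrative only;
# the paper's implementation uses JAX/Flax). PROMPT_LEN, the learning rate, and
# the random initialization are hypothetical choices, not values from the paper.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

PROMPT_LEN = 20                                   # soft-prompt length (a hyperparameter)

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.requires_grad_(False)                       # freeze every weight of the base model

embed = model.get_input_embeddings()              # frozen token-embedding table
soft_prompt = torch.nn.Parameter(                 # the only trainable parameters
    torch.randn(PROMPT_LEN, embed.embedding_dim) * 0.5)
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)

def train_step(input_text: str, target_text: str) -> float:
    enc = tokenizer(input_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    tok_embeds = embed(enc.input_ids)                                  # (1, T, d)
    # Prepend the learned prompt to the embedded input; the model itself is untouched.
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)
    prompt_mask = torch.ones(1, PROMPT_LEN, dtype=enc.attention_mask.dtype)
    attention_mask = torch.cat([prompt_mask, enc.attention_mask], dim=1)
    loss = model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=labels).loss
    loss.backward()                               # gradients reach only soft_prompt
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Only the PROMPT_LEN × d_model prompt matrix receives gradients, so the per-task artifact is on the order of tens of thousands of parameters against billions in the frozen model.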

If this is right

  • One frozen model can be reused for many tasks by storing only the small prompt parameters instead of separate full copies (see the sketch after this list).
  • Serving costs drop because the large model weights need to be loaded only once and shared across applications.
  • Domain-transfer robustness improves relative to full model tuning.
  • The approach simplifies prefix tuning while matching its results on the evaluated settings.
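
A rough sketch of what that reuse looks like at serving time, assuming (as an illustration, not the paper's system) that trained prompts are stored as small tensors and prepended to embedded inputs per request; dimensions, task names, and file paths are hypothetical.

```python
# Sketch: one resident frozen model, many tiny per-task prompt files.
# Dimensions, task names, and paths below are hypothetical.
import torch

D_MODEL, PROMPT_LEN = 512, 20

# Stand-ins for prompts already trained per task; each is ~PROMPT_LEN * D_MODEL floats,
# versus billions of parameters for a full fine-tuned model copy.
for task in ("sentiment", "entailment"):
    torch.save(torch.randn(PROMPT_LEN, D_MODEL), f"{task}_prompt.pt")

def with_task_prompt(task: str, token_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend the task's stored soft prompt to a batch of embedded inputs."""
    prompt = torch.load(f"{task}_prompt.pt")        # tiny per-task artifact
    batch = token_embeds.size(0)
    return torch.cat([prompt.unsqueeze(0).expand(batch, -1, -1), token_embeds], dim=1)

toy_inputs = torch.randn(4, 16, D_MODEL)            # pretend pre-embedded batch
print(with_task_prompt("sentiment", toy_inputs).shape)   # torch.Size([4, 36, 512])
```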

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deployment pipelines for very large models can shift toward storing and swapping small prompts rather than full fine-tuned weights.
  • If the trend continues, parameter-efficient adaptation may become the default route for applying foundation models to new tasks.
  • The method invites direct comparisons on non-T5 architectures to test whether the scale advantage is architecture-specific.

Load-bearing premise

The scaling trend observed on T5 models and the tested tasks will hold for other model families, architectures, and task distributions.

What would settle it

Prompt tuning failing to match full model tuning performance on a new family of models larger than a few billion parameters using the same training procedure.

read the original abstract

In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces prompt tuning, a parameter-efficient adaptation method that learns a small number of continuous 'soft prompt' embeddings while keeping the underlying language model (T5) frozen. It reports that on GLUE and SuperGLUE tasks, prompt tuning's performance gap to full model tuning shrinks with scale; at 11B parameters the two methods become competitive, and prompt tuning also outperforms GPT-3 few-shot learning while showing improved robustness under domain shift.

Significance. If the reported scaling trend holds, the result is significant: it demonstrates that a single frozen model can be reused across many tasks via tiny per-task prompt parameters, substantially lowering storage and serving costs for large LMs. The systematic size ablations on T5 (60M–11B) and direct comparisons to prefix tuning constitute a clear empirical contribution.

major comments (1)
  1. [§4.2 and Table 1] The central claim that prompt tuning 'matches' model tuning at 11B parameters rests on point estimates; no standard deviations across random seeds or statistical significance tests are reported, which weakens the assertion that the gap has closed rather than narrowed within noise.
minor comments (3)
  1. [§3.1] The definition of the soft prompt as a sequence of length k is clear, but the initialization scheme (random vs. vocabulary tokens), and whether it is held constant across all model sizes, should be stated explicitly in the main text rather than only in the appendix.
  2. [Figure 3] Axis labels and legend entries are too small for print; the scaling curves would be easier to read if the x-axis were log-scaled with explicit parameter counts annotated.
  3. [§5] The discussion of domain-transfer robustness would benefit from a brief statement of how the source and target domains were selected and whether the improvement is consistent across all transfer pairs or driven by a subset.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our work and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [§4.2 and Table 1] The central claim that prompt tuning 'matches' model tuning at 11B parameters rests on point estimates; no standard deviations across random seeds or statistical significance tests are reported, which weakens the assertion that the gap has closed rather than narrowed within noise.

    Authors: We agree that reporting variability across random seeds and including statistical significance tests would make the central claim more robust. Our original experiments used single runs for the 11B models owing to the substantial computational cost of training and evaluating models at this scale. In the revised manuscript, we will rerun the 11B-scale experiments with multiple random seeds, report mean performance and standard deviations in Table 1 and §4.2, and add a brief discussion of statistical significance for the key comparisons. This will allow readers to assess whether the performance gap has closed within the observed variance. revision: yes
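
For concreteness, a minimal sketch of the seed-level reporting the referee asks for; the scores below are placeholders, not numbers from the paper or the rebuttal, and the paired t-test is just one reasonable choice of significance test.

```python
# Sketch of mean/std reporting and a paired test across random seeds.
# The scores are placeholder values, not results from the paper.
import statistics
from scipy import stats

prompt_tuning = [89.1, 88.7, 89.4]        # hypothetical per-seed scores
model_tuning  = [89.3, 89.0, 89.5]

for name, scores in (("prompt tuning", prompt_tuning), ("model tuning", model_tuning)):
    print(f"{name}: mean={statistics.mean(scores):.2f} sd={statistics.stdev(scores):.2f}")

# Paired t-test over matched seeds: one way to ask whether the remaining gap
# is distinguishable from seed-to-seed noise.
t_stat, p_value = stats.ttest_rel(prompt_tuning, model_tuning)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
```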

Circularity Check

0 steps flagged

No significant circularity: empirical scaling observations only

full rationale

The paper reports direct experimental comparisons of prompt tuning versus model tuning across T5 model sizes (60M to 11B parameters) on standard NLP tasks. No equations, fitted parameters, or predictions are defined in terms of the target metrics; performance gaps are measured on held-out test sets using fixed training protocols. The central claim is an observed trend, not a derivation. Self-citations are absent from load-bearing steps, and the prefix-tuning comparison cites external work (Li & Liang 2021) without circular reduction. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on standard transfer-learning assumptions for language models plus the empirical scaling observation. Beyond the soft prompt vectors themselves and the length chosen for them, no load-bearing free parameters or invented entities are required.

free parameters (1)
  • soft prompt length
    The number of tokens in the learned prompt is a hyperparameter selected for each experiment.
axioms (1)
  • domain assumption: A frozen pre-trained language model encodes sufficient general knowledge that task-specific behavior can be elicited by conditioning on a small learned input prefix.
    Invoked throughout the method description and scaling experiments.
invented entities (1)
  • soft prompt · no independent evidence
    purpose: Continuous vector prefix that conditions the frozen model for a downstream task.
    The central new mechanism introduced and optimized via backpropagation.

pith-pipeline@v0.9.0 · 5503 in / 1389 out tokens · 80082 ms · 2026-05-11T16:27:24.983361+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Finetuned Language Models Are Zero-Shot Learners

    cs.CL 2021-09 accept novelty 8.0

    Instruction tuning a 137B language model on over 60 NLP tasks described by instructions substantially boosts zero-shot performance on unseen tasks, outperforming larger GPT-3 models.

  2. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  3. CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

    q-bio.GN 2026-04 unverdicted novelty 7.0

    CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...

  4. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  5. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  6. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  7. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  8. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  9. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  10. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  11. LoRA: Low-Rank Adaptation of Large Language Models

    cs.CL 2021-06 accept novelty 7.0

    Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.

  12. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  13. Combining pre-trained models via localized model averaging

    stat.ME 2026-05 unverdicted novelty 6.0

    Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

  14. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  15. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  16. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.

  17. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  18. ConforNets: Latents-Based Conformational Control in OpenFold3

    q-bio.BM 2026-04 unverdicted novelty 6.0

    ConforNets use channel-wise affine transforms on pre-Pairformer pair latents in OpenFold3 to achieve state-of-the-art unsupervised generation of alternate protein states and supervised conformational transfer across families.

  19. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  20. RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical predic...

  21. Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

  22. Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

    cs.LG 2026-04 unverdicted novelty 6.0

    LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.

  23. CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CoLA introduces a dual-path low-rank adaptation method that adds cross-modal learning to LoRA, delivering small gains over standard LoRA on visual grounding and audio-visual benchmarks while preserving parameter efficiency.

  24. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  25. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  26. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  27. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    cs.CL 2022-05 unverdicted novelty 6.0

    MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.

  28. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  29. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  30. HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    HEDP uses energy regularization inspired by Helmholtz free energy plus hybrid energy-distance weighting in prompts to improve domain selection and achieve a 2.57% accuracy gain on benchmarks like CORe50 while mitigati...

  31. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...

  32. Deep Reprogramming Distillation for Medical Foundation Models

    cs.CV 2026-05 unverdicted novelty 5.0

    DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...

  33. AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

    cs.LG 2026-05 unverdicted novelty 5.0

    AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.

  34. SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning

    cs.DC 2026-04 unverdicted novelty 5.0

    SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.

  35. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  36. RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments

    cs.LG 2026-04 unverdicted novelty 5.0

    RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.

  37. HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.

  38. SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

    cs.CL 2026-04 unverdicted novelty 5.0

    SCHK-HTC uses sibling contrastive learning plus hierarchical prompt tuning to improve discrimination between confusable sibling classes in few-shot hierarchical text classification.

  39. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  40. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  41. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

Reference graph

Works this paper leans on

294 extracted references · 294 canonical work pages · cited by 40 Pith papers · 2 internal anchors

  1. [1]

    Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6--4. Venice

  2. [2]

    Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC

  3. [3]

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. http://github.com/google/jax JAX: composable transformations of Python+NumPy programs

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  5. [5]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  6. [6]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177--190. Springer

  7. [7]

    Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23

  8. [8]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

  9. [9]

    William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

  10. [10]

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. https://doi.org/10.18653/v1/N19-1246 DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  11. [11]

    Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP

  12. [12]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9. Association for Computational Linguistics

  13. [14]

    L. K. Hansen and P. Salamon. 1990. https://doi.org/10.1109/34.58871 Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993--1001

  14. [15]

    Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. http://github.com/google/flax Flax: A neural network library and ecosystem for JAX

  15. [16]

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. http://proceedings.mlr.press/v97/houlsby19a.html Parameter-efficient transfer learning for NLP . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine L...

  16. [17]

    Jeremy Howard and Sebastian Ruder. 2018. https://doi.org/10.18653/v1/P18-1031 Universal language model fine-tuning for text classification . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia. Association for Computational Linguistics

  17. [18]

    Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs First Quora dataset release: Question pairs

  18. [19]

    Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. https://doi.org/10.1162/tacl_a_00324 How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423--438

  19. [20]

    A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. https://doi.org/10.1109/CVPR.2017.571 Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376--5384

  20. [21]

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL)

  21. [22]

    Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. https://doi.org/10.18653/v1/P19-1478 A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837--4842, Florence, Italy. Association for Computational L...

  22. [24]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. https://doi.org/10.18653/v1/D17-1082 RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark. Association for Computational Linguistics

  23. [25]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf Simple and scalable predictive uncertainty estimation using deep ensembles . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  24. [26]

    Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning

  25. [27]

    Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/K17-1034 Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333--342, Vancouver, Canada. Association for Computational Linguistics

  26. [28]

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. https://doi.org/10.18653/v1/2020.acl-main.703 BART : Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension . In Proceedings of the 58th Annual Meeting of the Associat...

  27. [30]

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. http://arxiv.org/abs/2103.10385 GPT understands, too . CoRR, abs/2103.10385

  28. [31]

    Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. http://arxiv.org/abs/2012.09543 Few-shot sequence learning with transformers . CoRR, abs/2012.09543

  29. [32]

    Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807--814, Madison, WI, USA. Omnipress

  30. [33]

    Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. https://doi.org/10.18653/v1/N18-1202 Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Lon...

  31. [34]

    Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.617 MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online. Association for Comput...

  32. [35]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. http://arxiv.org/abs/1808.09121 WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121

  33. [37]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training

  34. [38]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI Blog

  35. [39]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67

  36. [41]

    Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. https://proceedings.neurips.cc/paper/2017/file/e7b24b112a44fdd9ee93bdf998c6ca0e-Paper.pdf Learning multiple visual domains with residual adapters . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  37. [42]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series

  38. [43]

    Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. https://doi.org/10.18653/v1/P18-1156 DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683--1693, Melbourne, ...

  39. [44]

    Timo Schick and Hinrich Schütze. 2021. https://aclanthology.org/2021.eacl-main.20 Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255--269, Online. Association for Computational...

  40. [45]

    Noam Shazeer. 2020. http://arxiv.org/abs/2002.05202 GLU variants improve transformer . CoRR, abs/2002.05202

  41. [46]

    Noam Shazeer and Mitchell Stern. 2018. http://proceedings.mlr.press/v80/shazeer18a.html Adafactor: Adaptive learning rates with sublinear memory cost . In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596--4604. PMLR

  42. [47]

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.346 AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222--4235, On...

  43. [48]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998--6008

  44. [49]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing System...

  45. [50]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR

  46. [52]

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. http://arxiv.org/abs/1810.12885 ReCoRD : Bridging the gap between human and machine commonsense reading comprehension . CoRR, abs/1810.12885

  47. [53]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33

  48. [54]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67

  49. [55]

    Prefix-tuning: Optimizing continuous prompts for generation

    Li, Xiang Lisa and Liang, Percy. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.353

  50. [56]

    WARP: Word-level Adversarial Reprogramming

    Hambardzumyan, Karen and Khachatrian, Hrant and May, Jonathan. WARP: Word-level Adversarial Reprogramming. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.381

  51. [57]

    Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543

  52. [58]

    Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs

  53. [59]

    William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

  54. [60]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR

  55. [61]

    Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP

  56. [62]

    Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. 2020. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. CoRR, abs/2012.13255

  57. [63]

    Andrew K. Lampinen and James L. McClelland. 2020. Transforming task representations to perform novel tasks. Proceedings of the National Academy of Sciences

  58. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998--6008

  59. [65]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32

  60. [66]

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL)

  61. [67]

    Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23

  62. [68]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177--190. Springer

  63. [69]

    Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment

  64. [70]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9

  65. [71]

    Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC

  66. [72]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series

  67. [73]

    Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning

  68. [74]

    Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121

  69. [75]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. CoRR, abs/1805.12471

  70. [76]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

  71. [77]

    SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055

  72. [78]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

    Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018

  73. [79]

    SQuAD : 100,000+ questions for machine comprehension of text

    Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1264

  74. [80]

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385

  75. [81]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog

  76. [82]

    2013 IEEE International Conference on Acoustics, Speech and Signal Processing , title=

    A. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing , title=. 2013 , volume=

  77. [83]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Kudo, Taku and Richardson, John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018. doi:10.18653/v1/D18-2012

  78. [84]

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885

  79. [85]

    A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376--5384

  80. [86]

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax

Showing first 80 references.