pith. machine review for the scientific record.

arxiv: 1804.07461 · v3 · submitted 2018-04-20 · 💻 cs.CL

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Felix Hill, Julian Michael, Omer Levy, Samuel R. Bowman

Pith reviewed 2026-05-12 21:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords GLUE · natural language understanding · benchmark · multi-task learning · transfer learning · NLU evaluation · diagnostic analysis

The pith

GLUE supplies a benchmark of nine NLU tasks plus diagnostics to test models for general rather than task-specific language understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLUE as a single evaluation platform that aggregates results across existing natural language understanding tasks to measure how well models handle language in a general way. Current models often excel only when trained separately on one task at a time, so the benchmark includes several tasks with very small training sets to reward approaches that share knowledge across problems. It also supplies a separate hand-crafted diagnostic suite that breaks down model errors by specific linguistic features such as coreference or negation. Baseline experiments with multi-task and transfer methods show they produce little gain over training one model per task, which points to the need for new techniques that truly generalize.

Core claim

GLUE is a model-agnostic collection of nine NLU tasks plus a diagnostic test suite that together measure whether a system exhibits broad language understanding, and current multi-task baselines fail to improve substantially over the aggregate score obtained by training separate models per task.

What carries the argument

The GLUE benchmark itself, which combines performance scores from nine tasks with limited-data subsets and a hand-crafted diagnostic test suite for linguistic analysis.
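The aggregate score that carries this comparison is, to a first approximation, a macro-average: a task's metrics are averaged when it reports more than one, and the per-task scores are then averaged across the nine tasks. A minimal sketch under that assumption; the task names and metric types are GLUE's, but every numeric score below is an invented placeholder, not a result from the paper.

```python
# Sketch of a GLUE-style aggregate: average a task's metrics when it
# reports more than one, then macro-average across tasks. The task names
# are the real GLUE tasks; every score below is an invented placeholder.
from statistics import mean

def glue_score(task_scores):
    """Mean of per-task scores, where a task's score is the mean of its metrics."""
    return mean(mean(metrics) for metrics in task_scores.values())

scores = {
    "CoLA":  [35.0],        # Matthews correlation
    "SST-2": [90.2],        # accuracy
    "MRPC":  [85.1, 79.3],  # F1, accuracy
    "STS-B": [81.0, 80.2],  # Pearson, Spearman
    "QQP":   [70.1, 88.0],  # F1, accuracy
    "MNLI":  [76.5],        # accuracy
    "QNLI":  [79.0],        # accuracy
    "RTE":   [58.1],        # accuracy
    "WNLI":  [56.3],        # accuracy
}
print(round(glue_score(scores), 2))  # 70.77
```

Macro-averaging gives each task equal weight regardless of test-set size, which is what lets the small-data tasks exert leverage on the headline number.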

If this is right

  • A single aggregate score can rank models on their overall language understanding ability.
  • Training regimes that move knowledge between tasks become directly measurable and rewarded.
  • The diagnostic suite can isolate which linguistic phenomena still cause models to fail.
  • Further progress requires methods that go beyond simple multi-task fine-tuning.
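The third bullet, isolating which phenomena cause failures, amounts to bucketing a model's errors by linguistic tags. A minimal sketch with invented items, tags, and predictions; the real diagnostic suite scores each phenomenon with a correlation-style coefficient rather than the raw accuracy used here.

```python
# Sketch of a per-phenomenon error breakdown in the spirit of the GLUE
# diagnostic suite. Items and predictions are invented; the real suite
# scores each phenomenon with a correlation-style coefficient rather
# than the raw accuracy shown here.
from collections import defaultdict

def breakdown(examples):
    """examples: (phenomenon_tags, gold_label, predicted_label) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tags, gold, pred in examples:
        for tag in tags:
            totals[tag] += 1
            hits[tag] += int(gold == pred)
    return {tag: hits[tag] / totals[tag] for tag in totals}

preds = [
    (["negation"],                "contradiction", "entailment"),
    (["negation", "coreference"], "entailment",    "entailment"),
    (["coreference"],             "neutral",       "neutral"),
]
print(breakdown(preds))  # {'negation': 0.5, 'coreference': 1.0}
```

Because one example can carry several tags, the buckets overlap; a model's aggregate score can look healthy while a single phenomenon bucket collapses.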

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • GLUE could serve as a stable reference point for comparing new NLU systems over time.
  • Adding tasks that probe longer-range reasoning or world knowledge would test whether current high scores reflect deeper understanding.
  • If GLUE scores predict success on downstream applications, the benchmark could guide practical model selection.

Load-bearing premise

The nine chosen tasks are diverse enough to stand in for general language understanding rather than measuring narrow skills.

What would settle it

A model that scores high on the full GLUE suite but collapses on new tasks that require the same linguistic skills in fresh combinations would show the benchmark does not capture generality.

read the original abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the GLUE benchmark as a collection of nine existing NLU tasks (MNLI, QQP, SST-2, CoLA, STS-B, MRPC, RTE, WNLI, QNLI) chosen for diversity in type and data size, along with a hand-crafted diagnostic test suite for linguistic analysis. It evaluates single-task, multi-task, and transfer-learning baselines and reports that the latter approaches do not yield substantial aggregate improvements over per-task training, suggesting room for better general NLU methods.

Significance. If the task collection is representative and the baseline comparisons are reproducible, the work supplies a standardized, model-agnostic platform that directly incentivizes cross-task knowledge sharing and has already become a de facto evaluation standard. The explicit release of the benchmark, code, and diagnostic suite constitutes a concrete reproducibility strength that supports community-wide adoption and iterative improvement.

major comments (3)
  1. [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.
  2. [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.
  3. [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'
minor comments (2)
  1. [§5] §5 (diagnostic suite): a few concrete example items for each linguistic phenomenon would improve clarity and allow readers to assess the suite's coverage without consulting external resources.
  2. [Table 1] Table 1 and §3: the WNLI task description should explicitly note its known label-distribution artifacts, as these affect interpretation of model performance on that sub-task.
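Major comment 3 asks for variance estimates and significance tests across runs. One stdlib-only way to compare single-task and multi-task training would be a two-sided permutation test on the difference of mean GLUE scores across seeds. A sketch; the per-seed scores are invented placeholders, not numbers from the paper.

```python
# Sketch of the multi-seed significance check asked for in major comment 3:
# a two-sided permutation test on the difference of mean GLUE scores
# between single-task and multi-task training. Per-seed scores are invented.
import random
from statistics import mean, stdev

def perm_test(a, b, n=10000, seed=0):
    """p-value for |mean(a) - mean(b)| under random relabeling of scores."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            extreme += 1
    return extreme / n

single = [70.1, 69.8, 70.4, 70.0, 69.9]  # one GLUE score per seed (invented)
multi  = [70.3, 70.6, 69.9, 70.5, 70.2]
print(f"single {mean(single):.2f}±{stdev(single):.2f}, "
      f"multi {mean(multi):.2f}±{stdev(multi):.2f}, "
      f"p = {perm_test(single, multi):.3f}")
```

With only a handful of seeds a permutation test is about as much power as the data supports; reporting the means and standard deviations alongside the p-value keeps the "no substantial improvement" claim auditable.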

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading of our manuscript and their recommendation for minor revision. We address each major comment below, indicating where we will make revisions to the paper.

read point-by-point responses
  1. Referee: [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.

    Authors: We agree that the task selection in §3 relies on qualitative arguments regarding the diversity of the tasks in terms of format, size, and the phenomena they test. While this diversity is detailed in the paper and supported by the diagnostic suite, we acknowledge the benefit of quantitative evidence. In the revised manuscript, we will add an analysis of inter-task error correlations computed from our baseline models to provide quantitative support for the tasks measuring somewhat independent capabilities. revision: yes

  2. Referee: [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.

    Authors: We thank the referee for pointing this out. The original manuscript and accompanying code release aimed to provide sufficient details, but we agree that explicit specifications are needed for full reproducibility. We will revise §4 to include the precise task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds used in our experiments. revision: yes

  3. Referee: [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'

    Authors: We agree that including variance estimates and significance tests would strengthen the experimental claims. At the time of the original submission, we reported results from single runs due to computational constraints. For the revised version, we will re-run the main single-task and multi-task experiments with multiple random seeds to report means and standard deviations, and include statistical comparisons where appropriate. revision: yes
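The first response promises inter-task correlations computed from baseline models. One way to realize that on aggregate numbers is to treat each baseline model as an observation and correlate its scores on a pair of tasks. A sketch only: the per-model scores are invented, and the analysis the authors describe may instead operate on per-example error indicators.

```python
# Sketch of the inter-task correlation analysis promised in response 1.
# Each baseline model is one observation; we correlate its scores on two
# tasks. The per-model scores are invented, and the paper's analysis may
# instead use per-example error indicators.
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm

mnli = [60.1, 65.3, 70.2, 72.8, 58.4]  # invented scores for five baselines
rte  = [52.0, 55.1, 58.9, 61.0, 50.3]
print(f"MNLI-RTE correlation across baselines: {pearson(mnli, rte):.2f}")
```

A correlation near 1.0 between two tasks would suggest they measure overlapping capabilities, which is exactly what the referee's independence concern is about.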

Circularity Check

0 steps flagged

No circularity: GLUE is a definitional benchmark without derivations or self-referential reductions.

full rationale

The paper introduces GLUE by selecting and aggregating nine existing NLU datasets (MNLI, QQP, etc.) and adding a diagnostic suite. No equations, fitted parameters, predictions, or uniqueness theorems appear. The claim that the collection measures 'general' NLU rests on an explicit assumption of task diversity rather than any derivation that reduces to its own inputs or prior self-citations. This is a resource paper whose central contribution is definitional and externally evaluable; no load-bearing step collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that the selected tasks collectively probe general NLU without introducing new free parameters, axioms beyond standard task definitions, or invented entities.

axioms (1)
  • domain assumption The selected NLU tasks are representative of general language understanding capabilities.
    Invoked in the motivation for combining tasks to incentivize sharing knowledge across limited-data settings.

pith-pipeline@v0.9.0 · 5467 in / 1077 out tokens · 41467 ms · 2026-05-12T21:18:30.925461+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  2. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  3. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  4. EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

    cs.LG 2026-05 unverdicted novelty 7.0

    EpiCastBench supplies 40 curated multivariate epidemic datasets and evaluates 15 forecasting models under unified preprocessing, horizons, metrics, and significance tests.

  5. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  6. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  7. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  8. MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...

  9. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  10. SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

    cs.LG 2026-04 unverdicted novelty 7.0

    LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.

  11. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  12. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  13. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  14. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  15. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  16. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  17. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  18. AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive resul...

  19. Finding Meaning in Embeddings: Concept Separation Curves

    cs.CL 2026-04 unverdicted novelty 6.0

    Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.

  20. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  21. Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.

  22. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  23. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  24. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  25. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  26. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  27. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  28. Linformer: Self-Attention with Linear Complexity

    cs.LG 2020-06 conditional novelty 6.0

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.

  29. Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

    cs.AR 2026-04 conditional novelty 5.0

    Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.

  30. A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

    cs.LG 2026-04 unverdicted novelty 5.0

    KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

  31. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  32. BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

    cs.LG 2026-04 unverdicted novelty 5.0

    BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.

  33. Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

    cs.AR 2026-04 unverdicted novelty 4.0

    Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.

  34. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  35. GLU Variants Improve Transformer

    cs.LG 2020-02 unverdicted novelty 4.0

    Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 35 Pith papers

  1. [1]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, 2015

  2. [2]

    The second PASCAL recognising textual entailment challenge

    Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL recognising textual entailment challenge. 2006

  3. [3]

    The fifth PASCAL recognizing textual entailment challenge

    Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The fifth PASCAL recognizing textual entailment challenge. 2009

  4. [4]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 632--642. Association for Computational Linguistics, 2015

  5. [5]

    Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. In Eleventh International Workshop on Semantic Evaluations, 2017

  6. [6]

    One billion word benchmark for measuring progress in statistical language modeling

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint 1312.3005, 2013

  7. [7]

    Natural language processing (almost) from scratch

    Ronan Collobert, Jason Weston, L \'e on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12 0 (Aug): 0 2493--2537, 2011

  8. [8]

    Sent E val: An evaluation toolkit for universal sentence representations

    Alexis Conneau and Douwe Kiela. Sent E val: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018

  9. [9]

    Supervised learning of universal sentence representations from natural language inference data

    Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo \" c Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, September 9-11, 2017, pp.\ 681--691, 2017

  10. [10]

    Using the framework

    Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, Steve Pulman, Ted Briscoe, Holger Maier, and Karsten Konrad. Using the framework. Technical report, The F ra C a S Consortium, 1996

  11. [11]

    The PASCAL recognising textual entailment challenge

    Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pp.\ 177--190. Springer, 2006

  12. [12]

    Transforming question answering datasets into natural language inference datasets

    Dorottya Demszky, Kelvin Guu, and Percy Liang. Transforming question answering datasets into natural language inference datasets. arXiv preprint 1809.02922, 2018

  13. [13]

    Automatically constructing a corpus of sentential paraphrases

    William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005

  14. [14]

    Towards linguistically generalizable NLP systems: A workshop and shared task

    Allyson Ettinger, Sudha Rao, Hal Daum \'e III, and Emily M Bender. Towards linguistically generalizable NLP systems: A workshop and shared task. In First Workshop on Building Linguistically Generalizable NLP Systems, 2017

  15. [15]

    Liu, Matthew Peters, Michael Schmitz, and Luke S

    Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. Allen NLP : A deep semantic natural language processing platform. 2017

  16. [16]

    The third PASCAL recognizing textual entailment challenge

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp.\ 1--9. Association for Computational Linguistics, 2007

  17. [17]

    Comparing two k-category assignments by a k-category correlation coefficient

    Jan Gorodkin. Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem., 28 0 (5-6): 0 367--374, December 2004. ISSN 1476-9271

  18. [18]

    Bowman, and Noah A

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  19. [19]

    A joint many-task model: Growing a neural network for multiple nlp tasks

    Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task model: Growing a neural network for multiple nlp tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2017

  20. [20]

    Learning distributed representations of sentences from unlabelled data

    Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016

  21. [21]

    Mining and summarizing customer reviews

    Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.\ 168--177. ACM, 2004

  22. [22]

    Bag of tricks for efficient text classification

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint 1607.01759, 2016

  23. [23]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015

  24. [24]

    Skip- T hought vectors

    Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip- T hought vectors. In Advances in Neural Information Processing Systems, pp.\ 3294--3302, 2015

  25. [25]

    Distributed representations of sentences and documents

    Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Eric P. Xing and Tony Jebara (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp.\ 1188--1196, Bejing, China, 22--24 Jun 2014. PMLR

  26. [26]

    The W inograd schema challenge

    Hector J Levesque, Ernest Davis, and Leora Morgenstern. The W inograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning , volume 46, pp.\ 47, 2011

  27. [27]

    Comparison of the predicted and observed secondary structure of t4 phage lysozyme

    Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405 0 (2): 0 442--451, 1975

  28. [28]

    Learned in translation: Contextualized word vectors

    Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pp.\ 6297--6308, 2017

  29. [29]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint 1806.08730, 2018

  30. [30]

    Thomas McCoy and Tal Linzen

    R. Thomas McCoy and Tal Linzen. Non-entailed subsequences as a challenge for natural language inference. In Proceedings of the Society for Computation in Linguistics, volume 2, pp.\ 357--360, 2019

  31. [31]

    Dissent: Sentence representation learning from explicit discourse relations

    Allen Nie, Erin D Bennett, and Noah D Goodman. Dissent: Sentence representation learning from explicit discourse relations. arXiv preprint 1710.04334, 2017

  32. [32]

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts

    Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp.\ 271. Association for Computational Linguistics, 2004

  33. [33]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp.\ 115--124. Association for Computational Linguistics, 2005

  34. [34]

    GloVe: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 1532--1543, 2014

  35. [35]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  36. [36]

    Hypothesis only baselines in natural language inference

    Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. In *SEM@NAACL-HLT, 2018

  37. [37]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 2383--2392. Association for Computational Linguistics, 2016

  38. [38]

    Reasoning about entailment with neural attention

    Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations, 2016

  39. [39]

    Sluice networks: Learning what to share between loosely related tasks

    Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint 1705.08142, 2017

  40. [40]

    The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task

    Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of CoNLL, 2017

  41. [41]

    Bidirectional attention flow for machine comprehension

    Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations, 2017

  42. [42]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 1631--1642, 2013

  43. [43]

    Deep multi-task learning with low level tasks supervised at lower layers

    Anders Søgaard and Yoav Goldberg. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pp.\ 231--235, 2016

  44. [44]

    Learning general purpose distributed sentence representations via large scale multi-task learning

    Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of the International Conference on Learning Representations, 2018

  45. [45]

    Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment

    Masatoshi Tsuchiya. Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7--12, 2018. European Language Resources Association (ELRA)

  46. [46]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp.\ 6000--6010, 2017

  47. [47]

    The TREC-8 question answering track report

    Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pp.\ 77--82, 1999

  48. [48]

    Neural network acceptability judgments

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint 1805.12471, 2018

  49. [49]

    Inference is everything: Recasting semantic resources into a unified evaluation framework

    Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp.\ 996--1005, 2017

  50. [50]

    Annotating expressions of opinions and emotions in language

    Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2--3):165--210. Springer, 2005

  51. [51]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  52. [52]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the International Conference on Computer Vision, pp.\ 19--27, 2015