pith. machine review for the scientific record.

arxiv: 1804.07461 · v3 · submitted 2018-04-20 · 💻 cs.CL

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Felix Hill, Julian Michael, Omer Levy, Samuel R. Bowman

Pith reviewed 2026-05-12 21:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords GLUE · natural language understanding · benchmark · multi-task learning · transfer learning · NLU evaluation · diagnostic analysis

The pith

GLUE supplies a benchmark of nine NLU tasks plus diagnostics to test models for general rather than task-specific language understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLUE as a single evaluation platform that aggregates results across existing natural language understanding tasks to measure how well models handle language in a general way. Current models often excel only when trained separately on one task at a time, so the benchmark includes several tasks with very small training sets to reward approaches that share knowledge across problems. It also supplies a separate hand-crafted diagnostic suite that breaks down model errors by specific linguistic features such as coreference or negation. Baseline experiments with multi-task and transfer methods show they produce little gain over training one model per task, which points to the need for new techniques that truly generalize.

Core claim

GLUE is a model-agnostic collection of nine NLU tasks plus a diagnostic test suite that together measure whether a system exhibits broad language understanding, and current multi-task baselines fail to improve substantially over the aggregate score obtained by training separate models per task.

What carries the argument

The GLUE benchmark itself, which combines performance scores from nine tasks with limited-data subsets and a hand-crafted diagnostic test suite for linguistic analysis.
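The aggregate score that carries this comparison is, to a first approximation, a macro-average: a task's metrics are averaged when it reports more than one, and the per-task scores are then averaged across the nine tasks. A minimal sketch under that assumption; the task names and metric types are GLUE's, but every numeric score below is an invented placeholder, not a result from the paper.

```python
# Sketch of a GLUE-style aggregate: average a task's metrics when it
# reports more than one, then macro-average across tasks. The task names
# are the real GLUE tasks; every score below is an invented placeholder.
from statistics import mean

def glue_score(task_scores):
    """Mean of per-task scores, where a task's score is the mean of its metrics."""
    return mean(mean(metrics) for metrics in task_scores.values())

scores = {
    "CoLA":  [35.0],        # Matthews correlation
    "SST-2": [90.2],        # accuracy
    "MRPC":  [85.1, 79.3],  # F1, accuracy
    "STS-B": [81.0, 80.2],  # Pearson, Spearman
    "QQP":   [70.1, 88.0],  # F1, accuracy
    "MNLI":  [76.5],        # accuracy
    "QNLI":  [79.0],        # accuracy
    "RTE":   [58.1],        # accuracy
    "WNLI":  [56.3],        # accuracy
}
print(round(glue_score(scores), 2))  # 70.77
```

Macro-averaging gives each task equal weight regardless of test-set size, which is what lets the small-data tasks exert leverage on the headline number.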

If this is right

  • A single aggregate score can rank models on their overall language understanding ability.
  • Training regimes that move knowledge between tasks become directly measurable and rewarded.
  • The diagnostic suite can isolate which linguistic phenomena still cause models to fail.
  • Further progress requires methods that go beyond simple multi-task fine-tuning.
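The third bullet, isolating which phenomena cause failures, amounts to bucketing a model's errors by linguistic tags. A minimal sketch with invented items, tags, and predictions; the real diagnostic suite scores each phenomenon with a correlation-style coefficient rather than the raw accuracy used here.

```python
# Sketch of a per-phenomenon error breakdown in the spirit of the GLUE
# diagnostic suite. Items and predictions are invented; the real suite
# scores each phenomenon with a correlation-style coefficient rather
# than the raw accuracy shown here.
from collections import defaultdict

def breakdown(examples):
    """examples: (phenomenon_tags, gold_label, predicted_label) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tags, gold, pred in examples:
        for tag in tags:
            totals[tag] += 1
            hits[tag] += int(gold == pred)
    return {tag: hits[tag] / totals[tag] for tag in totals}

preds = [
    (["negation"],                "contradiction", "entailment"),
    (["negation", "coreference"], "entailment",    "entailment"),
    (["coreference"],             "neutral",       "neutral"),
]
print(breakdown(preds))  # {'negation': 0.5, 'coreference': 1.0}
```

Because one example can carry several tags, the buckets overlap; a model's aggregate score can look healthy while a single phenomenon bucket collapses.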

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • GLUE could serve as a stable reference point for comparing new NLU systems over time.
  • Adding tasks that probe longer-range reasoning or world knowledge would test whether current high scores reflect deeper understanding.
  • If GLUE scores predict success on downstream applications, the benchmark could guide practical model selection.

Load-bearing premise

The nine chosen tasks are diverse enough to stand in for general language understanding rather than measuring narrow skills.

What would settle it

A model that scores high on the full GLUE suite but collapses on new tasks that require the same linguistic skills in fresh combinations would show the benchmark does not capture generality.

read the original abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the GLUE benchmark as a collection of nine existing NLU tasks (MNLI, QQP, SST-2, CoLA, STS-B, MRPC, RTE, WNLI, QNLI) chosen for diversity in type and data size, along with a hand-crafted diagnostic test suite for linguistic analysis. It evaluates single-task, multi-task, and transfer-learning baselines and reports that the latter approaches do not yield substantial aggregate improvements over per-task training, suggesting room for better general NLU methods.

Significance. If the task collection is representative and the baseline comparisons are reproducible, the work supplies a standardized, model-agnostic platform that directly incentivizes cross-task knowledge sharing and has already become a de facto evaluation standard. The explicit release of the benchmark, code, and diagnostic suite constitutes a concrete reproducibility strength that supports community-wide adoption and iterative improvement.

major comments (3)
  1. [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.
  2. [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.
  3. [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'
minor comments (2)
  1. [§5] §5 (diagnostic suite): a few concrete example items for each linguistic phenomenon would improve clarity and allow readers to assess the suite's coverage without consulting external resources.
  2. [Table 1] Table 1 and §3: the WNLI task description should explicitly note its known label-distribution artifacts, as these affect interpretation of model performance on that sub-task.
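Major comment 3 asks for variance estimates and significance tests across runs. One stdlib-only way to compare single-task and multi-task training would be a two-sided permutation test on the difference of mean GLUE scores across seeds. A sketch; the per-seed scores are invented placeholders, not numbers from the paper.

```python
# Sketch of the multi-seed significance check asked for in major comment 3:
# a two-sided permutation test on the difference of mean GLUE scores
# between single-task and multi-task training. Per-seed scores are invented.
import random
from statistics import mean, stdev

def perm_test(a, b, n=10000, seed=0):
    """p-value for |mean(a) - mean(b)| under random relabeling of scores."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            extreme += 1
    return extreme / n

single = [70.1, 69.8, 70.4, 70.0, 69.9]  # one GLUE score per seed (invented)
multi  = [70.3, 70.6, 69.9, 70.5, 70.2]
print(f"single {mean(single):.2f}±{stdev(single):.2f}, "
      f"multi {mean(multi):.2f}±{stdev(multi):.2f}, "
      f"p = {perm_test(single, multi):.3f}")
```

With only a handful of seeds a permutation test is about as much power as the data supports; reporting the means and standard deviations alongside the p-value keeps the "no substantial improvement" claim auditable.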

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading of our manuscript and their recommendation for minor revision. We address each major comment below, indicating where we will make revisions to the paper.

read point-by-point responses
  1. Referee: [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.

    Authors: We agree that the task selection in §3 relies on qualitative arguments regarding the diversity of the tasks in terms of format, size, and the phenomena they test. While this diversity is detailed in the paper and supported by the diagnostic suite, we acknowledge the benefit of quantitative evidence. In the revised manuscript, we will add an analysis of inter-task error correlations computed from our baseline models to provide quantitative support for the tasks measuring somewhat independent capabilities. revision: yes

  2. Referee: [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.

    Authors: We thank the referee for pointing this out. The original manuscript and accompanying code release aimed to provide sufficient details, but we agree that explicit specifications are needed for full reproducibility. We will revise §4 to include the precise task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds used in our experiments. revision: yes

  3. Referee: [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'

    Authors: We agree that including variance estimates and significance tests would strengthen the experimental claims. At the time of the original submission, we reported results from single runs due to computational constraints. For the revised version, we will re-run the main single-task and multi-task experiments with multiple random seeds to report means and standard deviations, and include statistical comparisons where appropriate. revision: yes
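The first response promises inter-task correlations computed from baseline models. One way to realize that on aggregate numbers is to treat each baseline model as an observation and correlate its scores on a pair of tasks. A sketch only: the per-model scores are invented, and the analysis the authors describe may instead operate on per-example error indicators.

```python
# Sketch of the inter-task correlation analysis promised in response 1.
# Each baseline model is one observation; we correlate its scores on two
# tasks. The per-model scores are invented, and the paper's analysis may
# instead use per-example error indicators.
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm

mnli = [60.1, 65.3, 70.2, 72.8, 58.4]  # invented scores for five baselines
rte  = [52.0, 55.1, 58.9, 61.0, 50.3]
print(f"MNLI-RTE correlation across baselines: {pearson(mnli, rte):.2f}")
```

A correlation near 1.0 between two tasks would suggest they measure overlapping capabilities, which is exactly what the referee's independence concern is about.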

Circularity Check

0 steps flagged

No circularity: GLUE is a definitional benchmark without derivations or self-referential reductions.

full rationale

The paper introduces GLUE by selecting and aggregating nine existing NLU datasets (MNLI, QQP, etc.) and adding a diagnostic suite. No equations, fitted parameters, predictions, or uniqueness theorems appear. The claim that the collection measures 'general' NLU rests on an explicit assumption of task diversity rather than any derivation that reduces to its own inputs or prior self-citations. This is a resource paper whose central contribution is definitional and externally evaluable; no load-bearing step collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that the selected tasks collectively probe general NLU without introducing new free parameters, axioms beyond standard task definitions, or invented entities.

axioms (1)
  • domain assumption The selected NLU tasks are representative of general language understanding capabilities.
    Invoked in the motivation for combining tasks to incentivize sharing knowledge across limited-data settings.

pith-pipeline@v0.9.0 · 5467 in / 1077 out tokens · 41467 ms · 2026-05-12T21:18:30.925461+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  2. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  3. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  4. EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

    cs.LG 2026-05 unverdicted novelty 7.0

    EpiCastBench supplies 40 curated multivariate epidemic datasets and evaluates 15 forecasting models under unified preprocessing, horizons, metrics, and significance tests.

  5. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  6. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  7. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  8. MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...

  9. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  10. SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

    cs.LG 2026-04 unverdicted novelty 7.0

    LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.

  11. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  12. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  13. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  14. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  15. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  16. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  17. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  18. AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive resul...

  19. Finding Meaning in Embeddings: Concept Separation Curves

    cs.CL 2026-04 unverdicted novelty 6.0

    Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.

  20. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  21. Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.

  22. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  23. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  24. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  25. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  26. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  27. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  28. Linformer: Self-Attention with Linear Complexity

    cs.LG 2020-06 conditional novelty 6.0

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.

  29. Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

    cs.AR 2026-04 conditional novelty 5.0

    Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.

  30. A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

    cs.LG 2026-04 unverdicted novelty 5.0

    KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

  31. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  32. BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

    cs.LG 2026-04 unverdicted novelty 5.0

    BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.

  33. Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

    cs.AR 2026-04 unverdicted novelty 4.0

    Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.

  34. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  35. GLU Variants Improve Transformer

    cs.LG 2020-02 unverdicted novelty 4.0

    Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 35 Pith papers

  1. [1]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, 2015

  2. [2]

    The second PASCAL recognising textual entailment challenge

    Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL recognising textual entailment challenge. 2006

  3. [3]

    The fifth PASCAL recognizing textual entailment challenge

    Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The fifth PASCAL recognizing textual entailment challenge. 2009

  4. [4]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 632--642. Association for Computational Linguistics, 2015

  5. [5]

    Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. In Eleventh International Workshop on Semantic Evaluations, 2017

  6. [6]

    One billion word benchmark for measuring progress in statistical language modeling

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint 1312.3005, 2013

  7. [7]

    Natural language processing (almost) from scratch

    Ronan Collobert, Jason Weston, L \'e on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12 0 (Aug): 0 2493--2537, 2011

  8. [8]

    Sent E val: An evaluation toolkit for universal sentence representations

    Alexis Conneau and Douwe Kiela. Sent E val: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018

  9. [9]

    Supervised learning of universal sentence representations from natural language inference data

    Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo \" c Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, September 9-11, 2017, pp.\ 681--691, 2017

  10. [10]

    Using the framework

    Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, Steve Pulman, Ted Briscoe, Holger Maier, and Karsten Konrad. Using the framework. Technical report, The F ra C a S Consortium, 1996

  11. [11]

    The PASCAL recognising textual entailment challenge

    Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pp.\ 177--190. Springer, 2006

  12. [12]

    Transforming question answering datasets into natural language inference datasets

    Dorottya Demszky, Kelvin Guu, and Percy Liang. Transforming question answering datasets into natural language inference datasets. arXiv preprint 1809.02922, 2018

  13. [13]

    Automatically constructing a corpus of sentential paraphrases

    William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005

  14. [14]

    Towards linguistically generalizable NLP systems: A workshop and shared task

    Allyson Ettinger, Sudha Rao, Hal Daum \'e III, and Emily M Bender. Towards linguistically generalizable NLP systems: A workshop and shared task. In First Workshop on Building Linguistically Generalizable NLP Systems, 2017

  15. [15]

    Liu, Matthew Peters, Michael Schmitz, and Luke S

    Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. Allen NLP : A deep semantic natural language processing platform. 2017

  16. [16]

    The third PASCAL recognizing textual entailment challenge

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp.\ 1--9. Association for Computational Linguistics, 2007

  17. [17]

    Comparing two k-category assignments by a k-category correlation coefficient

    Jan Gorodkin. Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem., 28 0 (5-6): 0 367--374, December 2004. ISSN 1476-9271

  18. [18]

    Bowman, and Noah A

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  19. [19]

    A joint many-task model: Growing a neural network for multiple nlp tasks

    Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task model: Growing a neural network for multiple nlp tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2017

  20. [20]

    Learning distributed representations of sentences from unlabelled data

    Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016

  21. [21]

    Mining and summarizing customer reviews

    Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.\ 168--177. ACM, 2004

  22. [22]

    Bag of tricks for efficient text classification

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint 1607.01759, 2016

  23. [23]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015

  24. [24]

    Skip- T hought vectors

    Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip- T hought vectors. In Advances in Neural Information Processing Systems, pp.\ 3294--3302, 2015

  25. [25]

    Distributed representations of sentences and documents

    Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Eric P. Xing and Tony Jebara (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp.\ 1188--1196, Bejing, China, 22--24 Jun 2014. PMLR

  26. [26]

    The W inograd schema challenge

    Hector J Levesque, Ernest Davis, and Leora Morgenstern. The W inograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning , volume 46, pp.\ 47, 2011

  27. [27]

    Comparison of the predicted and observed secondary structure of t4 phage lysozyme

    Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405 0 (2): 0 442--451, 1975

  28. [28]

    Learned in translation: Contextualized word vectors

    Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pp.\ 6297--6308, 2017

  29. [29]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint 1806.08730, 2018

  30. [30]

    Thomas McCoy and Tal Linzen

    R. Thomas McCoy and Tal Linzen. Non-entailed subsequences as a challenge for natural language inference. In Proceedings of the Society for Computation in Linguistics, volume 2, pp.\ 357--360, 2019

  31. [31]

    Dissent: Sentence representation learning from explicit discourse relations

    Allen Nie, Erin D Bennett, and Noah D Goodman. Dissent: Sentence representation learning from explicit discourse relations. arXiv preprint 1710.04334, 2017

  32. [32]

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts

    Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp.\ 271. Association for Computational Linguistics, 2004

  33. [33]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp.\ 115--124. Association for Computational Linguistics, 2005

  34. [34]

    GloVe: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 1532--1543, 2014

  35. [35]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  36. [36]

    Hypothesis only baselines in natural language inference

    Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. In *SEM@NAACL-HLT, 2018

  37. [37]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 2383--2392. Association for Computational Linguistics, 2016

  38. [38]

    Reasoning about entailment with neural attention

    Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations, 2016

  39. [39]

    Sluice networks: Learning what to share between loosely related tasks

    Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint 1705.08142, 2017

  40. [40]

    The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task

    Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of CoNLL, 2017

  41. [41]

    Bidirectional attention flow for machine comprehension

    Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations, 2017

  42. [42]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 1631--1642, 2013

  43. [43]

    Deep multi-task learning with low level tasks supervised at lower layers

    Anders Søgaard and Yoav Goldberg. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pp.\ 231--235, 2016

  44. [44]

    Learning general purpose distributed sentence representations via large scale multi-task learning

    Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of the International Conference on Learning Representations, 2018

  45. [45]

    Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment

    Masatoshi Tsuchiya. Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7--12, 2018. European Language Resources Association (ELRA)

  46. [46]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp.\ 6000--6010, 2017

  47. [47]

    The TREC-8 question answering track report

    Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pp.\ 77--82, 1999

  48. [48]

    Neural network acceptability judgments

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint 1805.12471, 2018

  49. [49]

    Inference is everything: Recasting semantic resources into a unified evaluation framework

    Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp.\ 996--1005, 2017

  50. [50]

    Annotating expressions of opinions and emotions in language

    Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2--3):165--210. Springer, 2005

  51. [51]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  52. [52]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the International Conference on Computer Vision, pp.\ 19--27, 2015