GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Pith reviewed 2026-05-12 21:18 UTC · model grok-4.3
The pith
GLUE supplies a benchmark of nine NLU tasks plus diagnostics to test models for general rather than task-specific language understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLUE is a model-agnostic collection of nine NLU tasks and a diagnostic test suite that together measure whether a system exhibits broad language understanding; current multi-task baselines fail to improve substantially over the aggregate score obtained by training separate models per task.
What carries the argument
The GLUE benchmark itself, which aggregates performance scores from nine tasks, several of which have very limited training data, and pairs them with a hand-crafted diagnostic test suite for linguistic analysis.
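To make the aggregate concrete, here is a minimal sketch of GLUE-style macro-averaging in Python; the per-task metric assignments follow the paper (Matthews correlation for CoLA, Pearson and Spearman correlation for STS-B, accuracy and F1 elsewhere), while the numeric scores themselves are invented for illustration.

    # Minimal sketch of a GLUE-style aggregate score (scores are hypothetical).
    # Tasks with two metrics are averaged internally, then the nine task
    # scores are macro-averaged into a single headline number.
    task_scores = {
        "CoLA":  [0.35],        # Matthews correlation
        "SST-2": [0.90],        # accuracy
        "MRPC":  [0.80, 0.85],  # accuracy, F1
        "STS-B": [0.72, 0.71],  # Pearson, Spearman
        "QQP":   [0.84, 0.63],  # accuracy, F1
        "MNLI":  [0.72],        # matched/mismatched accuracy, simplified here
        "QNLI":  [0.79],        # accuracy
        "RTE":   [0.58],        # accuracy
        "WNLI":  [0.65],        # accuracy
    }
    per_task = {t: sum(m) / len(m) for t, m in task_scores.items()}
    glue_score = sum(per_task.values()) / len(per_task)
    print(f"GLUE score: {glue_score:.3f}")

Because the average is unweighted, a low-resource task such as WNLI moves the headline number as much as MNLI, which is part of how the benchmark rewards cross-task knowledge sharing.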
If this is right
- A single aggregate score can rank models on their overall language understanding ability.
- Training regimes that move knowledge between tasks become directly measurable and rewarded.
- The diagnostic suite can isolate which linguistic phenomena still cause models to fail.
- Further progress requires methods that go beyond simple multi-task fine-tuning.
Where Pith is reading between the lines
- GLUE could serve as a stable reference point for comparing new NLU systems over time.
- Adding tasks that probe longer-range reasoning or world knowledge would test whether current high scores reflect deeper understanding.
- If GLUE scores predict success on downstream applications, the benchmark could guide practical model selection.
Load-bearing premise
The nine chosen tasks are diverse enough to stand in for general language understanding rather than measuring narrow skills.
What would settle it
A model that scores high on the full GLUE suite but collapses on new tasks that require the same linguistic skills in fresh combinations would show the benchmark does not capture generality.
Original abstract
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the GLUE benchmark as a collection of nine existing NLU tasks (MNLI, QQP, SST-2, CoLA, STS-B, MRPC, RTE, WNLI, QNLI) chosen for diversity in type and data size, along with a hand-crafted diagnostic test suite for linguistic analysis. It evaluates single-task, multi-task, and transfer-learning baselines and reports that the latter approaches do not yield substantial aggregate improvements over per-task training, suggesting room for better general NLU methods.
Significance. If the task collection is representative and the baseline comparisons are reproducible, the work supplies a standardized, model-agnostic platform that directly incentivizes cross-task knowledge sharing and has already become a de facto evaluation standard. The explicit release of the benchmark, code, and diagnostic suite constitutes a concrete reproducibility strength that supports community-wide adoption and iterative improvement.
major comments (3)
- [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.
- [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.
- [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'
minor comments (2)
- [§5] §5 (diagnostic suite): a few concrete example items for each linguistic phenomenon would improve clarity and allow readers to assess the suite's coverage without consulting external resources; a format sketch follows this list.
- [Table 1] Table 1 and §3: the WNLI task description should explicitly note its known label-distribution artifacts, as these affect interpretation of model performance on that sub-task.
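To illustrate the shape of a diagnostic item, the sketch below shows tagged NLI pairs in the suite's format; the sentences and fine-grained tag strings are invented, though the coarse categories (lexical semantics, predicate-argument structure, logic, knowledge) follow the paper's taxonomy.

    # Hypothetical items in the diagnostic suite's NLI format: each premise/
    # hypothesis pair carries an entailment label plus phenomenon tags.
    diagnostic_items = [
        {
            "premise": "The cat sat on the mat.",
            "hypothesis": "The cat did not sit on the mat.",
            "label": "contradiction",
            "tags": ["logic: negation"],
        },
        {
            "premise": "Alice sold the car to Bob.",
            "hypothesis": "Bob bought the car from Alice.",
            "label": "entailment",
            "tags": ["lexical semantics: converse relations"],
        },
    ]
    for item in diagnostic_items:
        print(item["tags"], "->", item["label"])

Per-phenomenon accuracy can then be tabulated by grouping model predictions over these tags.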
Simulated Author's Rebuttal
We thank the referee for their careful reading of our manuscript and their recommendation for minor revision. We address each major comment below, indicating where we will make revisions to the paper.
Point-by-point responses
-
Referee: [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.
Authors: We agree that the task selection in §3 relies on qualitative arguments about the diversity of the tasks in format, size, and the phenomena they test. While this diversity is detailed in the paper and supported by the diagnostic suite, we acknowledge the benefit of quantitative evidence. In the revised manuscript, we will add an analysis of inter-task error correlations computed from our baseline models to provide quantitative support that the tasks measure somewhat independent capabilities. revision: yes
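One way the promised analysis could be run, sketched under the assumption that per-task scores are available for a set of baseline models (the model rows and numbers below are invented):

    # Sketch: correlate task scores across baseline models to gauge whether
    # the tasks measure somewhat independent capabilities.
    import numpy as np

    tasks = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE"]
    scores = np.array([  # rows = baseline models, columns = tasks (hypothetical)
        [0.18, 0.86, 0.77, 0.66, 0.80, 0.66, 0.75, 0.55],
        [0.11, 0.85, 0.76, 0.62, 0.79, 0.65, 0.74, 0.52],
        [0.28, 0.90, 0.80, 0.72, 0.83, 0.70, 0.78, 0.58],
        [0.05, 0.83, 0.74, 0.60, 0.78, 0.63, 0.72, 0.50],
    ])
    corr = np.corrcoef(scores, rowvar=False)  # task-by-task correlation matrix
    print(np.round(corr, 2))

Low off-diagonal correlations would support the diversity claim; uniformly high ones would suggest the nine tasks reward a shared, narrower skill. With only a handful of baselines the estimates are noisy, so per-example error overlap on shared phenomena (via the diagnostic suite) would be a natural complement.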
-
Referee: [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.
Authors: We thank the referee for pointing this out. The original manuscript and accompanying code release aimed to provide sufficient details, but we agree that explicit specifications are needed for full reproducibility. We will revise §4 to include the precise task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds used in our experiments. revision: yes
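The kind of specification the revision promises can be pinned down in a few lines. The sketch below fixes a random seed and samples tasks in proportion to training-set size; the sizes are approximate and the training_step stub is a placeholder, not the paper's actual configuration.

    # Sketch: explicit task-sampling ratios and seeding for multi-task training.
    import random

    random.seed(1234)  # fixed seed makes the sampling schedule reproducible

    train_sizes = {"MNLI": 393_000, "QQP": 364_000, "QNLI": 105_000,
                   "SST-2": 67_000, "CoLA": 8_500, "STS-B": 5_700,
                   "MRPC": 3_700, "RTE": 2_500}  # approximate GLUE training sizes
    total = sum(train_sizes.values())
    names, weights = zip(*[(t, n / total) for t, n in train_sizes.items()])

    def training_step(task):
        pass  # placeholder: one optimizer step on a batch from `task`

    for step in range(10_000):  # task sampled per step, proportional to size
        training_step(random.choices(names, weights=weights, k=1)[0])

Proportional sampling is one defensible choice among several (uniform and annealed schedules are common alternatives), which is exactly why the ratios must be stated for the result to be reproducible.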
-
Referee: [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'
Authors: We agree that including variance estimates and significance tests would strengthen the experimental claims. At the time of the original submission, we reported results from single runs due to computational constraints. For the revised version, we will re-run the main single-task and multi-task experiments with multiple random seeds to report means and standard deviations, and include statistical comparisons where appropriate. revision: yes
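A minimal version of the promised protocol, assuming per-seed GLUE scores have been collected for both regimes (the numbers below are invented):

    # Sketch: mean +/- std over seeds plus a paired significance test for
    # single-task vs. multi-task GLUE scores (seed-matched pairs).
    import statistics
    from scipy import stats

    single_task = [68.9, 69.4, 68.7, 69.1, 69.0]  # hypothetical, 5 seeds
    multi_task  = [69.2, 69.0, 69.5, 68.8, 69.3]

    for name, xs in [("single-task", single_task), ("multi-task", multi_task)]:
        print(f"{name}: {statistics.mean(xs):.2f} +/- {statistics.stdev(xs):.2f}")

    t_stat, p_value = stats.ttest_rel(multi_task, single_task)
    print(f"paired t-test: t={t_stat:.2f}, p={p_value:.3f}")

With only a few seeds a nonparametric alternative such as a permutation or bootstrap test may be preferable; the point is that any claim of 'no substantial improvement' needs an uncertainty estimate attached.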
Circularity Check
No circularity: GLUE is a definitional benchmark without derivations or self-referential reductions
Full rationale
The paper introduces GLUE by selecting and aggregating nine existing NLU datasets (MNLI, QQP, etc.) and adding a diagnostic suite. No equations, fitted parameters, predictions, or uniqueness theorems appear. The claim that the collection measures 'general' NLU rests on an explicit assumption of task diversity rather than any derivation that reduces to its own inputs or prior self-citations. This is a resource paper whose central contribution is definitional and externally evaluable; no load-bearing step collapses by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected NLU tasks are representative of general language understanding capabilities.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced · relevance: unclear · matched claim: "We introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks."
Forward citations
Cited by 35 Pith papers
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting
EpiCastBench supplies 40 curated multivariate epidemic datasets and evaluates 15 forecasting models under unified preprocessing, horizons, metrics, and significance tests.
-
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Analysis and Explainability of LLMs Via Evolutionary Methods
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
-
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...
-
Winner-Take-All Spiking Transformer for Language Modeling
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
-
SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates
LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.
-
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation
AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive resul...
-
Finding Meaning in Embeddings: Concept Separation Curves
Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
-
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
-
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
-
LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices
Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.
-
A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.
-
Adaptive Spiking Neurons for Vision and Language Modeling
ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
-
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.
-
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
GLU Variants Improve Transformer
Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.