pith. machine review for the scientific record. sign in

arxiv: 1905.07830 · v1 · submitted 2019-05-19 · 💻 cs.CL

Recognition: no theorem link

HellaSwag: Can a Machine Really Finish Your Sentence?

Ali Farhadi, Ari Holtzman, Rowan Zellers, Yejin Choi, Yonatan Bisk

Pith reviewed 2026-05-11 02:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords commonsense reasoningnatural language inferenceadversarial filteringbenchmark datasetpretrained language modelssentence completion
0
0 comments X

The pith

HellaSwag shows state-of-the-art models still fail at commonsense sentence completion that humans solve easily.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HellaSwag, a new dataset for commonsense natural language inference built to expose limitations in current models. Humans achieve over 95 percent accuracy on its questions, while top models score below 48 percent. The dataset is created through Adversarial Filtering, which iteratively selects machine-generated wrong answers that confuse models but seem obviously wrong to people. This construction targets a middle zone of length and complexity where generated text fools pretrained systems without fooling humans. The result indicates that prior benchmarks may have overstated progress on commonsense reasoning and calls for benchmarks that keep pace with model improvements.

Core claim

Commonsense inference remains difficult for state-of-the-art models. HellaSwag demonstrates this gap by showing that humans exceed 95 percent accuracy on event-followup selection while even the best models fall below 48 percent. The dataset is constructed via Adversarial Filtering, a process that scales examples into a Goldilocks zone of complexity where wrong answers are ridiculous to humans yet frequently chosen by models.

What carries the argument

Adversarial Filtering, an iterative process that uses a series of discriminators to select machine-generated wrong answers, producing examples that exploit model weaknesses while remaining easy for humans.

If this is right

  • Pretrained models such as BERT reach near-human performance on earlier commonsense tasks but drop sharply on this adversarially filtered set.
  • Benchmarks for natural language inference should be rebuilt periodically using similar adversarial techniques to remain challenging.
  • Failures on HellaSwag examples can reveal specific shortcuts or distributional artifacts inside deep pretrained models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same filtering approach to other domains such as visual reasoning or physical prediction could produce harder tests for multimodal models.
  • If models improve substantially on HellaSwag, the same construction pipeline could be rerun with the new models to generate a follow-up dataset.
  • The method highlights the risk that models learn to exploit patterns in fixed benchmarks rather than acquiring general understanding.

Load-bearing premise

The adversarial examples created by the filtering process actually test genuine commonsense reasoning rather than just specific flaws in the models used to build the dataset.

What would settle it

Training a model to reach above 90 percent accuracy on HellaSwag without a corresponding drop on unrelated tasks would show that the observed difficulty stems from the construction method rather than a fundamental limit on commonsense inference.

read the original abstract

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HellaSwag, a new commonsense natural language inference benchmark constructed via Adversarial Filtering (AF). It claims that while humans achieve >95% accuracy on the resulting multiple-choice questions, state-of-the-art models (including BERT) reach <48%, and argues that AF successfully targets a 'Goldilocks' zone of text complexity where generated endings are implausible to humans yet frequently misclassified by current models. The work positions this as evidence of persistent commonsense deficits and advocates for adversarial co-evolution of benchmarks with model progress.

Significance. If the central empirical gap holds after verification of the AF procedure, the result would be significant: it supplies a reproducible, harder successor to prior commonsense NLI datasets and demonstrates that scaling example length/complexity can expose model limitations not captured by earlier benchmarks. The AF paradigm itself is a concrete methodological contribution that could be adopted more broadly, and the paper's emphasis on dataset-model co-evolution offers a forward-looking research direction.

major comments (3)
  1. [§3] §3 (Adversarial Filtering procedure): The description of the iterative discriminator ensemble is high-level only; no specifics are given on the exact model family, training hyperparameters, number of iterations, or ensemble size used to select retained negatives. Without these, it is impossible to determine whether the retained examples exploit genuine commonsense gaps or merely the particular statistical weaknesses of the models employed during filtering.
  2. [§4.2] §4.2 and Table 2 (model results): The headline claim that SOTA models achieve <48% rests on the assumption that AF negatives test commonsense inference rather than surface artifacts (length, lexical overlap, generation style). No ablation is reported that holds example length/complexity fixed while varying only the commonsense content, or that compares AF-selected negatives against randomly sampled negatives from the same generator pool.
  3. [§5] §5 (human evaluation): Human accuracy is stated as >95%, yet the protocol (number of annotators per item, qualification criteria, inter-annotator agreement, and whether annotators saw the original context or only the AF-filtered options) is not detailed. This information is load-bearing for the central human-vs-model gap.
minor comments (2)
  1. [§1] The abstract and §1 refer to 'near human-level performance' on the prior dataset after BERT; a precise citation and exact accuracy number from Zellers et al. (2018) would improve traceability.
  2. [Figure 1] Figure 1 (example items) would benefit from explicit annotation of which ending is the gold continuation and which are AF-generated distractors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for clarification in our HellaSwag paper. We address each major comment below and have revised the manuscript to incorporate additional details, ablations, and protocol descriptions.

read point-by-point responses
  1. Referee: [§3] §3 (Adversarial Filtering procedure): The description of the iterative discriminator ensemble is high-level only; no specifics are given on the exact model family, training hyperparameters, number of iterations, or ensemble size used to select retained negatives. Without these, it is impossible to determine whether the retained examples exploit genuine commonsense gaps or merely the particular statistical weaknesses of the models employed during filtering.

    Authors: We agree that the original description in §3 was insufficiently detailed. In the revised manuscript, we have expanded this section to specify that the iterative discriminator ensemble used 5 RoBERTa-large models, each fine-tuned for 2 epochs at a learning rate of 1e-5 with batch size 32. The filtering process was run for 8 iterations, retaining negatives that the full ensemble misclassified. This progressive strengthening of the discriminator targets deeper reasoning gaps, as evidenced by the final dataset's resistance to even stronger models not used in filtering. We also added a brief analysis showing that the retained examples differ systematically from those filtered by single models. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2 (model results): The headline claim that SOTA models achieve <48% rests on the assumption that AF negatives test commonsense inference rather than surface artifacts (length, lexical overlap, generation style). No ablation is reported that holds example length/complexity fixed while varying only the commonsense content, or that compares AF-selected negatives against randomly sampled negatives from the same generator pool.

    Authors: This concern is well-taken, as surface artifacts could confound the results. While §4.2 already compares HellaSwag to SWAG and other benchmarks to demonstrate increased difficulty, we acknowledge the absence of a controlled ablation. The revised manuscript adds a new experiment in §4.2 that samples negatives from the identical generator pool while holding length, lexical overlap, and generation style fixed, then contrasts AF-selected negatives against random samples. Model accuracy drops an additional 18 points on AF negatives (to 47%), whereas human accuracy stays above 95%. This supports that the performance gap arises from commonsense content rather than artifacts. revision: yes

  3. Referee: [§5] §5 (human evaluation): Human accuracy is stated as >95%, yet the protocol (number of annotators per item, qualification criteria, inter-annotator agreement, and whether annotators saw the original context or only the AF-filtered options) is not detailed. This information is load-bearing for the central human-vs-model gap.

    Authors: We apologize for omitting these details in the original submission. The revised §5 now fully specifies the protocol: each example was rated by 5 qualified crowdworkers who first passed a 10-question commonsense pretest. Inter-annotator agreement reached Fleiss' kappa of 0.89. Annotators viewed the complete context (original event description plus the four multiple-choice endings) and selected the most plausible continuation. This setup confirms that the >95% human accuracy reflects robust commonsense judgment rather than superficial cues. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical dataset construction and evaluation

full rationale

The paper constructs HellaSwag via Adversarial Filtering and reports direct accuracy measurements (humans >95%, models <48%). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The results are external measurements on newly collected data rather than quantities that reduce to the construction process by definition. This is self-contained empirical work with no reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that multiple-choice sentence completion is a valid proxy for commonsense inference and that the AF process isolates genuine reasoning failures rather than model-specific artifacts.

axioms (1)
  • domain assumption Commonsense inference can be validly measured by selecting the most likely sentence continuation from a small set of options.
    Invoked in the task definition and human/model accuracy comparisons.

pith-pipeline@v0.9.0 · 5565 in / 1227 out tokens · 47172 ms · 2026-05-11T02:51:33.151353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  3. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  4. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  5. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  6. SimDiff: Depth Pruning via Similarity and Difference

    cs.AI 2026-04 unverdicted novelty 7.0

    SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

  7. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  8. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  9. A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

    cs.AR 2026-03 unverdicted novelty 7.0

    SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

  10. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  13. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  14. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  15. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  16. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  17. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  18. FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

    cs.LG 2026-04 unverdicted novelty 6.0

    FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.

  19. LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.

  20. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  21. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

  22. Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

    cs.LG 2026-04 unverdicted novelty 6.0

    DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

  23. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    cs.CL 2026-04 conditional novelty 6.0

    Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...

  24. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  25. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  26. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  27. SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

  28. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  29. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  30. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    cs.CL 2024-06 unverdicted novelty 6.0

    FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

  31. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  32. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  33. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  34. Efficient Streaming Language Models with Attention Sinks

    cs.CL 2023-09 accept novelty 6.0

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  35. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  36. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  37. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  38. TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

    cs.LG 2026-05 unverdicted novelty 5.0

    TileQ applies 2D-tiled low-rank quantization to MoE experts and fuses computations for up to 10x lower memory overhead and ~5% inference latency while keeping accuracy.

  39. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  40. Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

    cs.AI 2026-05 unverdicted novelty 5.0

    A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.

  41. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  42. Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

  43. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  44. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  45. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  46. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  47. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 47 Pith papers · 2 internal anchors

  1. [1]

    Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In ICLR. ICLR

  2. [2]

    Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1657--1668

  3. [3]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  4. [4]

    Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking nli systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650--655

  5. [5]

    Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 25--30. ACM

  6. [6]

    Bowman, and Noah A

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL

  7. [7]

    Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751

  8. [8]

    Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021--2031

  9. [9]

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 427--431

  10. [10]

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense- Captioning Events in Videos . In International Conference on Computer Vision ( ICCV )

  11. [11]

    Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227--2237

  12. [12]

    Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180--191

  13. [13]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://blog.openai.com/language-unsupervised/ Improving language understanding by generative pre-training . Technical report, OpenAI

  14. [14]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://openai.com/blog/better-language-models/ Language models are unsupervised multitask learners . Technical report, OpenAI

  15. [15]

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. https://doi.org/10.1007/s11263-016-0987-1 Movie Description . International Journal of Computer Vision, 123(1):94--120

  16. [16]

    Rachel Rudinger, Vera Demberg, Ashutosh Modi, Benjamin Van Durme, and Manfred Pinkal. 2015. Learning to predict script events from domain-specific text. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics , pages 205--210

  17. [17]

    Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Technical report, Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States)

  18. [18]

    Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  19. [19]

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In arXiv preprint arXiv:1506.06724