pith. machine review for the scientific record.

arxiv: 1905.10044 · v1 · submitted 2019-05-24 · 💻 cs.CL

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Kristina Toutanova, Michael Collins, Ming-Wei Chang, Tom Kwiatkowski

Pith reviewed 2026-05-13 09:42 UTC · model grok-4.3

classification: cs.CL
keywords: yes/no questions, reading comprehension, natural language inference, transfer learning, question answering, BERT, BoolQ

The pith

Natural yes/no questions prove harder for models than expected even after strong pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds BoolQ, a reading comprehension dataset drawn from naturally occurring yes/no questions rather than crowdsourced prompts. These questions typically ask for complex inferences that resemble entailment judgments instead of simple fact extraction from the passage. Transfer from entailment corpora such as MultiNLI improves results more than transfer from paraphrase or extractive QA sources. The benefit persists even when the starting point is a large pre-trained model like BERT. The strongest system reaches 80.4 percent accuracy, against 90 percent for human annotators and 62 percent for a majority baseline.
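The three headline numbers can be related with a little arithmetic. The Python sketch below computes a majority baseline from a list of labels, then the share of the baseline-to-human headroom that the best reported model closes. The label list is a hypothetical 62/38 split chosen only to mirror the reported 62 percent figure (which answer forms the majority class is an assumption here), and the function and variable names are illustrative rather than taken from the paper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy labels: a 62/38 split that mirrors the reported baseline.
# These are NOT the actual BoolQ labels.
toy_labels = [True] * 62 + [False] * 38
baseline = majority_baseline_accuracy(toy_labels)

best_model = 0.804  # best reported system (BERT trained on MultiNLI, then BoolQ)
human = 0.90        # reported human annotator accuracy

# Absolute gap that remains between the best model and human annotators.
gap_to_human = human - best_model

# Fraction of the baseline-to-human headroom that the best model closes.
headroom_closed = (best_model - baseline) / (human - baseline)

print(f"majority baseline: {baseline:.3f}")
print(f"gap to human:      {gap_to_human:.3f}")
print(f"headroom closed:   {headroom_closed:.3f}")
```

Read this way, the best model closes roughly two thirds of the distance between the majority baseline and human annotators, with 9.6 points remaining.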

Core claim

BoolQ consists of yes/no questions generated in unprompted settings paired with Wikipedia passages. Solving them often requires difficult entailment-like reasoning over non-factoid information. Training on MultiNLI before fine-tuning on BoolQ is the most effective transfer strategy, and it continues to help even when the model begins as BERT. This procedure yields 80.4 percent accuracy, leaving a sizable gap relative to the 90 percent human ceiling.

What carries the argument

The BoolQ dataset of naturally occurring yes/no questions, used to measure how well models perform complex inference beyond fact lookup.

If this is right

  • Transfer from MultiNLI data improves accuracy on BoolQ more than transfer from paraphrase or extractive QA data.
  • Even BERT continues to benefit from an intermediate MultiNLI training stage before fine-tuning on BoolQ.
  • Natural yes/no questions frequently require non-factoid information and entailment-style inference rather than direct span extraction.
  • A performance gap of roughly ten points remains between the best model and human annotators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Success on BoolQ would likely improve models on other realistic query types that mix reasoning with passage understanding.
  • The pattern that entailment pre-training helps question answering may apply to additional tasks that hinge on implicit inference.
  • Datasets built from unprompted user questions could expose similar gaps in other language-understanding benchmarks.

Load-bearing premise

That the collected questions faithfully represent the distribution of yes/no questions arising in everyday language use, and that the provided answers contain little ambiguity or annotator bias.

What would settle it

A model trained without any entailment data that reaches or exceeds 90 percent accuracy on the BoolQ test set would undermine the claim that these natural questions systematically demand harder reasoning than current techniques can supply.

Original abstract

In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces BoolQ, a reading comprehension dataset of naturally occurring yes/no questions paired with Wikipedia passages. It argues that these questions are unexpectedly difficult, often requiring complex non-factoid inference akin to textual entailment. The authors evaluate transfer learning baselines and find that training BERT on MultiNLI before fine-tuning on BoolQ achieves 80.4% accuracy, compared to a 62% majority baseline and 90% human performance, leaving a substantial gap for future work.

Significance. If the dataset construction and labels are reliable, BoolQ provides a useful benchmark highlighting limitations of current models on natural yes/no questions even after strong pre-training. The empirical finding that entailment transfer outperforms paraphrase or extractive QA transfer is a concrete, actionable result that could guide future QA and NLI research. The work ships a new dataset with concrete accuracy numbers and baseline comparisons, which strengthens its contribution as an empirical resource.

major comments (3)
  1. §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.
  2. §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.
  3. Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.
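The agreement statistic requested in major comment 1 could take several forms. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal pure-Python version; the two label lists are invented for illustration and are not drawn from BoolQ, and the paper is not known to have used this particular statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for ten questions (illustrative only).
annotator_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
annotator_2 = ["yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes", "yes"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

On this toy input the annotators agree on 8 of 10 items while chance agreement is 0.52, so kappa is well below the raw agreement rate. That difference is why a bare accuracy figure such as 90% says little about label reliability on its own.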
minor comments (2)
  1. Abstract: The abstract and introduction use 'surprising difficulty' without a quantitative comparison to prior yes/no QA datasets (e.g., on SQuAD or NewsQA yes/no subsets); adding this would better motivate the contribution.
  2. Figure 1 (example questions) would benefit from explicit annotation of the inference steps required, to illustrate the 'entailment-like' nature claimed in the text.
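Returning to major comment 2, a report of seed variance could be as simple as a mean and sample standard deviation per configuration, plus a two-sample statistic comparing them. The sketch below uses only the Python standard library and computes Welch's t-statistic; it stops short of a p-value, which would need a t-distribution from a library such as SciPy. The accuracy lists are placeholder values invented for illustration, not results from the paper, and the names are hypothetical.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(scores_a, scores_b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    standard_error = math.sqrt(var_a / len(scores_a) + var_b / len(scores_b))
    return (mean_a - mean_b) / standard_error


# Placeholder accuracies over five seeds (NOT measured results).
bert_only = [0.768, 0.772, 0.765, 0.775, 0.770]
bert_then_mnli = [0.801, 0.806, 0.799, 0.804, 0.805]

mean_base, std_base = summarize(bert_only)
mean_mnli, std_mnli = summarize(bert_then_mnli)
t_stat = welch_t(bert_then_mnli, bert_only)

print(f"BERT only:        {mean_base:.3f} +/- {std_base:.3f}")
print(f"BERT + MultiNLI:  {mean_mnli:.3f} +/- {std_mnli:.3f}")
print(f"Welch t:          {t_stat:.2f}")
```

A large positive statistic on real runs would support the transfer claim; with only a single reported number per configuration, no such statement can be made.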

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the valuable comments. We address each of the major comments below and have made revisions to the manuscript to incorporate additional details on dataset construction, experimental variance, and performance breakdowns.

Point-by-point responses
  1. Referee: §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.

    Authors: We agree that inter-annotator agreement and details on the annotation process are important to establish label quality. Although not reported in the initial submission, we have now computed inter-annotator agreement on a sample of the data and included the statistics, the adjudication protocol, and an analysis of ambiguous cases in the revised manuscript. These additions show that agreement is high and ambiguous cases are few, supporting that the gap is not primarily due to label noise.
    Revision: yes

  2. Referee: §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.

    Authors: We acknowledge that reporting variance and statistical significance would make the transfer learning results more convincing. We have conducted additional experiments across multiple random seeds and performed significance testing. The revised manuscript now includes the standard deviations and confirms that the improvement from MultiNLI pre-training is statistically significant.
    Revision: yes

  3. Referee: Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.

    Authors: We agree that breaking down the results would help identify where models struggle. We have added such an analysis to the revised paper, including performance by question type and passage length. This breakdown reveals that the performance gap persists across different categories, indicating a broad challenge rather than localized issues.
    Revision: yes
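The breakdown discussed in the third exchange amounts to grouping predictions by a category and computing accuracy within each group. A minimal Python sketch follows, assuming each evaluated example carries a category tag; the category names and the records are hypothetical and are not taken from the paper or its revision.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Map each group name to accuracy over (group, gold, predicted) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, gold, predicted in records:
        totals[group] += 1
        correct[group] += int(gold == predicted)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation records: (question type, gold answer, model answer).
records = [
    ("factoid", True, True),
    ("factoid", False, False),
    ("factoid", True, True),
    ("factoid", False, True),
    ("inference", True, False),
    ("inference", True, True),
    ("inference", False, True),
    ("inference", False, False),
]

breakdown = accuracy_by_group(records)
for group, accuracy in sorted(breakdown.items()):
    print(f"{group:10s} {accuracy:.2f}")
```

The same function works for passage length once lengths are bucketed into named bins, for example "short" and "long", before being passed in as the group field.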

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and evaluation

Full rationale

The paper constructs a new reading-comprehension dataset BoolQ from naturally occurring yes/no questions and reports direct empirical accuracies (80.4% for the best BERT+MultiNLI transfer baseline versus 90% human and 62% majority). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; performance figures are measured on held-out test data after standard training, with no reduction to inputs by construction. Human annotation serves as an external benchmark rather than a self-referential definition. The work is therefore self-contained against external data and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard NLP assumptions about annotation quality and model transferability rather than new postulates.

axioms (1)
  • domain assumption: Human annotators provide reliable ground-truth labels for yes/no questions.
    Invoked when reporting 90% human accuracy as the upper bound.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  3. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  4. A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

    cs.AR 2026-03 unverdicted novelty 7.0

    SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

  5. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  6. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  7. SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

    cs.LG 2026-05 unverdicted novelty 6.0

    SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

  8. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  9. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  10. Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

    cs.CL 2026-05 unverdicted novelty 6.0

    A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...

  11. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...

  12. River-LLM: Large Language Model Seamless Exit Based on KV Share

    cs.CL 2026-04 unverdicted novelty 6.0

    River-LLM enables seamless token-level early exit in decoder-only LLMs via a KV-shared river mechanism and similarity-based error prediction, delivering 1.71-2.16x practical speedup on reasoning tasks while preserving...

  13. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  14. Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

    cs.LG 2026-04 unverdicted novelty 6.0

    DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

  15. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  16. SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

  17. Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

    cs.LG 2026-04 unverdicted novelty 6.0

    PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.

  18. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  19. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    cs.AI 2024-08 unverdicted novelty 6.0

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  20. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  21. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  22. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  23. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  24. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  25. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  26. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
