BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Pith reviewed 2026-05-13 09:42 UTC · model grok-4.3
The pith
Natural yes/no questions prove harder for models than expected even after strong pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BoolQ consists of yes/no questions generated in unprompted settings, each paired with a Wikipedia passage. Answering them often requires difficult entailment-like inference over non-factoid information. Training on MultiNLI before fine-tuning on BoolQ is the most effective transfer strategy tested, and it continues to help even when the starting point is a large pre-trained model such as BERT. This procedure yields 80.4 percent accuracy, against a 62 percent majority baseline and 90 percent for human annotators, leaving a gap of roughly ten points.
What carries the argument
The BoolQ dataset of naturally occurring yes/no questions, used to measure how well models perform complex inference beyond fact lookup.
If this is right
- Transfer from MultiNLI data improves accuracy on BoolQ more than transfer from paraphrase or extractive QA data.
- Even BERT continues to benefit from an intermediate MultiNLI training stage before fine-tuning on BoolQ.
- Natural yes/no questions frequently require non-factoid information and entailment-style inference rather than direct span extraction.
- A performance gap of roughly ten points remains between the best model and human annotators.
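The size of the gaps listed above follows directly from the three accuracy figures quoted on this page. The snippet below is a minimal sketch of that arithmetic; the helper `majority_baseline` and the toy label list are hypothetical illustrations, not the paper's evaluation code or data.

```python
# Accuracy figures quoted on this page, in percent.
BEST_MODEL = 80.4  # BERT trained on MultiNLI, then on the BoolQ train set
HUMAN = 90.0       # human annotators
MAJORITY = 62.0    # always predicting the most common answer


def majority_baseline(labels):
    """Accuracy (percent) of always guessing the most frequent boolean label."""
    yes = sum(1 for y in labels if y)
    return 100.0 * max(yes, len(labels) - yes) / len(labels)


# Points still separating the best model from human annotators.
gap_to_human = round(HUMAN - BEST_MODEL, 1)

# Points gained over the trivial majority baseline.
gain_over_majority = round(BEST_MODEL - MAJORITY, 1)

# Share of the majority-to-human range that the best model closes.
headroom_closed = round(100 * (BEST_MODEL - MAJORITY) / (HUMAN - MAJORITY), 1)

# Hypothetical toy label list with a 62/38 split, for illustration only.
toy_labels = [True] * 62 + [False] * 38

print(gap_to_human, gain_over_majority, headroom_closed)
print(majority_baseline(toy_labels))
```

Read this way, the best reported model covers about two thirds of the distance between the majority baseline and human accuracy.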
Where Pith is reading between the lines
- Success on BoolQ would likely improve models on other realistic query types that mix reasoning with passage understanding.
- The pattern that entailment pre-training helps question answering may apply to additional tasks that hinge on implicit inference.
- Datasets built from unprompted user questions could expose similar gaps in other language-understanding benchmarks.
Load-bearing premise
That the collected questions faithfully represent the distribution of yes/no questions arising in everyday language use, and that the provided answers contain little ambiguity or annotator bias.
What would settle it
A model trained without any entailment data that reaches or exceeds 90 percent accuracy on the BoolQ test set would undermine the claim that these natural questions systematically demand harder reasoning than current techniques can supply.
Original abstract
In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BoolQ, a reading comprehension dataset of naturally occurring yes/no questions paired with Wikipedia passages. It argues that these questions are unexpectedly difficult, often requiring complex non-factoid inference akin to textual entailment. The authors evaluate transfer learning baselines and find that training BERT on MultiNLI and then fine-tuning on BoolQ achieves 80.4% accuracy, compared to a 62% majority baseline and 90% human accuracy, leaving a substantial gap for future work.
Significance. If the dataset construction and labels are reliable, BoolQ provides a useful benchmark highlighting limitations of current models on natural yes/no questions even after strong pre-training. The empirical finding that entailment transfer outperforms paraphrase or extractive QA transfer is a concrete, actionable result that could guide future QA and NLI research. The work ships a new dataset with concrete accuracy numbers and baseline comparisons, which strengthens its contribution as an empirical resource.
major comments (3)
- [§3] §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.
- [§4.2] §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.
- [Table 2] Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.
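The first major comment asks for inter-annotator agreement statistics. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes that statistic purely for illustration; the report does not say which measure the authors would use, and the function name and toy annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations, not taken from BoolQ.
ann_1 = ["yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]
ann_2 = ["yes", "yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes"]

kappa = cohens_kappa(ann_1, ann_2)
print(f"kappa = {kappa:.3f}")
```

On the toy data the annotators agree on 8 of 10 items, but chance agreement is 0.52, so kappa is noticeably lower than the raw 0.8. This is why raw accuracy against a single reference label says little about label noise on its own.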
minor comments (2)
- [Abstract] The abstract and introduction use 'surprising difficulty' without a quantitative comparison to prior yes/no QA datasets (e.g., on SQuAD or NewsQA yes/no subsets); adding this would better motivate the contribution.
- [Figure 1] Figure 1 (example questions) would benefit from explicit annotation of the inference steps required, to illustrate the 'entailment-like' nature claimed in the text.
Simulated Author's Rebuttal
We thank the referee for the valuable comments. We address each of the major comments below and have made revisions to the manuscript to incorporate additional details on dataset construction, experimental variance, and performance breakdowns.
Point-by-point responses
- Referee: [§3] §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.
  Authors: We agree that inter-annotator agreement and details on the annotation process are important to establish label quality. Although not reported in the initial submission, we have now computed inter-annotator agreement on a sample of the data and included the statistics, the adjudication protocol, and an analysis of ambiguous cases in the revised manuscript. These additions show that agreement is high and ambiguous cases are few, supporting that the gap is not primarily due to label noise.
  Revision: yes
- Referee: [§4.2] §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.
  Authors: We acknowledge that reporting variance and statistical significance would make the transfer learning results more convincing. We have conducted additional experiments across multiple random seeds and performed significance testing. The revised manuscript now includes the standard deviations and confirms that the improvement from MultiNLI pre-training is statistically significant.
  Revision: yes
- Referee: [Table 2] Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.
  Authors: We agree that breaking down the results would help identify where models struggle. We have added such an analysis to the revised paper, including performance by question type and passage length. This breakdown reveals that the performance gap persists across different categories, indicating a broad challenge rather than localized issues.
  Revision: yes
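The exchange on §4.2 turns on seed-to-seed variance and a significance test for the MultiNLI improvement. Neither the report nor the rebuttal names a specific test, so the following is only a sketch of one reasonable option: per-configuration mean and standard deviation, plus an exact one-sided permutation test on the difference of means. All accuracy values are made-up placeholders, and `permutation_p_value` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean, stdev


def permutation_p_value(baseline, treated):
    """One-sided exact permutation test on the difference of means.

    Returns the fraction of relabelings whose mean difference is at least
    as large as the observed one (treated minus baseline).
    """
    pooled = baseline + treated
    k = len(treated)
    total = sum(pooled)
    observed = mean(treated) - mean(baseline)
    hits = 0
    count = 0
    for idx in combinations(range(len(pooled)), k):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / k - (total - t_sum) / (len(pooled) - k)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1
        count += 1
    return hits / count


# Placeholder accuracies for five seeds each; NOT results from the paper.
baseline_runs = [0.71, 0.73, 0.72, 0.70, 0.74]  # no intermediate training
transfer_runs = [0.76, 0.78, 0.77, 0.75, 0.79]  # with intermediate training

p_value = permutation_p_value(baseline_runs, transfer_runs)
print(f"baseline: {mean(baseline_runs):.3f} +/- {stdev(baseline_runs):.3f}")
print(f"transfer: {mean(transfer_runs):.3f} +/- {stdev(transfer_runs):.3f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

With five seeds per configuration there are only 252 possible relabelings, so the smallest attainable p-value is 1/252. More seeds would be needed to support a stronger statement.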
Circularity Check
No circularity: purely empirical dataset and evaluation
full rationale
The paper constructs a new reading-comprehension dataset BoolQ from naturally occurring yes/no questions and reports direct empirical accuracies (80.4% for the best BERT+MultiNLI transfer baseline versus 90% human and 62% majority). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; performance figures are measured on held-out test data after standard training, with no reduction to inputs by construction. Human annotation serves as an external benchmark rather than a self-referential definition. The work is therefore self-contained against external data and does not trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human annotators provide reliable ground-truth labels for yes/no questions.
Lean theorems connected to this paper
- Foundation.RealityFromDistinction, reality_from_one_distinction (unclear): "In this paper we study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings."
Forward citations
Cited by 26 Pith papers
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
  Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
- Winner-Take-All Spiking Transformer for Language Modeling
  Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
- A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
  SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
- Multitask Prompted Training Enables Zero-Shot Task Generalization
  Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
- SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
  SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
- Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
  MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
  Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
- Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
  A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...
- GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
  GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
- River-LLM: Large Language Model Seamless Exit Based on KV Share
  River-LLM enables seamless token-level early exit in decoder-only LLMs via a KV-shared river mechanism and similarity-based error prediction, delivering 1.71-2.16x practical speedup on reasoning tasks while preserving...
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models
  TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
  DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
- Rethinking Residual Errors in Compensation-based LLM Quantization
  Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
- SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
  SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
- Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters
  PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
- Attention to Mamba: A Recipe for Cross-Architecture Distillation
  A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
- MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
  MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
- Adaptive Spiking Neurons for Vision and Language Modeling
  ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Gemma: Open Models Based on Gemini Research and Technology
  Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
- Gemma 2: Improving Open Language Models at a Practical Size
  Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.