Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Pith reviewed 2026-05-13 08:10 UTC · model grok-4.3
The pith
Many state-of-the-art pre-trained QA methods perform worse than simple neural baselines on questions that combine science facts with common knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenBookQA requires a model to select the right fact from a small open book and combine it with external common knowledge to answer questions about novel situations. Human solvers achieve close to 92 percent accuracy, but many state-of-the-art pre-trained QA methods perform surprisingly poorly and fall below several simple neural baselines developed in the paper. Oracle experiments that remove the retrieval step demonstrate the value of both the open-book facts and the additional common-knowledge facts.
What carries the argument
The OpenBookQA dataset, which supplies a compact set of science facts and forces models to retrieve one fact and integrate it with unstated common knowledge.
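To make the retrieve-then-combine step concrete, here is a minimal sketch, assuming plain token overlap as the retrieval signal. It is not the paper's method or any of its baselines. The two example facts about metal come from the paper's abstract; the helper names, the stop-word list, and the other two facts are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "is", "can", "to"}

def tokens(text):
    """Lowercase word tokens with stop words removed."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}

def retrieve(query_tokens, facts):
    """Return the fact that shares the most tokens with the query."""
    return max(facts, key=lambda fact: len(query_tokens & tokens(fact)))

def two_hop(question, open_book, common_knowledge):
    """Hop 1: pick an open-book fact. Hop 2: pick a bridging fact
    for the question tokens the open-book fact does not cover."""
    q = tokens(question)
    book_fact = retrieve(q, open_book)
    uncovered = q - tokens(book_fact)
    bridge = retrieve(uncovered, common_knowledge)
    return book_fact, bridge

# Toy knowledge sources (assumed for the example, not dataset contents).
open_book = ["plants need sunlight to grow", "metals conduct electricity"]
common_knowledge = ["a cactus is a plant", "a suit of armor is made of metal"]

book_fact, bridge = two_hop(
    "Can a suit of armor conduct electricity?", open_book, common_knowledge
)
print(book_fact)
print(bridge)
```

Note that "metal" and "metals" never match under exact token overlap, so nothing in this sketch links the two retrieved facts to each other. That brittleness is one plausible reason the retrieval step is hard.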
If this is right
- Pre-trained QA systems have a measurable deficit when forced to integrate retrieved facts with external common knowledge.
- Simple neural baselines remain competitive and sometimes superior on this style of question.
- Supplying the correct fact in an oracle setting lifts performance, confirming that both the fact and the additional knowledge are load-bearing.
- Solving multi-hop retrieval over a small knowledge base plus outside facts is the main remaining obstacle to human-level results.
Where Pith is reading between the lines
- The gap may persist even with larger pre-training unless models gain explicit mechanisms for pulling in unstated facts.
- The same open-book-plus-common-knowledge format could be applied to other subjects to test whether current methods generalize beyond pattern matching.
- Small, curated fact sets paired with targeted questions may expose reasoning limits that large unstructured corpora obscure.
Load-bearing premise
The questions cannot be solved by linguistic patterns or surface cues alone and genuinely require combining the stated fact with outside common knowledge.
What would settle it
A model that reaches near-human accuracy while denied access to the open-book facts or while relying only on question wording would show the dataset does not test the intended integration.
Original abstract
We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic (in the context of common knowledge) and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenBookQA, a new QA dataset modeled on open-book exams, consisting of 1329 elementary science facts and approximately 6000 multiple-choice questions. The questions are designed to require combining a provided fact with external common knowledge. The paper reports that state-of-the-art pre-trained QA models perform poorly on this dataset, underperforming several simple neural baselines developed by the authors, while humans reach ~92% accuracy. Oracle experiments that supply the relevant facts demonstrate their value and highlight the retrieval challenge.
Significance. If the questions genuinely require multi-hop integration of the open-book facts with common knowledge, this dataset provides a valuable benchmark for advancing QA systems beyond pattern matching toward deeper reasoning. The release of the facts, questions, and baselines is a concrete contribution that can be used immediately by the community.
major comments (1)
- [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.
minor comments (2)
- [Dataset] In the dataset construction section, the process for ensuring that each question requires the specific open-book fact (rather than being answerable from the question text alone) could be described more explicitly, including any filtering steps applied after crowdsourcing.
- [Introduction] Figure 1 (example question) would benefit from an additional row showing the model predictions of the simple baselines versus the SOTA systems to illustrate the performance gap visually.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The suggestion to include an explicit cue-only baseline is valuable for strengthening the interpretation of our results, and we will revise the manuscript to address this.
-
Referee: [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.
Authors: We agree that this control experiment is important for isolating the contribution of knowledge integration. In the revised manuscript, we will add results for all models (including the SOTA pre-trained QA systems and our simple neural baselines) when trained and evaluated on question text plus answer choices only, with no facts from the open book provided. This will allow us to quantify the performance attributable to surface patterns or annotation artifacts versus the need to combine the open-book facts with common knowledge. We will also update the discussion and oracle analysis sections to reference these new numbers. revision: yes
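The control discussed above can be sketched as a small evaluation harness that scores the same examples with and without the open-book fact. This is only an illustration under stated assumptions: the two examples are invented toy data, not items from OpenBookQA; the field names and the overlap scorer are hypothetical; and the harness only evaluates, whereas the response above also involves retraining the models on question text plus answer choices.

```python
import re

def tokens(text):
    """Lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_scorer(context, choice):
    """Toy scorer: number of tokens shared by the context and a choice."""
    return len(tokens(context) & tokens(choice))

def accuracy(examples, scorer, use_fact):
    """Fraction answered correctly; the fact is appended only if use_fact."""
    correct = 0
    for ex in examples:
        context = ex["question"]
        if use_fact:
            context += " " + ex["fact"]
        scores = [scorer(context, choice) for choice in ex["choices"]]
        prediction = scores.index(max(scores))  # ties go to the first choice
        correct += int(prediction == ex["answer"])
    return correct / len(examples)

# Invented toy examples (hypothetical schema, not the dataset's format).
examples = [
    {   # needs the fact: no choice overlaps with the question wording
        "question": "Which material lets electricity flow?",
        "choices": ["wood", "rubber", "metals", "cotton"],
        "fact": "metals conduct electricity",
        "answer": 2,
    },
    {   # solvable from surface cues: the right choice repeats question words
        "question": "A plant placed in sunlight will most likely do what?",
        "choices": ["melt", "the plant will grow", "freeze", "rust"],
        "fact": "plants need sunlight to grow",
        "answer": 1,
    },
]

cue_only = accuracy(examples, overlap_scorer, use_fact=False)
with_fact = accuracy(examples, overlap_scorer, use_fact=True)
print(cue_only, with_fact)
```

In this toy setup the second example is answered correctly even without its fact, which is the kind of surface-pattern leakage a cue-only baseline is meant to expose. The gap between the two numbers is what the control would quantify.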
Circularity Check
No circularity: empirical dataset introduction and benchmarking
The paper introduces the OpenBookQA dataset and reports direct empirical evaluations of QA methods against it, human performance, and simple baselines. No equations, parameter fittings, derivations, or self-citations form any load-bearing chain that reduces results to inputs by construction. Claims rest on new data collection and standard accuracy measurements, which are externally verifiable and independent of the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the provided elementary science facts are accurate and sufficient when combined with common knowledge.
Forward citations
Cited by 60 Pith papers
-
CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs
CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without th...
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Beyond Prediction: Tail-Aware Scheduling for LLM Inference
Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
-
Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy
A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as...
-
Parameter-Efficient Fine-Tuning with Learnable Rank
LR-LoRA learns per-layer adapter ranks during training and reports outperforming fixed-rank LoRA and other PEFT baselines on language understanding and commonsense reasoning tasks.
-
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference
A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
-
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training
D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
-
Winner-Take-All Spiking Transformer for Language Modeling
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
Path-Constrained Mixture-of-Experts
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
-
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
-
Deep Delta Learning
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accu...
-
Scaling Latent Reasoning via Looped Language Models
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
-
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
-
PRIMETIME: Limits of LLMs in Temporal Primitives
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
-
Federated Co-tuning Framework for Large and Small Language Models
FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
SpinQuant: LLM quantization with learned rotations
SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
-
Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
R2LM combines causal attention with a reverse Mamba SSM sidecar to supply right-side context in dLLMs, claiming 2.4x-12.9x throughput gains over bidirectional dLLMs and 1.9x-2.9x over AR baselines while matching or ex...
-
BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training
BLADE converts influence-based bi-level data selection into a Hessian-free penalized objective with a dynamic reference model, proves first-order convergence, and reports better performance than prior methods on LLM training.
-
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
LC-QAT achieves data-efficient 2-bit weight-only QAT for LLMs by representing quantized weights as a learned affine transform over discrete vectors, supporting end-to-end optimization from a high-quality PTQ start.
-
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
LiftQuant enables continuous bit-width LLM quantization via dimensional lifting and projection from a 1-bit lattice, allowing 2.4-bit compression of 70B models that outperforms fixed 2-bit baselines on identical hardware.
-
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
LiftQuant uses dimensional lifting of weights to a higher-dimensional 1-bit lattice followed by projection to achieve tunable continuous bit-widths in LLM quantization while remaining hardware-friendly.
-
Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
-
Towards Efficient LLMs Annealing with Principled Sample Selection
DiReCT reformulates LLM annealing sample selection as a constrained optimization problem that enforces per-sample gradient directions aligned with the loss landscape's curvature.
-
More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Mixture of Activations mixes activation functions token-adaptively in FFNs via lightweight gates, strictly more expressive than fixed or learnable activations, and yields lower pretraining loss from 0.12B to 2B models.
-
BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
BitsMoE uses SVD decomposition and activation-aware ILP bit allocation to quantize MoE LLMs at ultra-low bits with reduced accuracy degradation compared to GPTQ.
-
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
-
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
-
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization
SpecQuant uses outlier smoothing into weights followed by channel-wise low-frequency Fourier truncation to achieve 4-bit quantization of LLaMA-3 8B with only 1.5% zero-shot accuracy loss versus full precision.
-
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B...
-
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
-
HyperAdapt: Simple High-Rank Adaptation
HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
-
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
-
LaMI: Augmenting Large Language Models via Late Multi-Image Fusion
LaMI augments LLMs with visual commonsense via late fusion of predictions from multiple text-generated images, outperforming prior augmented LLMs on visual tasks while matching VLMs and preserving or improving NLP per...
-
An Empirical Study of Mamba-based Language Models
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
-
Lessons from the Trenches on Reproducible Evaluation of Language Models
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.