Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Ashish Sabharwal; Peter Clark; Todor Mihaylov; Tushar Khot

arxiv: 1809.02789 · v1 · submitted 2018-09-08 · 💻 cs.CL

Pith reviewed 2026-05-13 08:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords question answering · open book QA · common knowledge · science facts · dataset · multi-hop reasoning · pre-trained models · baselines

The pith

Many state-of-the-art pre-trained QA methods perform worse than simple neural baselines on questions that combine science facts with common knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenBookQA, a dataset modeled on open-book exams that supplies 1329 elementary science facts and pairs them with roughly 6000 questions. Each question requires retrieving one fact and applying it to a new situation using everyday knowledge that is not stated in the fact itself. Humans reach close to 92 percent accuracy, yet many advanced pre-trained QA systems score lower than basic neural models built for the task. Oracle tests that supply the correct fact show that both the provided knowledge and additional common-sense facts matter. The work frames retrieval in this multi-hop setting as the central unsolved problem.

Core claim

OpenBookQA requires a model to select the right fact from a small open book and combine it with external common knowledge to answer questions about novel situations. Human solvers achieve close to 92 percent accuracy, but many state-of-the-art pre-trained QA methods perform surprisingly poorly and fall below several simple neural baselines developed in the paper. Oracle experiments that remove the retrieval step demonstrate the value of both the open-book facts and the additional common-knowledge facts.

What carries the argument

The OpenBookQA dataset, which supplies a compact set of science facts and forces models to retrieve one fact and integrate it with unstated common knowledge.
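
To make the retrieval step concrete, here is a minimal sketch of picking one fact from a small open book by word overlap with the question. It is not the paper's retrieval method. The function names, the sample question, and every fact except "metals conduct electricity" (which comes from the abstract) are assumptions made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, facts):
    """Return the fact sharing the most tokens with the question (first wins on ties)."""
    q = tokenize(question)
    return max(facts, key=lambda f: len(q & tokenize(f)))

# Only the first fact appears in the paper's abstract; the others are invented.
OPEN_BOOK = [
    "metals conduct electricity",
    "plants need sunlight to grow",
    "ice melts when heated",
]

# Hypothetical question, written for this sketch.
question = "Which of these would let the most electricity travel through?"
best = retrieve(question, OPEN_BOOK)
print(best)
```

Even when overlap finds the right fact, the word "armor" in a question never matches "metals" in the fact, so a second piece of knowledge from outside the book is still needed to reach the answer.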

If this is right

Pre-trained QA systems have a measurable deficit when forced to integrate retrieved facts with external common knowledge.
Simple neural baselines remain competitive and sometimes superior on this style of question.
Supplying the correct fact in an oracle setting lifts performance, confirming that both the fact and the additional knowledge are load-bearing.
Solving multi-hop retrieval over a small knowledge base plus outside facts is the main remaining obstacle to human-level results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gap may persist even with larger pre-training unless models gain explicit mechanisms for pulling in unstated facts.
The same open-book-plus-common-knowledge format could be applied to other subjects to test whether current methods generalize beyond pattern matching.
Small, curated fact sets paired with targeted questions may expose reasoning limits that large unstructured corpora obscure.

Load-bearing premise

The questions cannot be solved by linguistic patterns or surface cues alone and genuinely require combining the stated fact with outside common knowledge.

What would settle it

A model that reaches near-human accuracy while denied access to the open-book facts or while relying only on question wording would show the dataset does not test the intended integration.

Original abstract

We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic, in the context of common knowledge, and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.
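
The two-hop combination described in the abstract can be sketched as a pair of lookups. This is a toy illustration under stated assumptions, not how any system in the paper works: the dictionary layout and the `two_hop` helper are hypothetical, and the "wooden spoon" entry is invented to show a case the open book does not cover.

```python
# Open-book fact, from the abstract: metals conduct electricity.
OPEN_BOOK = {"metal": "conducts electricity"}

# Common knowledge from outside the book. The armor entry is from the abstract;
# the wooden spoon entry is invented for this sketch.
COMMON_KNOWLEDGE = {
    "suit of armor": "metal",
    "wooden spoon": "wood",
}

def two_hop(entity):
    """Hop 1: entity -> material (common knowledge). Hop 2: material -> property (open book)."""
    material = COMMON_KNOWLEDGE.get(entity)
    if material is None:
        return None  # the outside knowledge is missing, so the chain cannot start
    return OPEN_BOOK.get(material)  # None if the book has no fact about this material

print(two_hop("suit of armor"))
print(two_hop("wooden spoon"))
```

Neither table alone answers the armor question. The answer only appears when the outside fact links the entity to something the open book talks about.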

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces OpenBookQA, a new QA dataset modeled on open-book exams, consisting of 1329 elementary science facts and approximately 6000 multiple-choice questions. The questions are designed to require combining a provided fact with external common knowledge. The paper reports that state-of-the-art pre-trained QA models perform poorly on this dataset, underperforming several simple neural baselines developed by the authors, while humans reach ~92% accuracy. Oracle experiments that supply the relevant facts demonstrate their value and highlight the retrieval challenge.

Significance. If the questions genuinely require multi-hop integration of the open-book facts with common knowledge, this dataset provides a valuable benchmark for advancing QA systems beyond pattern matching toward deeper reasoning. The release of the facts, questions, and baselines is a concrete contribution that can be used immediately by the community.

major comments (1)

[Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.

minor comments (2)

[Dataset] In the dataset construction section, the process for ensuring that each question requires the specific open-book fact (rather than being answerable from the question text alone) could be described more explicitly, including any filtering steps applied after crowdsourcing.
[Introduction] Figure 1 (example question) would benefit from an additional row showing the model predictions of the simple baselines versus the SOTA systems to illustrate the performance gap visually.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The suggestion to include an explicit cue-only baseline is valuable for strengthening the interpretation of our results, and we will revise the manuscript to address this.

Referee: [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.

Authors: We agree that this control experiment is important for isolating the contribution of knowledge integration. In the revised manuscript, we will add results for all models (including the SOTA pre-trained QA systems and our simple neural baselines) when trained and evaluated on question text plus answer choices only, with no facts from the open book provided. This will allow us to quantify the performance attributable to surface patterns or annotation artifacts versus the need to combine the open-book facts with common knowledge. We will also update the discussion and oracle analysis sections to reference these new numbers.
revision: yes
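
The control discussed in this exchange can be sketched as a small evaluation harness that scores the same predictor on two input conditions: question plus choices only, and question plus choices with the gold fact supplied (the oracle setting). Everything here is an assumption made for illustration: the input format, the four-choice layout, the example questions, and `toy_predict`, which is a stand-in for a real model and not a baseline from the paper.

```python
def build_input(question, choices, fact=None):
    """Render one example as text; the fact line is omitted in the cue-only condition."""
    parts = []
    if fact is not None:
        parts.append(f"fact: {fact}")
    parts.append(f"question: {question}")
    parts.extend(f"({label}) {text}" for label, text in zip("ABCD", choices))
    return "\n".join(parts)

def accuracy(predict, examples, use_fact):
    """Fraction of examples where predict() returns the gold answer label."""
    correct = 0
    for ex in examples:
        fact = ex["fact"] if use_fact else None
        correct += predict(build_input(ex["question"], ex["choices"], fact)) == ex["answer"]
    return correct / len(examples)

def toy_predict(text):
    """Hypothetical model: pick the choice sharing most words with the fact, else guess A."""
    lines = text.split("\n")
    if not lines[0].startswith("fact: "):
        return "A"
    fact_words = set(lines[0][len("fact: "):].split())
    best = max(lines[2:], key=lambda c: len(fact_words & set(c[4:].split())))
    return best[1]  # the label character inside "(X) ..."

# Invented examples; only the first fact comes from the abstract.
EXAMPLES = [
    {"question": "Which object lets current flow?", "fact": "metals conduct electricity",
     "choices": ["a paper cup", "a wire made of metals", "a glass jar", "a wool sock"],
     "answer": "B"},
    {"question": "Which fern grows best?", "fact": "plants need sunlight to grow",
     "choices": ["a fern kept in sunlight", "a fern kept in a dark closet",
                 "a fern kept in a sealed box", "a fern kept under a blanket"],
     "answer": "A"},
]

print("cue-only:", accuracy(toy_predict, EXAMPLES, use_fact=False))
print("oracle:  ", accuracy(toy_predict, EXAMPLES, use_fact=True))
```

The gap between the two numbers is what the referee asks to see reported. A small gap would suggest the questions can be answered from surface cues, and a large gap would support the claim that the supplied fact matters.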

Circularity Check

0 steps flagged

No circularity: empirical dataset introduction and benchmarking

The paper introduces the OpenBookQA dataset and reports direct empirical evaluations of QA methods against it, human performance, and simple baselines. No equations, parameter fittings, derivations, or self-citations form any load-bearing chain that reduces results to inputs by construction. Claims rest on new data collection and standard accuracy measurements, which are externally verifiable and independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the facts are elementary and accurate, with no free parameters or new entities introduced.

axioms (1)

domain assumption: The provided elementary science facts are accurate and sufficient when combined with common knowledge.
The dataset is built on these facts being correct.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs
cs.CR 2025-11 conditional novelty 8.0

CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without th...
Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Beyond Prediction: Tail-Aware Scheduling for LLM Inference
cs.LG 2026-06 unverdicted novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy
cs.LG 2026-06 conditional novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as...
Parameter-Efficient Fine-Tuning with Learnable Rank
cs.CL 2026-06 unverdicted novelty 7.0

LR-LoRA learns per-layer adapter ranks during training and reports outperforming fixed-rank LoRA and other PEFT baselines on language understanding and commonsense reasoning tasks.
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference
cs.DC 2026-06 unverdicted novelty 7.0

A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training
cs.CL 2026-05 unverdicted novelty 7.0

D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG 2026-05 unverdicted novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
cs.PF 2026-04 unverdicted novelty 7.0

HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
Winner-Take-All Spiking Transformer for Language Modeling
cs.NE 2026-04 unverdicted novelty 7.0

Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
cs.AR 2026-03 unverdicted novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
Path-Constrained Mixture-of-Experts
cs.LG 2026-03 unverdicted novelty 7.0

PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
cs.LG 2026-03 conditional novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
Deep Delta Learning
cs.LG 2026-01 unverdicted novelty 7.0

Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accu...
Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
cs.LG 2025-07 unverdicted novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
PRIMETIME: Limits of LLMs in Temporal Primitives
cs.NE 2025-04 unverdicted novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Federated Co-tuning Framework for Large and Small Language Models
cs.CL 2024-11 unverdicted novelty 7.0

FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG 2024-05 unverdicted novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
SpinQuant: LLM quantization with learned rotations
cs.LG 2024-05 conditional novelty 7.0

SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
cs.CL 2024-02 unverdicted novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
Self-Rewarding Language Models
cs.CL 2024-01 conditional novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
cs.LG 2026-06 unverdicted novelty 6.0

One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
cs.IR 2026-06 unverdicted novelty 6.0

R2LM combines causal attention with a reverse Mamba SSM sidecar to supply right-side context in dLLMs, claiming 2.4x-12.9x throughput gains over bidirectional dLLMs and 1.9x-2.9x over AR baselines while matching or ex...
BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training
cs.LG 2026-06 unverdicted novelty 6.0

BLADE converts influence-based bi-level data selection into a Hessian-free penalized objective with a dynamic reference model, proves first-order convergence, and reports better performance than prior methods on LLM training.
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
cs.CL 2026-06 unverdicted novelty 6.0

LC-QAT achieves data-efficient 2-bit weight-only QAT for LLMs by representing quantized weights as a learned affine transform over discrete vectors, supporting end-to-end optimization from a high-quality PTQ start.
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
cs.LG 2026-06 unverdicted novelty 6.0

LiftQuant enables continuous bit-width LLM quantization via dimensional lifting and projection from a 1-bit lattice, allowing 2.4-bit compression of 70B models that outperforms fixed 2-bit baselines on identical hardware.
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
cs.LG 2026-06 unverdicted novelty 6.0

LiftQuant uses dimensional lifting of weights to a higher-dimensional 1-bit lattice followed by projection to achieve tunable continuous bit-widths in LLM quantization while remaining hardware-friendly.
Do Value Vectors in Deep Layers Need Context from the Residual Stream?
cs.CL 2026-06 unverdicted novelty 6.0

Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
Towards Efficient LLMs Annealing with Principled Sample Selection
cs.CL 2026-05 unverdicted novelty 6.0

DiReCT reformulates LLM annealing sample selection as a constrained optimization problem that enforces per-sample gradient directions aligned with the loss landscape's curvature.
More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
cs.LG 2026-05 unverdicted novelty 6.0

Mixture of Activations mixes activation functions token-adaptively in FFNs via lightweight gates, strictly more expressive than fixed or learnable activations, and yields lower pretraining loss from 0.12B to 2B models.
BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

BitsMoE uses SVD decomposition and activation-aware ILP bit allocation to quantize MoE LLMs at ultra-low bits with reduced accuracy degradation compared to GPTQ.
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
cs.LG 2026-05 conditional novelty 6.0

Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 6.0

Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
cs.LG 2026-05 unverdicted novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
cs.CL 2026-04 unverdicted novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
cs.CL 2026-04 unverdicted novelty 6.0

SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL 2026-04 unverdicted novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG 2026-04 unverdicted novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
cs.NE 2026-04 unverdicted novelty 6.0

BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
cs.LG 2026-03 unverdicted novelty 6.0

M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization
cs.LG 2025-11 unverdicted novelty 6.0

SpecQuant uses outlier smoothing into weights followed by channel-wise low-frequency Fourier truncation to achieve 4-bit quantization of LLaMA-3 8B with only 1.5% zero-shot accuracy loss versus full precision.
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
cs.LG 2025-10 unverdicted novelty 6.0

ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B...
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
cs.LG 2025-10 unverdicted novelty 6.0

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
HyperAdapt: Simple High-Rank Adaptation
cs.LG 2025-09 unverdicted novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
cs.CL 2025-09 unverdicted novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
cs.LG 2025-03 conditional novelty 6.0

Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
LaMI: Augmenting Large Language Models via Late Multi-Image Fusion
cs.CL 2024-06 unverdicted novelty 6.0

LaMI augments LLMs with visual commonsense via late fusion of predictions from multiple text-generated images, outperforming prior augmented LLMs on visual tasks while matching VLMs and preserving or improving NLP per...
An Empirical Study of Mamba-based Language Models
cs.LG 2024-06 accept novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
Lessons from the Trenches on Reproducible Evaluation of Language Models
cs.CL 2024-05 accept novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
Gated Linear Attention Transformers with Hardware-Efficient Training
cs.LG 2023-12 unverdicted novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
cs.LG 2023-10 accept novelty 6.0

SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 80 Pith papers · 1 internal anchor

[1]

Banko, M

M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. In IJCAI

work page 2007
[2]

D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the cnn/daily mail reading comprehension task. In ACL, pages 2358--2367

work page 2016
[3]

D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017 a . Reading wikipedia to answer open-domain questions. In ACL

work page 2017
[4]

Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen. 2017 b . Enhanced lstm for natural language inference. In ACL, pages 1657--1668

work page 2017
[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. 2018. Think you have solved question answering? T ry ARC , the AI2 reasoning challenge. CoRR, abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Clark, O

P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580--2586

work page 2016
[7]

Conneau, D

A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pages 670--680

work page 2017
[8]

AllenNLP: A Deep Semantic Natural Language Processing Platform

M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. 2017. AllenNLP : A deep semantic natural language processing platform. CoRR, abs/1803.07640

work page Pith review arXiv 2017
[9]

Gururangan, S

S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL

work page 2018
[10]

K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693--1701

work page 2015
[11]

F. Hill, A. Bordes, S. Chopra, and J. Weston. 2016. The goldilocks principle: Reading children's books with explicit memory representations. In ICLR

work page 2016
[12]

Hoeffding

W. Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13--30

work page 1963
[13]

Jansen, N

P. Jansen, N. Balasubramanian, M. Surdeanu, and P. Clark. 2016. What's in an explanation? characterizing knowledge and inference requirements for elementary science exams. In COLING

work page 2016
[14]

P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. 2018. WorldTree : A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC

work page 2018
[15]

T. Jenkins. 1995. Open book assessment in computing degree programmes 1. Technical Report 95.28, University of Leeds

work page 1995
[16]

Joshi, E

M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension. In ACL, pages 1601--1611

work page 2017
[17]

Kembhavi, M

A. Kembhavi, M. J. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In CVPR, pages 5376--5384

work page 2017
[18]

Khashabi, S

D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL

[19]

D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI

[20]

T. Khot, A. Sabharwal, and P. Clark. 2017. Answering complex questions using open information extraction. In ACL

[21]

T. Khot, A. Sabharwal, and P. Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI

[22]

D. P. Kingma and J. L. Ba. 2015. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, pages 1--15

[23]

T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040

[24]

J. Landsberger. 1996. Study guides and strategies. http://www.studygs.net/tsttak7.htm

[25]

T. Mihaylov and A. Frank. 2016. Discourse relation sense classification using cross-argument semantic similarity based on word embeddings. In CoNLL-16 shared task, pages 100--107

[26]

T. Mihaylov and A. Frank. 2017. Story Cloze Ending Selection Baselines and Data Examination. In LSDSem – Shared Task

[27]

T. Mihaylov and A. Frank. 2018. Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge. In ACL, pages 821--832

[28]

T. Mihaylov and P. Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval '16

[29]

G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39--41

[30]

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235--244

[31]

B. D. Mishra, L. Huang, N. Tandon, W.-t. Yih, and P. Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In NAACL

[32]

N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. 2016. A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories. In NAACL

[33]

P. Nakov, L. Màrquez, A. Moschitti, W. Magdy, H. Mubarak, A. A. Freihat, J. Glass, and B. Randeree. 2016. SemEval-2016 task 3: Community question answering. In SemEval '16, pages 525--545

[34]

T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In EMNLP, pages 2230--2235, Austin, Texas

[35]

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W

[36]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825--2830

[37]

J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532--1543

[38]

M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL

[39]

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383--2392

[40]

M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, pages 193--203

[41]

P. Singh, T. Lin, E. Mueller, G. Lim, T. Perkins, and W. Zhu. 2002. Open mind common sense: Knowledge acquisition from the general public. In Lecture Notes in Computer Science, volume 2519, pages 1223--1237

[42]

R. Speer, J. Chin, and C. Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI

[43]

K. Stasaski and M. A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In BEA@EMNLP, 12th Workshop on Innovative Use of NLP for Building Educational Applications

[44]

S. Sugawara, H. Yokono, and A. Aizawa. 2017. Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In AAAI, pages 3089--3096

[45]

A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191--200

[46]

P. D. Turney. 2017. Leveraging term banks for answering complex questions: A case for sparse vectors. CoRR, abs/1704.03543

[47]

D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL, pages 271--280

[48]

J. Welbl, P. Stenetorp, and S. Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL

[49]

Y. Zhang, H. Dai, K. Toraman, and L. Song. 2018. KG^2: Learning to Reason Science Exam Questions with Contextual Knowledge Graph Embeddings. In arXiv
