hub

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang · 2016 · cs.CL · arXiv 1606.05250

35 Pith papers cite this work. Polarity classification is still indexing.

35 Pith papers citing it

open full Pith review browse 35 citing papers arXiv PDF

abstract

We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

claims ledger

abstract We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) i

co-cited works

representative citing papers

Online Learning-to-Defer with Varying Experts

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

Passage Re-ranking with BERT

cs.IR · 2019-01-13 · unverdicted · novelty 8.0

Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

cs.CR · 2026-05-07 · unverdicted · novelty 7.0

PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.

Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.

Multitask Prompted Training Enables Zero-Shot Task Generalization

cs.LG · 2021-10-15 · conditional · novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

cs.LG · 2021-01-11 · accept · novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

cs.CL · 2019-10-29 · accept · novelty 7.0

BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

cs.CL · 2016-11-28 · accept · novelty 7.0

MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

cs.LG · 2026-04-26 · conditional · novelty 6.0 · 2 refs

Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.

Are Large Language Models Economically Viable for Industry Deployment?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoMo10 benchmark.

Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.

Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.

MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

cs.DC · 2026-04-10 · unverdicted · novelty 6.0

Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.

GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

GCoT-decoding combines Fibonacci sampling, heuristic backtracking, span-based confidence scoring, and semantic consensus aggregation to enable general chain-of-thought reasoning without task-specific prompts.

JU\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

cs.IR · 2026-04-07 · accept · novelty 6.0

JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.

Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

citing papers explorer

Showing 35 of 35 citing papers.

Online Learning-to-Defer with Varying Experts stat.ML · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Passage Re-ranking with BERT cs.IR · 2019-01-13 · unverdicted · none · ref 10 · internal anchor
Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts cs.CR · 2026-05-07 · unverdicted · none · ref 47 · internal anchor
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations cs.LG · 2026-05-04 · unverdicted · none · ref 154 · internal anchor
TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation cs.CL · 2026-04-12 · unverdicted · none · ref 4 · internal anchor
Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.
Multitask Prompted Training Enables Zero-Shot Task Generalization cs.LG · 2021-10-15 · conditional · none · ref 46 · internal anchor
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity cs.LG · 2021-01-11 · accept · none · ref 27 · internal anchor
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension cs.CL · 2019-10-29 · accept · none · ref 18 · internal anchor
BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 58 · internal anchor
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset cs.CL · 2016-11-28 · accept · none · ref 12 · internal anchor
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation cs.CL · 2026-05-09 · unverdicted · none · ref 71 · internal anchor
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls cs.CL · 2026-05-04 · unverdicted · none · ref 87 · internal anchor
Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation cs.LG · 2026-04-26 · conditional · none · ref 30 · 2 links · internal anchor
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.
Are Large Language Models Economically Viable for Industry Deployment? cs.CL · 2026-04-21 · unverdicted · none · ref 56 · internal anchor
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents cs.CL · 2026-04-20 · unverdicted · none · ref 10 · internal anchor
HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoMo10 benchmark.
Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents cs.CL · 2026-04-20 · unverdicted · none · ref 17 · internal anchor
DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs cs.CL · 2026-04-14 · unverdicted · none · ref 54 · internal anchor
Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation cs.CV · 2026-04-10 · unverdicted · none · ref 78 · internal anchor
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning cs.LG · 2026-04-10 · unverdicted · none · ref 115 · internal anchor
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures cs.DC · 2026-04-10 · unverdicted · none · ref 40 · internal anchor
Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
GCoT-decoding combines Fibonacci sampling, heuristic backtracking, span-based confidence scoring, and semantic consensus aggregation to enable general chain-of-thought reasoning without task-specific prompts.
JU\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections cs.IR · 2026-04-07 · accept · none · ref 23 · internal anchor
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning cs.AI · 2026-04-07 · unverdicted · none · ref 29 · internal anchor
GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models cs.CL · 2024-05-27 · accept · none · ref 143 · internal anchor
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 168 · internal anchor
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Textbooks Are All You Need II: phi-1.5 technical report cs.CL · 2023-09-11 · unverdicted · none · ref 20 · internal anchor
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 187 · internal anchor
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 94 · internal anchor
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity cs.LG · 2026-05-01 · unverdicted · none · ref 7 · internal anchor
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices cs.AR · 2026-04-26 · conditional · none · ref 9 · internal anchor
Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 73 · internal anchor
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 45 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators cs.AR · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking cs.CL · 2026-01-08 · unverdicted · none · ref 20 · internal anchor
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
GLU Variants Improve Transformer cs.LG · 2020-02-12 · unverdicted · none · ref 6 · internal anchor
Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer