hub Mixed citations

SQ u AD : 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang · 2016 · Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing · DOI 10.18653/v1/d16-1264

Mixed citation behavior. Most common role is dataset (43%).

59 Pith papers citing it

2,627 external citations · Crossref

Dataset 43% of classified citations

open at publisher browse 59 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 3 dataset 3 method 1

citation-polarity summary

use dataset 3 background 2 unclear 1 use method 1

representative citing papers

Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE

cs.CY · 2026-02-19 · accept · novelty 8.0

A survey of 172 open educational datasets from 204 papers across LAK, EDM, and AIED conferences reveals trends, 143 previously uncatalogued datasets, field gaps, and an 8-item PRACTICE checklist for better data publication.

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

cs.AI · 2021-08-31 · accept · novelty 8.0

MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.

TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.

SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

cs.IT · 2026-05-01 · unverdicted · novelty 7.0

SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

cs.CL · 2026-04-25 · conditional · novelty 7.0 · 2 refs

A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.

Sampling from Your Language Model One Byte at a Time

cs.CL · 2025-06-17 · unverdicted · novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

cs.CL · 2021-10-04 · unverdicted · novelty 7.0

Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.

The Power of Scale for Parameter-Efficient Prompt Tuning

cs.CL · 2021-04-18 · unverdicted · novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

cs.CL · 2019-05-24 · accept · novelty 7.0

BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

ARIADNE routes queries to the best adapter via embedding-space centroid proximity, recovering 97.44% of upper-bound performance on 23 NLP tasks and 89.7% selection accuracy on 44 tasks without training or internal access.

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Soft-prompt tuning with 10 vectors improves format compliance on LLM benchmarks and provides a low-cost proxy for comparing base models.

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

TN-gram replaces per-order hash tables in n-gram memory modules with a CP tensor factorization that shares token-position factors and uses order-absorption vectors, achieving comparable or better performance with fewer parameters.

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.

Boosting Self-Consistency with Ranking

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.

citing papers explorer

Showing 50 of 59 citing papers.

Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE cs.CY · 2026-02-19 · accept · none · ref 269
A survey of 172 open educational datasets from 204 papers across LAK, EDM, and AIED conferences reveals trends, 143 previously uncatalogued datasets, field gaps, and an 8-item PRACTICE checklist for better data publication.
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics cs.AI · 2021-08-31 · accept · none · ref 11
MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.
RoFormer: Enhanced Transformer with Rotary Position Embedding cs.CL · 2021-04-20 · accept · none · ref 14
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving cs.AI · 2026-06-04 · unverdicted · none · ref 49
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 37
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints cs.CL · 2026-05-09 · unverdicted · none · ref 20
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference cs.LG · 2026-05-05 · unverdicted · none · ref 21
Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations cs.LG · 2026-05-04 · unverdicted · none · ref 112
TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass cs.IT · 2026-05-01 · unverdicted · none · ref 69
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective cs.CL · 2026-04-25 · conditional · none · ref 55 · 2 links
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
Sampling from Your Language Model One Byte at a Time cs.CL · 2025-06-17 · unverdicted · none · ref 57
An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 261
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 101
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 189
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA cs.CL · 2021-10-04 · unverdicted · none · ref 28
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
The Power of Scale for Parameter-Efficient Prompt Tuning cs.CL · 2021-04-18 · unverdicted · none · ref 79
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations cs.CL · 2019-09-26 · accept · none · ref 29
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions cs.CL · 2019-05-24 · accept · none · ref 29
BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection cs.AI · 2026-06-17 · unverdicted · none · ref 42
ARIADNE routes queries to the best adapter via embedding-space centroid proximity, recovering 97.44% of upper-bound performance on 23 NLP tasks and 89.7% selection accuracy on 44 tasks without training or internal access.
Redesign Mixture-of-Experts Routers with Manifold Power Iteration cs.LG · 2026-06-10 · unverdicted · none · ref 40
Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.
Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation cs.CL · 2026-06-10 · unverdicted · none · ref 41
Soft-prompt tuning with 10 vectors improves format compliance on LLM benchmarks and provides a low-cost proxy for comparing base models.
Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs cs.CL · 2026-06-06 · unverdicted · none · ref 30
TN-gram replaces per-order hash tables in n-gram memory modules with a CP tensor factorization that shares token-position factors and uses order-absorption vectors, achieving comparable or better performance with fewer parameters.
Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions cs.CL · 2026-06-04 · unverdicted · none · ref 14
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.
Boosting Self-Consistency with Ranking cs.CL · 2026-06-03 · unverdicted · none · ref 160
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain cs.AI · 2026-06-02 · unverdicted · none · ref 6
InfoMem is an answer-conditioned information gain reward for RL training of long-context memory agents that improves performance when applied to successful trajectories and normalized.
Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents cs.AI · 2026-05-29 · unverdicted · none · ref 3
Introduces CoSee auditing framework and identifies Noise Reinforcement and Policy Collapse as dominant failure modes when weak 4B-8B models use shared state for multi-page visual QA.
Graph Alignment Topology as an Inductive Bias for Grounding Detection cs.CL · 2026-05-21 · unverdicted · none · ref 31
A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods and GPT-4o.
Predictive Prefetching for Retrieval-Augmented Generation cs.CL · 2026-05-18 · unverdicted · none · ref 40
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head cs.CL · 2026-05-12 · unverdicted · none · ref 31
PRISM supplies a geometric upper bound on LLM variant risk that splits drift into scale, shape, and head axes and doubles as a differentiable regularizer against forgetting.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference cs.CR · 2026-05-06 · conditional · none · ref 65
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching cs.AI · 2026-04-16 · unverdicted · none · ref 28
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation cs.CL · 2025-07-20 · unverdicted · none · ref 20
PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.
Should We Still Pretrain Encoders with Masked Language Modeling? cs.CL · 2025-07-01 · accept · none · ref 35
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 168
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation cs.LG · 2024-12-26 · unverdicted · none · ref 32
SyMerge merges models via single-layer adaptation and expert-guided self-labeling to achieve task synergy, reporting SOTA results on vision, dense prediction, and NLP tasks.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP cs.CL · 2024-11-08 · unverdicted · none · ref 53
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 151
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations cs.CL · 2023-05-23 · conditional · none · ref 108
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning cs.CL · 2023-03-18 · unverdicted · none · ref 19
AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation especially under tight budgets.
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models cs.CL · 2023-03-15 · unverdicted · none · ref 28
SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems cs.CL · 2019-05-02 · accept · none · ref 136
SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding cs.CL · 2026-06-10 · unverdicted · none · ref 70
VIA-SD adds a routed slim-verifier tier between direct acceptance and full-model verification in speculative decoding, cutting rejection rates 0.10-0.22 and yielding 10-20% speedups over prior SD methods.
Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection cs.CL · 2026-06-03 · unverdicted · none · ref 51
Survey of 155 researchers finds 44% observed LLM usage in crowdsourced data, with high awareness but insufficient mitigation efforts.
OCC-RAG: Optimal Cognitive Core for Faithful Question Answering cs.CL · 2026-05-30 · unverdicted · none · ref 44
OCC-RAG develops task-specialized SLMs (0.6B and 1.7B) via a new synthetic data pipeline for multi-hop reasoning and context faithfulness, claiming to match or exceed 2-6x larger general models on HotpotQA, MuSiQue, TAT-QA, ConFiQA, and MuSiQue-Un.
Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets cs.CL · 2026-05-06 · unverdicted · none · ref 32
An evidence-based model generates queries from query-free datasets, yielding summaries with competitive ROUGE scores to those using original queries.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity cs.LG · 2026-05-01 · unverdicted · none · ref 7
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models cs.CL · 2026-04-29 · unverdicted · none · ref 11
Selective pruning of low-activation neurons in task-specific LLMs preserves accuracy better than random pruning, but removing roughly 10% of highly selective neurons triggers total collapse, with fine-tuning recovering much of the lost performance.
Analyzing the Effect of Noise in LLM Fine-tuning cs.LG · 2026-04-14 · unverdicted · none · ref 15
Label noise hurts fine-tuning performance most while grammatical and typographical noise sometimes act as mild regularizers, with changes concentrated in task-specific layers.
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning cs.CL · 2026-04-10 · unverdicted · none · ref 30
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 241
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

SQ u AD : 100,000+ questions for machine comprehension of text

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer