RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings, which encode absolute positions with rotation matrices while injecting relative position dependence directly into self-attention, outperforming prior position-embedding methods on long-text classification tasks.
26 Pith papers cite this work, alongside 2,627 external citations.
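The rotary mechanism is compact enough to sketch directly. Below is a minimal NumPy illustration (not the paper's implementation): each consecutive pair of embedding dimensions is rotated by a position-dependent angle, so the dot product between a rotated query and a rotated key depends only on their relative offset.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embedding to vectors x of shape (seq, dim).

    Each pair of dimensions (2i, 2i+1) is rotated by the angle
    position * base**(-2i/dim).
    """
    seq, dim = x.shape
    assert dim % 2 == 0
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = np.outer(positions, inv_freq)             # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the score <R(m)q, R(n)k> depends only on m - n.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
a = rotary_embed(q, [3])[0] @ rotary_embed(k, [5])[0]
b = rotary_embed(q, [10])[0] @ rotary_embed(k, [12])[0]
print(np.isclose(a, b))  # same relative offset, same attention score
```

Because each rotation is orthogonal, vector norms are preserved, so only the query-key angles (and hence attention scores) carry the positional signal.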
Representative citing papers
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
The GAIA benchmark shows that humans reach 92% accuracy on conceptually simple real-world questions while current AI systems reach only 15%, and proposes closing this gap as a key milestone on the path to general AI.
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation, especially under tight budgets.
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
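The SVD-based budget allocation summarized for AdaLoRA above can be illustrated with a toy sketch. This is a simplified magnitude-only stand-in: AdaLoRA's actual importance scores also incorporate gradient sensitivity, which is omitted here.

```python
import numpy as np

def allocate_ranks(matrices, budget):
    """Toy global rank allocation: pool singular values from all update
    matrices, keep the `budget` largest overall, and count how many
    each matrix retains. (Magnitude-only proxy for per-matrix importance.)"""
    pooled = []
    for idx, W in enumerate(matrices):
        s = np.linalg.svd(W, compute_uv=False)
        pooled.extend((sv, idx) for sv in s)
    pooled.sort(reverse=True)
    ranks = [0] * len(matrices)
    for _, idx in pooled[:budget]:
        ranks[idx] += 1
    return ranks

rng = np.random.default_rng(1)
# One genuinely low-rank update matrix and one small, diffuse one.
A = rng.normal(size=(16, 4)) @ rng.normal(size=(4, 16))  # rank 4, large values
B = 0.1 * rng.normal(size=(16, 16))                      # small, spread values
print(allocate_ranks([A, B], budget=6))
```

Under this magnitude criterion the low-rank matrix absorbs its four strong directions first, and the remaining budget spills over to the diffuse matrix, which is the behavior uniform allocation cannot reproduce.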
Citing papers explorer
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
  dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints
  EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
- PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
  PACZero achieves zero mutual information privacy for LLM fine-tuning via sign-quantized zeroth-order gradients, delivering near-non-private accuracy on SST-2 and SQuAD at I=0.
- Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
  Two calls per example identify the first two moments of the latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
- TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations
  TCD-Arena is a customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods, and shows that ensembles can boost overall robustness.
- SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
  SENECA uses a novel self-consistent missing-mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
  Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
- PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
  PRISM supplies a geometric upper bound on LLM variant risk that splits drift into scale, shape, and head axes and doubles as a differentiable regularizer against forgetting.
- On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
  An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
- KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
  The KnowledgeBerg benchmark shows that open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.
- Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
  Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
- Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets
  An evidence-based model generates queries for query-free datasets, yielding summaries whose ROUGE scores are competitive with those produced from the original queries.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
  EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
- Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models
  Selective pruning of low-activation neurons in task-specific LLMs preserves accuracy better than random pruning, but removing roughly 10% of highly selective neurons triggers total collapse, with fine-tuning recovering much of the lost performance.
- Analyzing the Effect of Noise in LLM Fine-tuning
  Label noise hurts fine-tuning performance most, while grammatical and typographical noise sometimes act as mild regularizers, with changes concentrated in task-specific layers.
- Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
  Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training-distribution similarity rather than true quality.
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
  Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
- Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
  A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval, and reranking with Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.
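The two-call moment identification behind "Two Calls, Two Moments" is easy to verify in simulation. In the sketch below, the Beta prior over the latent correctness probability is an arbitrary choice for the demo; with two conditionally independent calls per example, the fraction of correct first calls estimates E[p] and the fraction of examples where both calls are correct estimates E[p^2].

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 200_000

# Latent per-example correctness probability (Beta prior is illustrative only).
p = rng.beta(2.0, 2.0, size=n_examples)

# Two i.i.d. calls per example, each correct with probability p.
c1 = rng.random(n_examples) < p
c2 = rng.random(n_examples) < p

m1 = c1.mean()         # estimates E[p]
m2 = (c1 & c2).mean()  # estimates E[p^2]: calls are conditionally independent

print(m1, m2)  # approx 0.5 and approx 0.30 (Var + mean^2 = 0.05 + 0.25)
```

Those two moments are exactly the inputs the paper's bounds consume; everything about majority-vote accuracy at larger budgets is then constrained by them.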
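SENECA's self-consistent estimator is not reproduced here, but the small-sample failure it targets shows up in a generic Good-Turing check: the plug-in entropy of an undersampled uniform distribution falls well short of the true value, and the singleton-based missing-mass estimate quantifies the unseen probability that a correction must account for.

```python
import numpy as np
from collections import Counter

def plugin_entropy(counts):
    """Naive maximum-likelihood (plug-in) entropy in nats."""
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -(p * np.log(p)).sum()

def goodturing_missing_mass(counts, n):
    """Good-Turing estimate of unseen mass: (# singletons) / n."""
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n

rng = np.random.default_rng(0)
alphabet = 1000
true_p = np.ones(alphabet) / alphabet
n = 200  # far fewer samples than symbols
sample = rng.choice(alphabet, size=n, p=true_p)
counts = Counter(sample.tolist())

H_plugin = plugin_entropy(counts)
P0 = goodturing_missing_mass(counts, n)
print(H_plugin, np.log(alphabet), P0)
# plug-in entropy sits well below the true log(1000) ~ 6.91; P0 flags
# the large unseen mass driving that bias
```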
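The sharp-versus-flat intuition behind EPGS can also be illustrated with a toy sensitivity probe. Note this sketch uses loss differences under random embedding perturbations as the sharpness proxy, not EPGS's actual gradient-based score, and the quadratic "losses" are stand-ins.

```python
import numpy as np

def perturbation_sensitivity(loss_fn, emb, sigma=0.01, trials=64, rng=None):
    """Average loss increase under small Gaussian perturbations of an
    embedding: a cheap proxy for local curvature (large increase means
    a sharp minimum, small increase a flat one)."""
    rng = rng or np.random.default_rng(0)
    base = loss_fn(emb)
    deltas = [loss_fn(emb + sigma * rng.normal(size=emb.shape)) - base
              for _ in range(trials)]
    return float(np.mean(deltas))

# Two toy loss landscapes around a minimum at the origin.
sharp = lambda e: 100.0 * float(e @ e)  # high curvature
flat = lambda e: 0.1 * float(e @ e)     # low curvature

e0 = np.zeros(8)
s_sharp = perturbation_sensitivity(sharp, e0)
s_flat = perturbation_sensitivity(flat, e0)
print(s_sharp > s_flat)  # the sharp minimum reacts far more strongly
```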