hub Mixed citations

Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese · 2024 · cs.CL · arXiv 2411.04368

Mixed citation behavior. Most common role is background (62%).

51 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 51 citing papers arXiv PDF

abstract

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 2 method 1

citation-polarity summary

background 5 use dataset 2 use method 1

representative citing papers

Reported Confidence in LLMs Tracks Commitment More Than Correctness

cs.LG · 2026-06-28 · unverdicted · novelty 8.0

Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

Metadata, Structure, or Strategy? A Decomposition of RAG Context Enrichment

cs.IR · 2026-06-28 · unverdicted · novelty 7.0

Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.

PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

cs.IR · 2026-06-01 · unverdicted · novelty 7.0

PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.

NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

NameBERT models trained on LLM-augmented academic name data outperform state-of-the-art baselines in nationality classification from names, with augmentation providing gains especially on tail countries.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

cs.CL · 2026-03-17 · unverdicted · novelty 7.0 · 2 refs

CounterRefine improves factual QA accuracy by up to 5.8 points on SimpleQA through answer-conditioned counterevidence retrieval and validated refinement with minimal output changes.

Evaluating the Search Agent in a Parallel World

cs.AI · 2026-03-05 · unverdicted · novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework

cs.CL · 2025-09-25 · conditional · novelty 7.0

Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeoff may be an artifact of task-agnostic measurement.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

MixSD uses dynamic mixing of the model's expert and naive conditionals to create distribution-aligned supervision that improves the memorization-retention tradeoff over standard SFT.

Decomposing and Steering Functional Metacognition in Large Language Models

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

Evaluation of Agents under Simulated AI Marketplace Dynamics

cs.IR · 2026-04-15 · unverdicted · novelty 6.0

Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

cs.CL · 2026-04-09 · conditional · novelty 6.0

Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.

WRAP++: Web discoveRy Amplified Pretraining

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

WRAP++ amplifies Wikipedia data from 8.4B to 80B tokens by creating cross-document QA from hyperlink motifs, yielding better SimpleQA performance and scaling for 7B and 32B OLMo models than single-document methods.

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

cs.CR · 2026-04-07 · unverdicted · novelty 6.0

Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.

Causal Evidence that Language Models use Confidence to Drive Behavior

cs.LG · 2026-03-23 · unverdicted · novelty 6.0

Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

cs.LG · 2026-02-08 · conditional · novelty 6.0

OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 83 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

Measuring short-form factuality in large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer