Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese · 2024 · cs.CL · arXiv 2411.04368

Mixed citation behavior. Most common role is background (67%).

69 Pith papers citing it

abstract

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
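The three-way grading described in the abstract can be tallied with a short sketch like the one below. It is an illustration only, not the grader from the simple-evals repository: the label strings, the `summarize` function, and the "correct given attempted" ratio are assumptions chosen to mirror the abstract's point that a model should answer correctly where it can and decline where it is unsure.

```python
from collections import Counter

# Hypothetical label strings for the three grades named in the abstract.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Return the share of each grade plus accuracy on attempted questions.

    `grades` is a list of per-question labels. The extra ratio
    `correct_given_attempted` is an illustrative assumption: it ignores
    questions the model declined to answer.
    """
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    summary = {g: (counts[g] / total if total else 0.0) for g in GRADES}
    summary["correct_given_attempted"] = (
        counts["correct"] / attempted if attempted else 0.0
    )
    return summary


# Made-up example: 3 correct, 1 incorrect, 1 not attempted.
example = ["correct"] * 3 + ["incorrect"] + ["not_attempted"]
result = summarize(example)
print(result)
```

Separating overall correctness from correctness on attempted questions makes it visible when a model gains accuracy simply by declining more often.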

citation-role summary

background 6 · dataset 2 · method 1

citation-polarity summary

background 6 · use dataset 2 · use method 1
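The 67% background figure quoted for this hub appears to follow from the role counts above, which sum to 9 classified citations. A minimal sketch of that arithmetic follows; the `role_shares` helper and the dictionary layout are illustrative assumptions, not the hub's actual code.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 6, "dataset": 2, "method": 1}


def role_shares(counts):
    """Return each role's fraction of all classified citations."""
    total = sum(counts.values())
    return {role: n / total for role, n in counts.items()}


shares = role_shares(role_counts)
for role, share in shares.items():
    print(f"{role}: {share:.0%}")
# background: 67%
# dataset: 22%
# method: 11%
```

The shares are taken over classified citations only, so they describe 9 citations rather than all 69 citing papers.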

representative citing papers

Reported Confidence in LLMs Tracks Commitment More Than Correctness

cs.LG · 2026-06-28 · unverdicted · novelty 8.0

Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

cs.CL · 2026-06-11 · unverdicted · novelty 8.0

LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

cs.AI · 2026-07-02 · accept · novelty 7.0

Pre-Flight is a new 300-question benchmark where top LLMs reach 82.7% accuracy against an informal expert reference of ~95%, leaving a persistent gap.

Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.

Metadata, Structure, or Strategy? A Decomposition of RAG Context Enrichment

cs.IR · 2026-06-28 · unverdicted · novelty 7.0

Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

cs.IR · 2026-06-01 · unverdicted · novelty 7.0

PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.

NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

NameBERT models trained on LLM-augmented academic name data outperform state-of-the-art baselines in nationality classification from names, with augmentation providing gains especially on tail countries.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

cs.CL · 2026-03-17 · unverdicted · novelty 7.0 · 2 refs

CounterRefine improves factual QA accuracy by up to 5.8 points on SimpleQA through answer-conditioned counterevidence retrieval and validated refinement with minimal output changes.

Evaluating the Search Agent in a Parallel World

cs.AI · 2026-03-05 · unverdicted · novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework

cs.CL · 2025-09-25 · conditional · novelty 7.0

Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeoff may be an artifact of task-agnostic measurement.

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

MSQA benchmark shows LLMs exhibit cultural degradation where competence tracks pre-training data exposure more than reasoning ability, and inference fixes like sampling or retrieval do not close the gap.

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

CALIBER elicits and supervises pre-reasoning confidence with prompt-level success probability and post-reasoning confidence with answer-level correctness, cutting ECE by 52.5% on BigMathDigits for a 7B model while remaining competitive on accuracy.

You Don't Need to Run Every Eval

cs.LG · 2026-06-22 · conditional · novelty 6.0

The benchmark score matrix of 84 models on 133 tasks is approximately rank-2; BenchPress recovers held-out scores to within 4.6 points and identifies 5-benchmark subsets that predict the full scorecard to within 3.93-4.55 points.

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

cs.AI · 2026-06-17 · conditional · novelty 6.0

DSG decouples search grounding from LLM reasoning via an MCP-compatible gateway, nearly matching native accuracy on QA benchmarks at 91% lower cost while preserving output contracts and cutting production costs by over 98%.

citing papers explorer

Showing 12 of 12 citing papers after filters.

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

cs.AI · 2026-07-02 · accept · none · ref 10 · internal anchor

Pre-Flight is a new 300-question benchmark where top LLMs reach 82.7% accuracy against an informal expert reference of ~95%, leaving a persistent gap.

Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

cs.AI · 2026-07-02 · unverdicted · none · ref 28 · internal anchor

Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · none · ref 127 · internal anchor

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

Evaluating the Search Agent in a Parallel World

cs.AI · 2026-03-05 · unverdicted · none · ref 23 · internal anchor

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

cs.AI · 2026-06-17 · conditional · none · ref 17 · internal anchor

DSG decouples search grounding from LLM reasoning via an MCP-compatible gateway, nearly matching native accuracy on QA benchmarks at 91% lower cost while preserving output contracts and cutting production costs by over 98%.

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

cs.AI · 2026-04-18 · unverdicted · none · ref 46 · internal anchor

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

cs.AI · 2026-06-16 · unverdicted · none · ref 26 · internal anchor

DeepInsight introduces a unified evaluation infrastructure for the full Physical AI stack using three invariant abstractions to enable cross-layer diagnostics on one runtime.

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

cs.AI · 2026-06-15 · unverdicted · none · ref 5 · internal anchor

SpecAlign synthesizes boundary-aware preference pairs directly from structured model specifications to train LLMs for improved rule compliance.

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

cs.AI · 2026-05-21 · unverdicted · none · ref 125 · internal anchor

Converts impossibility theorems into architecture-dependent accuracy ceilings and design rules for transformers and other AI subfields, with the Deterministic Horizon measured at 19-31 across twelve models.

Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

cs.AI · 2026-06-11 · unverdicted · none · ref 8 · internal anchor

Brick routes queries to LLMs using capability scores and difficulty estimates, reaching 76.98% accuracy at max-quality and 4.71x lower cost at neutral profile on 5,504 queries versus always using the strongest model.

EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

cs.AI · 2026-04-09 · unverdicted · none · ref 26 · internal anchor

Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

VeRO: A Harness for Agents to Optimize Agents

cs.AI · 2026-02-25 · unreviewed · ref 25 · internal anchor
