Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.
Measuring short-form factuality in large language models
Mixed citation behavior. Most common role is background (67%).
abstract
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
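The three-way grading described in the abstract can be illustrated with a small tally. This is only a minimal sketch: the label strings, the `summarize` helper, and the "correct given attempted" ratio are assumptions chosen for illustration and are not taken from the abstract or from the simple-evals repository.

```python
from collections import Counter

# Hypothetical label strings for the three grades named in the abstract.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Return the share of each grade, plus accuracy over attempted questions."""
    grades = list(grades)
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = len(grades)
    # A question counts as attempted if it was graded correct or incorrect.
    attempted = counts["correct"] + counts["incorrect"]

    summary = {g: (counts[g] / total if total else 0.0) for g in GRADES}
    # Illustrative ratio (an assumption): rewards abstaining over guessing wrong.
    summary["correct_given_attempted"] = (
        counts["correct"] / attempted if attempted else 0.0
    )
    return summary


if __name__ == "__main__":
    print(summarize(["correct", "correct", "incorrect", "not_attempted"]))
```

Under this sketch, a model that abstains on questions it would otherwise get wrong keeps its share of correct answers unchanged while raising its accuracy over attempted questions, which matches the "ideal behavior" the abstract describes.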
representative citing papers
LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.
Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.
Pre-Flight is a new 300-question benchmark where top LLMs reach 82.7% accuracy against an informal expert reference of ~95%, leaving a persistent gap.
Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.
Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.
LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.
NameBERT models trained on LLM-augmented academic name data outperform state-of-the-art baselines in nationality classification from names, with augmentation providing gains especially on tail countries.
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
CounterRefine improves factual QA accuracy by up to 5.8 points on SimpleQA through answer-conditioned counterevidence retrieval and validated refinement with minimal output changes.
Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeoff may be an artifact of task-agnostic measurement.
MSQA benchmark shows LLMs exhibit cultural degradation where competence tracks pre-training data exposure more than reasoning ability, and inference fixes like sampling or retrieval do not close the gap.
RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.
The benchmark score matrix of 84 models on 133 tasks is approximately rank-2; BenchPress recovers held-out scores to within 4.6 points and identifies 5-benchmark subsets that predict the full scorecard to within 3.93-4.55 points.
DSG decouples search grounding from LLM reasoning via an MCP-compatible gateway, nearly matching native accuracy on QA benchmarks at 91% lower cost while preserving output contracts and cutting production costs by over 98%.
MixSD uses dynamic mixing of the model's expert and naive conditionals to create distribution-aligned supervision that improves the memorization-retention tradeoff over standard SFT.
citing papers explorer
- Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge
Pre-Flight is a new 300-question benchmark where top LLMs reach 82.7% accuracy against an informal expert reference of ~95%, leaving a persistent gap.
- Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.
- Can AI Agents Synthesize Scientific Conclusions?
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
- Evaluating the Search Agent in a Parallel World
Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.
- Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents
DSG decouples search grounding from LLM reasoning via an MCP-compatible gateway, nearly matching native accuracy on QA benchmarks at 91% lower cost while preserving output contracts and cutting production costs by over 98%.
- Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
- DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack
DeepInsight introduces a unified evaluation infrastructure for the full Physical AI stack using three invariant abstractions to enable cross-layer diagnostics on one runtime.
- SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data
SpecAlign synthesizes boundary-aware preference pairs directly from structured model specifications to train LLMs for improved rule compliance.
- The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
Converts impossibility theorems into architecture-dependent accuracy ceilings and design rules for transformers and other AI subfields, with the Deterministic Horizon measured at 19-31 across twelve models.
- Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm
Brick routes queries to LLMs using capability scores and difficulty estimates, reaching 76.98% accuracy at max-quality and 4.71x lower cost at neutral profile on 5,504 queries versus always using the strongest model.
- EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.
- VeRO: A Harness for Agents to Optimize Agents