hub Mixed citations

Measuring short-form factuality in large language models

Mixed citation behavior. Most common role is background (67%).

67 Pith papers citing it
Background 67% of classified citations
abstract

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
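The three-way grading described in the abstract can be illustrated with a small tally. This is a minimal sketch, not the benchmark's own scoring code: the label strings, the `summarize` function, and the "correct given attempted" ratio are assumptions introduced here for illustration, on the reasoning that a model which declines to answer when unsure should not be scored the same as one that answers wrongly.

```python
from collections import Counter

# Hypothetical label strings for the three grades named in the abstract.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Return the share of each grade plus accuracy over attempted questions.

    `grades` is a list of label strings, one per question. All names here
    are illustrative assumptions, not part of the SimpleQA codebase.
    """
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    def share(n, d):
        # Avoid division by zero on empty input or when nothing was attempted.
        return n / d if d else 0.0

    return {
        "correct": share(counts["correct"], total),
        "incorrect": share(counts["incorrect"], total),
        "not_attempted": share(counts["not_attempted"], total),
        "correct_given_attempted": share(counts["correct"], attempted),
    }


if __name__ == "__main__":
    example = ["correct", "correct", "incorrect", "not_attempted"]
    print(summarize(example))
```

Reporting the abstention share alongside accuracy on attempted questions makes it visible whether a model gains accuracy by declining the questions it is unsure about.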

citation-role summary

background 6 · dataset 2 · method 1

representative citing papers

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

Evaluating the Search Agent in a Parallel World

cs.AI · 2026-03-05 · unverdicted · novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

You Don't Need to Run Every Eval

cs.LG · 2026-06-22 · conditional · novelty 6.0

The benchmark score matrix of 84 models on 133 tasks is approximately rank-2; BenchPress recovers held-out scores to within 4.6 points and identifies 5-benchmark subsets that predict the full scorecard to within 3.93-4.55 points.
