pith. sign in

hub Mixed citations

Measuring short-form factuality in large language models

Mixed citation behavior. Most common role is background (62%).

51 Pith papers citing it
Background 62% of classified citations
abstract

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.

hub tools

citation-role summary

background 5 dataset 2 method 1

citation-polarity summary

clear filters

representative citing papers

Evaluating the Search Agent in a Parallel World

cs.AI · 2026-03-05 · unverdicted · novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

Evaluation of Agents under Simulated AI Marketplace Dynamics

cs.IR · 2026-04-15 · unverdicted · novelty 6.0

Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.

WRAP++: Web discoveRy Amplified Pretraining

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

WRAP++ amplifies Wikipedia data from 8.4B to 80B tokens by creating cross-document QA from hyperlink motifs, yielding better SimpleQA performance and scaling for 7B and 32B OLMo models than single-document methods.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 83 · internal anchor

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.