BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Pith reviewed 2026-05-12 07:39 UTC · model grok-4.3
The pith
BrowseComp offers 1,266 short-answer questions to test agents' persistence and creativity while browsing the web for hard-to-find information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information.
What carries the argument
The BrowseComp dataset of 1,266 questions, each engineered to demand repeated web navigation to assemble entangled facts into short verifiable answers.
If this is right
- Agents that perform well on BrowseComp demonstrate stronger ability to sustain search effort across multiple steps.
- The benchmark supplies a standardized, automatically scorable test that can track iterative improvements in browsing agents.
- Success on BrowseComp indicates progress on locating information that is distributed across pages rather than available in a single search.
- The dataset can serve as a training signal for agents by rewarding sequences of navigation actions that reach the reference answer.
Where Pith is reading between the lines
- If the benchmark proves predictive, researchers could use it to prioritize agent architectures that maintain long search chains over those optimized only for single-step retrieval.
- Extending the questions with time-to-answer metrics would let developers measure not just accuracy but also the efficiency of persistence.
- The approach could generalize to other domains, such as scientific literature search, where facts are similarly scattered.
Load-bearing premise
Short, easily verifiable answers are enough to measure the persistence and creativity that matter for real browsing.
What would settle it
An experiment showing that agents scoring high on BrowseComp still fail to locate comparable information when the questions are rephrased into open-ended or ambiguous real-world tasks.
read the original abstract
We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents BrowseComp, a benchmark of 1,266 questions intended to evaluate web-browsing agents on their ability to persistently navigate the internet in search of hard-to-find, entangled information. Answers are short and easily verifiable against references, and the benchmark is explicitly framed as an incomplete but useful proxy (analogous to programming competitions) that isolates the core capabilities of persistence and creativity while deliberately avoiding long-form output and ambiguity.
Significance. If the questions are shown to be well-constructed and to genuinely require the claimed navigation behaviors, BrowseComp could become a practical, reproducible standard for measuring a key agent capability. The public GitHub release supports immediate use and extension by the community.
major comments (2)
- [Abstract] Abstract: the central claim that the 1,266 questions 'require persistently navigating the internet in search of hard-to-find, entangled information' and thereby measure persistence and creativity is unsupported by any description of question sourcing, validation, difficulty calibration, or inter-rater agreement.
- [Abstract] The design decision to use only short, verifiable answers is presented as sufficient to isolate the target capability, yet no evidence or analysis is supplied showing that this format actually elicits (rather than bypasses) the persistence and creativity the benchmark claims to measure.
minor comments (1)
- [Abstract] The GitHub link is given, but the paper would benefit from one or two concrete question examples to illustrate the intended difficulty and verification process.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback on our manuscript. We agree that the abstract would benefit from additional supporting details and have revised the paper accordingly to strengthen the presentation of the benchmark.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the 1,266 questions 'require persistently navigating the internet in search of hard-to-find, entangled information' and thereby measure persistence and creativity is unsupported by any description of question sourcing, validation, difficulty calibration, or inter-rater agreement.
Authors: We agree that the abstract lacks explicit details on these aspects. The full manuscript contains a Benchmark Construction section that describes sourcing questions from publicly available web content requiring multi-page navigation to resolve entangled facts, followed by manual verification of each answer against the source material and iterative difficulty calibration via pilot runs with baseline agents. We will revise the abstract to briefly reference this process and expand the main text with additional specifics on validation. Inter-rater agreement is not applicable in the conventional sense because each question has a single, objectively verifiable short answer; we will add a clarifying note on this point. revision: yes
-
Referee: [Abstract] The design decision to use only short, verifiable answers is presented as sufficient to isolate the target capability, yet no evidence or analysis is supplied showing that this format actually elicits (rather than bypasses) the persistence and creativity the benchmark claims to measure.
Authors: The short-answer format is chosen precisely to isolate persistence and information-seeking from the separate challenges of long-form generation and ambiguity, consistent with the programming-competition analogy stated in the paper. The manuscript already includes example questions (in the main text and appendix) that illustrate the need for multi-step browsing. To provide stronger evidence, we will add a new analysis subsection reporting quantitative metrics on agent behavior, such as the distribution of page visits and tool calls for solved versus unsolved questions, demonstrating that high performance correlates with persistent navigation rather than shortcuts. revision: yes
Circularity Check
No circularity: benchmark definition with no derivations or self-referential reductions
full rationale
The paper introduces BrowseComp as a collection of 1,266 questions without any equations, fitted parameters, predictions, or derivation chain. It directly defines the benchmark, notes its analogy to programming competitions as an illustrative framing, and states its scope limitations explicitly. No self-citations, ansatzes, or uniqueness claims reduce any result to its own inputs by construction. The central claim—that short verifiable answers isolate persistence and creativity—is presented as a design choice rather than a derived theorem, making the work self-contained by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The selected questions require persistent navigation and creativity to solve.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/LogicAsFunctionalEquation.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information... measures the important core capability of exercising persistence and creativity in finding information.
-
IndisputableMonolith/Foundation/DAlembert/Inevitability.leanbilinear_family_forced unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents.
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dom...
-
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
-
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
A new framework parses and evaluates citations in LLM deep research reports across link validity, relevance, and factuality, finding 94%+ link success but only 39-77% factual accuracy.
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization
PIIGuard uses optimized hidden HTML fragments on webpages to block LLMs from leaking contact PII via indirect prompt injection, achieving at least 97% defense success across tested models while preserving benign QA utility.
-
GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
A new workflow for multilingual agent benchmark adaptation using functional, cultural, and difficulty alignments improves non-English agent success rates by up to 32.7% over simple machine translation, indicating subs...
-
WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through m...
-
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...
-
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
-
Evaluating the Search Agent in a Parallel World
Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping ...
-
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.
-
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.
-
MemEvolve: Meta-Evolution of Agent Memory Systems
MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.
-
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
WebMall is the first offline multi-shop benchmark for evaluating LLM web agents on complex comparison shopping tasks across heterogeneous product data from multiple simulated e-shops.
-
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.
-
Design and Report Benchmarks for Knowledge Work
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
-
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
-
MARCA: A Checklist-Based Benchmark for Multilingual Web Search
MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
-
Towards Long-horizon Agentic Multimodal Search
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.
-
LightThinker++: From Reasoning Compression to Memory Management
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
-
Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence
A tuned Bioptic Agent achieves 79.7% F1 on a new multilingual benchmark for global drug asset scouting, outperforming Gemini, Claude, GPT, and other models.
-
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
-
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
-
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
AgencyBench is a new benchmark with 138 tasks in 32 scenarios that measures autonomous agent performance on extended real-world problems using simulated feedback and sandboxed assessment.
-
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
-
When Should Users Check? Modeling Confirmation Frequency inMulti-Step Agentic AI Tasks
A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versu...
-
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
-
WebSailor: Navigating Super-human Reasoning for Web Agent
WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
Real-Time Execution of Action Chunking Flow Policies
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
-
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
QuestBench is a student-created set of 256 expert-level questions that exposes low performance (16.85% mean pass rate) in current AI deep research systems while serving as a classroom method for accountable AI education.
-
Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...
-
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
-
Mind DeepResearch Technical Report
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
-
AlphaEval: Evaluating Agents in Production
AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.
-
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...
-
Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.