hub Canonical reference

Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent

Browsecomp-plus: A more fair, transparent evaluation benchmark of deep-research agent , author= · 2025 · arXiv 2508.06600

Canonical reference. 80% of citing Pith papers cite this work as background.

24 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 24 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 1

citation-polarity summary

background 4 use method 1

representative citing papers

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Towards Retrieving Interaction Spaces for Agentic Search

cs.IR · 2026-06-05 · unverdicted · novelty 7.0

RISE uses BM25 to bound interaction spaces for agentic search and pre-processes documents for shell navigation, matching direct corpus interaction accuracy at roughly one-quarter the cost on BrowseComp-Plus.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.

In-Context Credit Assignment via the Core

cs.GT · 2026-05-07 · unverdicted · novelty 7.0

Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,

ECHO: Prune to act, trace to learn with selective turn memory in agentic RL

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

IntentKV prunes KV cache using cross-turn intent memory and attention scoring, achieving up to 77.8% reduction in worst-case peak tokens and 92.6% in KV reads at 8k budget with negligible accuracy drop on Qwen models.

Natural Language Query to Configuration for Retrieval Agents

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.

Revisiting DAgger in the Era of LLM-Agents

cs.LG · 2026-05-13 · conditional · novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

cs.IR · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

EnterpriseRAG-Bench supplies a synthetic corpus of 500k documents across Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira and Confluence together with 500 questions spanning single-document lookup to conflict resolution and missing-information detection.

MARCA: A Checklist-Based Benchmark for Multilingual Web Search

cs.CL · 2026-04-15 · accept · novelty 6.0

MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

Towards Long-horizon Agentic Multimodal Search

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.

Towards Knowledgeable Deep Research: Framework and Benchmark

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene

Learning to Retrieve from Agent Trajectories

cs.IR · 2026-03-30 · conditional · novelty 6.0

Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

cs.LG · 2026-03-26 · unverdicted · novelty 6.0

A category theory framework evaluates deep research agents on structural skills and shows frontier systems reach only 19.9% accuracy on a new 296-question bilingual benchmark, with theory-guided interventions improving performance.

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

cs.AI · 2026-06-09 · unverdicted · novelty 5.0

ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

AgentCL constructs controlled task streams with intentional reusability and introduces MemProbe to evaluate non-parametric memory designs for continual learning in language agents across coding, research, and reasoning tasks.

MeMo: Memory as a Model

cs.CL · 2026-05-14 · unverdicted · novelty 5.0 · 2 refs

MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.

When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems

cs.IR · 2026-07-01

citing papers explorer

Showing 24 of 24 citing papers.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery cs.AI · 2026-04-28 · accept · none · ref 17
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 73
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
Towards Retrieving Interaction Spaces for Agentic Search cs.IR · 2026-06-05 · unverdicted · none · ref 2
RISE uses BM25 to bound interaction spaces for agentic search and pre-processes documents for shell navigation, matching direct corpus interaction accuracy at roughly one-quarter the cost on BrowseComp-Plus.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 34
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems cs.CL · 2026-05-10 · unverdicted · none · ref 42
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.
In-Context Credit Assignment via the Core cs.GT · 2026-05-07 · unverdicted · none · ref 5
Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems cs.CL · 2026-05-05 · unverdicted · none · ref 6
BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability cs.IR · 2026-04-17 · unverdicted · none · ref 11
LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,
ECHO: Prune to act, trace to learn with selective turn memory in agentic RL cs.LG · 2026-06-30 · unverdicted · none · ref 1
ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.
IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference cs.LG · 2026-06-06 · unverdicted · none · ref 18
IntentKV prunes KV cache using cross-turn intent memory and attention scoring, achieving up to 77.8% reduction in worst-case peak tokens and 92.6% in KV reads at 8k budget with negligible accuracy drop on Qwen models.
Natural Language Query to Configuration for Retrieval Agents cs.AI · 2026-05-26 · unverdicted · none · ref 7
BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 10
PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.
Revisiting DAgger in the Era of LLM-Agents cs.LG · 2026-05-13 · conditional · none · ref 9
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge cs.IR · 2026-05-05 · unverdicted · none · ref 3 · 2 links
EnterpriseRAG-Bench supplies a synthetic corpus of 500k documents across Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira and Confluence together with 500 questions spanning single-document lookup to conflict resolution and missing-information detection.
MARCA: A Checklist-Based Benchmark for Multilingual Web Search cs.CL · 2026-04-15 · accept · none · ref 6
MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
Towards Long-horizon Agentic Multimodal Search cs.CV · 2026-04-14 · unverdicted · none · ref 39
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.
Towards Knowledgeable Deep Research: Framework and Benchmark cs.AI · 2026-04-09 · unverdicted · none · ref 6
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
Reflective Context Learning: Studying the Optimization Primitives of Context Space cs.LG · 2026-04-03 · unverdicted · none · ref 4
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene
Learning to Retrieve from Agent Trajectories cs.IR · 2026-03-30 · conditional · none · ref 1
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents cs.LG · 2026-03-26 · unverdicted · none · ref 1
A category theory framework evaluates deep research agents on structural skills and shows frontier systems reach only 19.9% accuracy on a new 296-question bilingual benchmark, with theory-guided interventions improving performance.
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning cs.AI · 2026-06-09 · unverdicted · none · ref 22
ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.
AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents cs.AI · 2026-06-01 · unverdicted · none · ref 4
AgentCL constructs controlled task streams with intentional reusability and introduces MemProbe to evaluate non-parametric memory designs for continual learning in language agents across coding, research, and reasoning tasks.
MeMo: Memory as a Model cs.CL · 2026-05-14 · unverdicted · none · ref 61 · 2 links
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems cs.IR · 2026-07-01 · unreviewed · ref 4

Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer