BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenh · 2025 · arXiv 2508.06600

Canonical reference. 80% of citing Pith papers cite this work as background.

26 Pith papers citing it

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

CORTEX uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

AgentBeats implements agentified evaluation of diverse AI agents through standardized interfaces, validated at scale in a five-month competition with 298 judges and 467 subjects plus a coding case study.

Towards Retrieving Interaction Spaces for Agentic Search

cs.IR · 2026-06-05 · unverdicted · novelty 7.0

RISE uses BM25 to bound interaction spaces for agentic search and pre-processes documents for shell navigation, matching direct corpus interaction accuracy at roughly one-quarter the cost on BrowseComp-Plus.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.

In-Context Credit Assignment via the Core

cs.GT · 2026-05-07 · unverdicted · novelty 7.0

Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,

When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems

cs.IR · 2026-07-01 · unverdicted · novelty 6.0 · 2 refs

PlanRAG models natural language exploratory reasoning problems as logical query trees, optimizes them via dynamic programming with a multi-dimensional cost model, and executes iterative retrieval-generation over the trees to outperform prior RAG methods on a new dataset.

ECHO: Prune to act, trace to learn with selective turn memory in agentic RL

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.

The Illusion of Multi-Agent Advantage

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

Automatically generated multi-agent systems underperform CoT-SC on benchmarks and a new diagnostic dataset, exposing architectural bloat that fails to deliver functional utility.

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

IntentKV prunes KV cache using cross-turn intent memory and attention scoring, achieving up to 77.8% reduction in worst-case peak tokens and 92.6% in KV reads at 8k budget with negligible accuracy drop on Qwen models.

Natural Language Query to Configuration for Retrieval Agents

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.

Revisiting DAgger in the Era of LLM-Agents

cs.LG · 2026-05-13 · conditional · novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

cs.IR · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

EnterpriseRAG-Bench supplies a synthetic corpus of 500k documents across Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira and Confluence together with 500 questions spanning single-document lookup to conflict resolution and missing-information detection.

MARCA: A Checklist-Based Benchmark for Multilingual Web Search

cs.CL · 2026-04-15 · accept · novelty 6.0

MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

Towards Long-horizon Agentic Multimodal Search

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.

Towards Knowledgeable Deep Research: Framework and Benchmark

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene

Learning to Retrieve from Agent Trajectories

cs.IR · 2026-03-30 · conditional · novelty 6.0

Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

cs.LG · 2026-03-26 · unverdicted · novelty 6.0

A category theory framework evaluates deep research agents on structural skills and shows frontier systems reach only 19.9% accuracy on a new 296-question bilingual benchmark, with theory-guided interventions improving performance.

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

cs.AI · 2026-06-09 · unverdicted · novelty 5.0

ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.

citing papers explorer

Towards Retrieving Interaction Spaces for Agentic Search · cs.IR · 2026-06-05 · unverdicted · none · ref 2
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability · cs.IR · 2026-04-17 · unverdicted · none · ref 11
When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems · cs.IR · 2026-07-01 · unverdicted · none · ref 4 · 2 links
EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge · cs.IR · 2026-05-05 · unverdicted · none · ref 3 · 2 links
Learning to Retrieve from Agent Trajectories · cs.IR · 2026-03-30 · conditional · none · ref 1
