Zixin Ding, Junyuan Hong, Zhan Shi, Jiachen T

URLhttps://arxiv · 2025 · arXiv 2508.06600

13 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.

In-Context Credit Assignment via the Core

cs.GT · 2026-05-07 · unverdicted · novelty 7.0

Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

cs.IR · 2026-05-05 · conditional · novelty 7.0

EnterpriseRAG-Bench supplies a synthetic corpus of 500,000 documents across Slack, Gmail, GitHub and other tools plus 500 questions that probe lookup, multi-document reasoning, conflict resolution and absence detection.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,

Revisiting DAgger in the Era of LLM-Agents

cs.LG · 2026-05-13 · conditional · novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

MARCA: A Checklist-Based Benchmark for Multilingual Web Search

cs.CL · 2026-04-15 · accept · novelty 6.0

MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

Towards Long-horizon Agentic Multimodal Search

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.

Towards Knowledgeable Deep Research: Framework and Benchmark

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene

Learning to Retrieve from Agent Trajectories

cs.IR · 2026-03-30 · conditional · novelty 6.0

Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

cs.LG · 2026-03-26 · unverdicted · novelty 6.0

A category theory framework evaluates deep research agents on structural skills and shows frontier systems reach only 19.9% accuracy on a new 296-question bilingual benchmark, with theory-guided interventions improving performance.

citing papers explorer

Showing 13 of 13 citing papers.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · none · ref 17

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

cs.CL · 2026-05-10 · unverdicted · none · ref 42

In-Context Credit Assignment via the Core

cs.GT · 2026-05-07 · unverdicted · none · ref 5

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

cs.IR · 2026-05-05 · conditional · none · ref 3

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

cs.CL · 2026-05-05 · unverdicted · none · ref 6

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · none · ref 11

Revisiting DAgger in the Era of LLM-Agents

cs.LG · 2026-05-13 · conditional · none · ref 9

MARCA: A Checklist-Based Benchmark for Multilingual Web Search

cs.CL · 2026-04-15 · accept · none · ref 6

Towards Long-horizon Agentic Multimodal Search

cs.CV · 2026-04-14 · unverdicted · none · ref 39

Towards Knowledgeable Deep Research: Framework and Benchmark

cs.AI · 2026-04-09 · unverdicted · none · ref 6

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · none · ref 4

Learning to Retrieve from Agent Trajectories

cs.IR · 2026-03-30 · conditional · none · ref 1

From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

cs.LG · 2026-03-26 · unverdicted · none · ref 1
