{"total":15,"items":[{"citing_arxiv_id":"2605.19932","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-19T14:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15156","ref_index":61,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MeMo: Memory as a Model","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12913","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Revisiting DAgger in the Era of LLM-Agents","primary_cat":"cs.LG","submitted_at":"2026-05-13T02:40:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09539","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-10T13:52:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06920","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Context Credit Assignment via the Core","primary_cat":"cs.GT","submitted_at":"2026-05-07T20:30:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05253","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge","primary_cat":"cs.IR","submitted_at":"2026-05-05T20:23:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EnterpriseRAG-Bench supplies a synthetic corpus of 500k documents across Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira and Confluence together with 500 questions 
spanning single-document lookup to conflict resolution and missing-information detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04018","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems","primary_cat":"cs.CL","submitted_at":"2026-05-05T17:42:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For normalization, we compute IDCG_α@k on the gold pool using the same gain definition and a greedy maximization over k positions, yielding α-nDCG@k = DCG_α@k / IDCG_α@k. (5) Weighted Aspect Recall. To directly capture aspect coverage, we report a weighted aspect recall that credits each aspect once it has been covered at least once: A-Recall@k = Σ_{j=1}^{m} w_j · 1{C_j(k) ≥ 1}. (6) Recall@k and NDCG@k. As complementary metrics that ignore aspect structure, we additionally report the standard Recall@k (fraction of gold passages within the top k) and NDCG@k (with binary relevance rel_r). D Reference Answer Validation We use GPT-5 with a high reasoning effort setting to generate one citation-grounded reference answer per query. 
The model is given the human-annotated reasoning aspects together with the full content of the positive passages"},{"citing_arxiv_id":"2604.25256","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery","primary_cat":"cs.AI","submitted_at":"2026-04-28T06:05:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023. [16] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025. [17] Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent. arXiv preprint arXiv:2508.06600, 2025. 
[18] Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu"},{"citing_arxiv_id":"2604.16576","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability","primary_cat":"cs.IR","submitted_at":"2026-04-17T13:02:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"design of more reliable retrievers. Specifically, our study is guided by the following three research questions: RQ1: How generalizable are LLM-based dense retrievers across diverse conditions? We address this by evaluating SOTA open-source LLM-based retrievers on a comprehensive suite of 30 datasets spanning 4 benchmarks: MS MARCO [3], BEIR [66], BRIGHT [62], and BrowseComp-Plus [11]. To enable fine-grained analysis, we categorize these datasets according to 11 task types, 8 query types, and 5 corpus source types. Critically, standard aggregation metrics (e.g., macro-averaging) can be misleading due to significant variances in dataset size and query difficulty. 
To overcome this, we introduce linear mixed-effects models (LMMs) [23] to estimate marginal mean performance while controlling"},{"citing_arxiv_id":"2604.14448","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MARCA: A Checklist-Based Benchmark for Multilingual Web Search","primary_cat":"cs.CL","submitted_at":"2026-04-15T21:54:27+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12890","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Long-horizon Agentic Multimodal Search","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:40:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982-3992, 2019. [38] Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis. arXiv preprint arXiv:2505.16834, 2025. [39] Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. 
Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent. arXiv preprint arXiv:2508.06600, 2025. [40] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances"},{"citing_arxiv_id":"2604.07720","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Knowledgeable Deep Research: Framework and Benchmark","primary_cat":"cs.AI","submitted_at":"2026-04-09T02:06:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"such as Gemini [12] and Perplexity [1], where these capabilities are regarded as a hallmark of advanced agentic reasoning and tool proficiency [41]. In parallel, the open-source community strives to narrow the gap with proprietary models. Existing efforts generally fall into two categories: constructing robust multi-agent workflows to emulate closed-source systems [6, 15, 16], or employing agentic reinforcement learning [8, 36, 40] to train LLMs to master complex tool usage like information seeking and long-form writing [17, 18, 30, 42]. 
Nevertheless, most existing deep research agents primarily operate over unstructured web resources, with limited support for computation and reasoning over structured knowledge."},{"citing_arxiv_id":"2604.03189","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reflective Context Learning: Studying the Optimization Primitives of Context Space","primary_cat":"cs.LG","submitted_at":"2026-04-03T17:05:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04949","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Retrieve from Agent Trajectories","primary_cat":"cs.IR","submitted_at":"2026-03-30T17:59:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.25342","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents","primary_cat":"cs.LG","submitted_at":"2026-03-26T11:37:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A category theory framework evaluates deep 
research agents on structural skills and shows frontier systems reach only 19.9% accuracy on a new 296-question bilingual benchmark, with theory-guided interventions improving performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}