{"total":11,"items":[{"citing_arxiv_id":"2606.11926","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Generalist Autonomous Research via Hypothesis-Tree Refinement","primary_cat":"cs.CL","submitted_at":"2026-06-10T10:57:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01314","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-31T16:01:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillSmith introduces a synergy-aware skill-tool co-evolution framework with atomic bundles, Lotka-Volterra-inspired interaction modeling, and anti-pattern recording that outperforms baselines on complex tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16217","ref_index":39,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Argus: Evidence Assembly for Scalable Deep Research Agents","primary_cat":"cs.CL","submitted_at":"2026-05-15T17:29:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00136","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-04-30T18:46:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Tool-augmented LLM reasoning incurs a protocol-induced performance tax that can exceed tool benefits under semantic noise, partially mitigated by a lightweight gate called G-STEP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14362","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI","primary_cat":"cs.CL","submitted_at":"2026-04-15T19:25:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"APEX-MEM uses property graphs with temporal events, append-only storage, and an agentic retrieval system to reach 88.88% accuracy on LOCOMO QA and 86.2% on LongMemEval, outperforming prior session-aware methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07655","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-08T23:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"concerns, harmless with honesty concerns, harmful (toxicity), and harmful (jailbreak). agent conversations as temporal graphs to arrest hallucination propagation [94]. Silent Guardian embeds adversarial tokens that cause compliant models to halt generation, achieving near-100% refusal rates [89], while Bergeron deploys a secondary \"conscience\" LLM to monitor a primary model and multiplies attack resistance seven-fold [59]. Meta's open-source Prompt Guard toolkit enables rule-based prompt filtering and evaluation pipelines for production systems [51]. A data-free methodology trains off-topic detectors without real user logs, thereby easing the deployment of guardrails before launch [9]. In robotics, RoboGuard fuses temporal-logic synthesis with an LLM \"root-of-trust\" to keep physical agents safe under jailbreak attacks [62]."},{"citing_arxiv_id":"2604.04651","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents","primary_cat":"cs.AI","submitted_at":"2026-04-06T13:00:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A fine-tuning policy trains small language models to search reliably and use evidence, improving multi-hop QA performance by 15-17 points to reach large-model levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.02766","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvoSkill: Automated Skill Discovery for Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-03-03T09:07:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EvoSkill evolves agent skills via failure analysis and Pareto frontier selection, raising exact-match accuracy 7.3% on OfficeQA and 12.1% on SealQA with 5.3% zero-shot transfer to BrowseComp.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02276","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kimi K2.5: Visual Agentic Intelligence","primary_cat":"cs.CL","submitted_at":"2026-02-02T16:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"•Agentic Capabilities: BrowseComp [68], WideSearch [69],DeepSearchQA [60], FinSearchComp (T2&T3) [26], Seal-0 [45], GDPVal [43]. •Image Understanding:(math & reasoning)MMMU-Pro [75], MMMU (val) [76], CharXiv (RQ) [67], Math- Vision [61] and MathVista (mini) [36];(vision knowledge)SimpleVQA [13] and WorldVQA 2;(perception) ZeroBench (w/ and w/o tools) [48], BabyVision [12], BLINK [18] and MMVP [57];(OCR & document)OCR- Bench [35], OmniDocBench 1.5 [42] and InfoVQA [38]. •Video Understanding: VideoMMMU [25], MMVU [79], MotionBench [24], Video-MME [17] (with subtitles), LongVideoBench [70], and LVBench [62]. •Computer Use: OSWorld-Verified [72, 73], and WebArena [80]. BaselinesWe benchmark against state-of-the-art proprietary and open-source models."},{"citing_arxiv_id":"2601.08605","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ExpSeek: Self-Triggered Experience Seeking for Web Agents","primary_cat":"cs.CL","submitted_at":"2026-01-13T14:48:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExpSeek shifts web agents to self-triggered step-level experience seeking via entropy thresholds, delivering 9.3% and 7.5% absolute gains on Qwen3-8B and 32B models across four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.11793","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling","primary_cat":"cs.CL","submitted_at":"2025-11-14T18:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}