{"total":16,"items":[{"citing_arxiv_id":"2606.16364","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-15T07:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Attention analysis shows that LLM tool selection failures occur at the readout/decision stage, not because the model fails to attend to the correct tool definition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12191","ref_index":268,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application","primary_cat":"cs.CL","submitted_at":"2026-06-10T15:15:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00135","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On Effectiveness and Efficiency of Agentic Tool-calling and RL Training","primary_cat":"cs.LG","submitted_at":"2026-05-28T22:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL 
training with wall-clock speedup and no performance degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20876","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Terminal-World: Scaling Terminal-Agent Environments via Agent Skills","primary_cat":"cs.CL","submitted_at":"2026-05-20T08:14:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17558","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs","primary_cat":"cs.SE","submitted_at":"2026-05-17T17:38:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18805","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents","primary_cat":"cs.IR","submitted_at":"2026-05-11T18:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RecoAtlas is a benchmark that evaluates LLM 
recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10832","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents","primary_cat":"cs.CL","submitted_at":"2026-05-11T16:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes image-bank harness and ODE closed-loop data generation to boost multimodal deep search agents, reporting average score gains from 24.9% to 39.0% on 8 benchmarks for 8B model and 30.6% to 41.5% for 30B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03476","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification","primary_cat":"cs.CL","submitted_at":"2026-05-05T08:05:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"for precise reasoning, and they lack systematic detection mechanisms for hallucination in generated content. Methodologically, CuraView's dual-agent design combines two emerging directions. 
First, adversarial sample generation continually constructs challenging errors, improving detector robustness against diverse error types. Second, the LLM-as-a-judge paradigm [42][43] suggests that evidence-centered structured evaluation is better suited to high-risk quality control than a single open-ended response. The separation between CuraView's generation agent and detection agent is therefore not merely an engineering decomposition, but a principled design supporting two complementary objectives: producing hard cases and performing evidence-driven verification."},{"citing_arxiv_id":"2605.02572","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length","primary_cat":"cs.AI","submitted_at":"2026-05-04T13:25:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18292","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The berkeley function calling leaderboard (BFCL): From
tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=2GmDdhBdDk. [74] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249, 2025. [75] Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, and Caiming Xiong. Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay, 2025. URL https://arxiv.org/abs/2504."},{"citing_arxiv_id":"2604.09813","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-10T18:38:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12521","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues","primary_cat":"cs.CL","submitted_at":"2026-04-03T17:02:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToolWeave synthesizes realistic multi-turn tool-calling dialogues via dependent workflows and parameter provenance tracking, yielding LLMs that score higher on
benchmarks such as 39.75% on BFCL-V3 multi-turn versus 23.50% on prior SOTA data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.24709","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards","primary_cat":"cs.LG","submitted_at":"2026-03-25T18:31:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21686","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework","primary_cat":"cs.CL","submitted_at":"2025-11-26T18:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.07407","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems","primary_cat":"cs.AI","submitted_at":"2025-08-10T16:07:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A comprehensive review of 
self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.07982","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment","primary_cat":"cs.AI","submitted_at":"2025-06-09T17:52:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"an evaluation pipeline to programmatically build synthetic test suites from structured policy graphs that encode domain rules and their co-occurrence statistics. IntellAgent explicitly uses τ-bench as an external gold standard, reporting a high Spearman correlation between the two score distributions, and acts as a fast, synthetic proxy task. APIGen-MT [19] explores the idea of fine-tuning tool-calling agents for τ-bench. They generate data by creating conversation blueprints which are sequences of tool calls that depend on each other, followed by simulating conversational traces based on each blueprint. ToolSandbox [14] focuses on creating stateful tools in order to evaluate agent progress in a more fine-grained manner."}],"limit":50,"offset":0}