{"total":21,"items":[{"citing_arxiv_id":"2605.18133","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments","primary_cat":"cs.CR","submitted_at":"2026-05-18T09:38:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15184","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is Grep All You Need? How Agent Harnesses Reshape Agentic Search","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:58:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09544","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-10T13:56:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test 
sets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09038","ref_index":24,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks","primary_cat":"cs.AI","submitted_at":"2026-05-09T16:23:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the effect of adding retrieved evidence without explicit skill selection, query planning, or grounded stopping. Search-o1 [17], Search-R1 [11], and ZeroSearch [27] represent recent search-native agents for multi-turn retrieval and search-oriented post-training. This suite separates gains from reasoning, retrieval access, and skill-conditioned search control. 4.3 Experimental setup Backbones. We study Qwen2.5-7B and Qwen2.5-3B [24], and report results for base and instruct variants whenever the corresponding baseline is available. This lets us test whether SearchSkill remains effective across different model scales and alignment levels, rather than only on a single strong instruction-tuned checkpoint. 
Training data and trajectories. Following Section 3, we build trajectories from HotpotQA, 2Wiki-"},{"citing_arxiv_id":"2605.07675","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FactoryBench: Evaluating Industrial Machine Understanding","primary_cat":"cs.AI","submitted_at":"2026-05-08T12:47:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023. [30] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. [31] Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zhiyuan Zeng, Yujia Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. ACM Computing Surveys, 2024. arXiv:2304.08354. [32] Bryan Lim, Sercan Ö. Arık, Nicolas Loeff, and Tomas Pfister. 
Temporal fusion transformers for interpretable multi-horizon time series forecasting."},{"citing_arxiv_id":"2605.00060","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data","primary_cat":"cs.AI","submitted_at":"2026-04-30T03:19:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.13958","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ToolRL: Reward is All Tool Learning Needs","primary_cat":"cs.LG","submitted_at":"2025-04-16T21:45:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.23218","ref_index":99,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","primary_cat":"cs.CL","submitted_at":"2024-10-30T17:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web 
platforms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.00557","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Ask: When LLM Agents Meet Unclear Instruction","primary_cat":"cs.CL","submitted_at":"2024-08-31T23:06:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.07960","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments","primary_cat":"cs.HC","submitted_at":"2024-05-13T17:38:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.13501","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on the Memory Mechanism of Large Language Model based Agents","primary_cat":"cs.AI","submitted_at":"2024-04-21T01:49:46+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based 
agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[10] and Liu et al. [11] present surveys on the alignment of LLMs, which is a key requirement for LLMs to produce outputs consistent with human values. Gao et al. [12] propose a survey on the retrieval-augmented generation (RAG) capability of LLMs, which is key to providing LLMs with factual and up-to-date knowledge and removing hallucinations. Qin et al. [18] summarize the state-of-the-art methods on enabling LLMs to leverage external tools, which is fundamental for LLMs to expand their capability in domains that require specialized knowledge. Wang et al. [13], Yao et al. [14], Wang et al. [15], Feng et al. [16] and Zhang et al. [17] present surveys on the direction of LLM knowledge editing, which is important for customizing LLMs to"},{"citing_arxiv_id":"2403.17297","ref_index":155,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternLM2 Technical Report","primary_cat":"cs.CL","submitted_at":"2024-03-26T00:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.02716","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Understanding the planning of LLM agents: A survey","primary_cat":"cs.AI","submitted_at":"2024-02-05T04:25:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, 
and memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10774","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","primary_cat":"cs.LG","submitted_at":"2024-01-19T15:48:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05459","ref_index":241,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","primary_cat":"cs.HC","submitted_at":"2024-01-10T09:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Their feedback and interventions are used to iteratively improve the model's performance and align it with desired standards. Self-Reflection. It has been shown that language models can provide probabilities of providing correct answers [416]. Inspired by the autonomous operation of LLMs, researchers have suggested leveraging the model's self-reflection to mitigate the problem of incorrect content generation. Huang et al. [241] and Madaan et al. [417] show that LLMs are capable of self-improving with unlabeled data, Shinn et al. 
[418] propose Reflexion to let LLMs update through its linguistic feedback. Chen et al. [419] propose Self-Debug to iteratively improve the responses on several code generation tasks. SelfCheckGPT [420] allows large models to provide answers to the same input question multiple"},{"citing_arxiv_id":"2309.07864","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Rise and Potential of Large Language Model Based Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2023-09-14T17:12:03+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"primary component of brain or controller of these agents and expand their perceptual and action space through strategies such as multimodal perception and tool utilization [90; 91; 92; 93; 94]. These LLM-based agents can exhibit reasoning and planning abilities comparable to symbolic agents through techniques like Chain-of-Thought (CoT) and problem decomposition [95; 96; 97; 98; 99; 100; 101]. They can also acquire interactive capabilities with the environment, akin to reactive agents, by learning from feedback and performing new actions [102; 103; 104]. 
Similarly, large language models undergo pre-training on large-scale corpora and demonstrate the capacity for few-shot and zero-shot generalization, allowing for seamless transfer between tasks without the need to update"},{"citing_arxiv_id":"2309.01219","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-03T16:56:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.11432","ref_index":151,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Large Language Model based Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2023-08-22T13:30:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"creating customized LLM-based agent simulations efficiently. GPT Researcher [150] is an experimental application that leverages LLMs to efficiently develop research questions, trigger web crawls to gather information, summarize sources, and aggregate summaries. BMTools [151] provides a platform for community-driven tool building and sharing. 
It supports various types of tools, enables simultaneous task execution using multiple tools, and offers a simple interface for loading plugins via URLs, fostering easy development and contribution to the BMTools ecosystem. Remark. Utilization of LLM-based agents in supporting above applications may also entail risks and"},{"citing_arxiv_id":"2306.06070","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mind2Web: Towards a Generalist Agent for the Web","primary_cat":"cs.CL","submitted_at":"2023-06-09T17:44:31+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.18323","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-23T00:16:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.15010","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","primary_cat":"cs.CV","submitted_at":"2023-04-28T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaMA-Adapter 
V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}