Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

Chishui Chen; Cong Qin; Jiaye Lin; Junxi Wang; Ke Zeng; Lu Pan; Te Sun; Yangen Hu; Yi Yang

arXiv: 2606.00510 · v1 · pith:ED43F72D · submitted 2026-05-30 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-28 19:08 UTC · model grok-4.3

keywords: selective skill invocation · preference learning · agentic tasks · skill-or-skip decision · dual-granularity · trajectory prefixes · ALFWorld · BFCL

The pith

SelSkill teaches agents to treat skill use as a skill-or-skip choice and learn the right decision from dual-granularity preferences on shared trajectory prefixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that relevant skills are often invoked at moments that add noise and break correct execution paths even when the skill itself is useful in principle. SelSkill addresses this by framing every potential invocation point as a binary decision and building preference data that holds the preceding trajectory fixed while varying only the invoke-or-skip choice. Episode-level preferences on final success are combined with step-level preferences on the immediate effect of the decision. The resulting policy is shown to raise both task completion and execution precision on two agent benchmarks and to transfer to new tasks without retraining.

Core claim

SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation.
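
The two data-construction steps named above (prioritizing uncertain decision points, then pairing invoke and skip continuations on a shared prefix) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes token entropy as the uncertainty measure (as the Figure 3 caption suggests), and the names `top_k_decision_points`, `PreferencePair`, and `build_pair`, along with the scalar rewards, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_decision_points(step_probs: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k steps with the highest entropy, most uncertain first."""
    scored = [(token_entropy(p), i) for i, p in enumerate(step_probs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


@dataclass
class PreferencePair:
    prefix: List[str]  # shared history, identical for both branches
    chosen: str        # preferred action at the branching point
    rejected: str      # dispreferred action at the branching point


def build_pair(trajectory: Sequence[str], t: int,
               invoke_action: str, skip_action: str,
               invoke_reward: float, skip_reward: float) -> Optional[PreferencePair]:
    """Branch at step t: keep trajectory[:t] fixed and vary only invoke vs. skip.

    Rewards are hypothetical scalar outcomes of each branch; ties carry no
    preference signal and yield None.
    """
    if invoke_reward == skip_reward:
        return None
    prefix = list(trajectory[:t])
    if invoke_reward > skip_reward:
        return PreferencePair(prefix, invoke_action, skip_action)
    return PreferencePair(prefix, skip_action, invoke_action)
```

Because both branches share `prefix` exactly, any preference the pair expresses can only be attributed to what happens from step `t` onward, which is the property the Load-bearing premise section examines.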

What carries the argument

Dual-granularity preference pairs that compare invoke versus skip actions on identical trajectory prefixes while jointly optimizing episode outcome and local invocation quality.
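
One plausible way to write the joint objective is sketched below. The first expression is the standard DPO loss; the second is an assumed combination, since this page does not reproduce the paper's own equation. The weight \lambda, the window length w, and the symbols q (task), \tau (complete trajectory), p (shared prefix), and c (post-branch continuation) are illustrative notation, not the authors'.

```latex
% Standard DPO loss for a context x with preferred y^{+} and dispreferred y^{-}:
\ell_{\mathrm{DPO}}(x, y^{+}, y^{-})
  = -\log \sigma\!\left(
      \beta \left[
        \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)}
        - \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)}
      \right]
    \right)

% Assumed dual-granularity combination: episode-level pairs of complete
% trajectories plus step-level invoke/skip pairs on a shared prefix p,
% with each continuation truncated to its first w tokens.
\mathcal{L}(\theta)
  = \mathbb{E}_{(q,\,\tau^{+},\,\tau^{-})}
      \left[ \ell_{\mathrm{DPO}}(q, \tau^{+}, \tau^{-}) \right]
  + \lambda\, \mathbb{E}_{(p,\,c^{+},\,c^{-})}
      \left[ \ell_{\mathrm{DPO}}(p, c^{+}_{1:w}, c^{-}_{1:w}) \right]
```

Under this reading, the first term carries the episode-level outcome signal and the second restricts credit to the tokens immediately following the invoke-or-skip choice.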

If this is right

Task success rises on ALFWorld (+10.9 percentage points with Qwen3-8B) and on BFCL (+5.7 percentage points) when agents learn to skip unhelpful invocations.
Execution precision increases by roughly 29 percentage points in both environments (29.1 on ALFWorld, 29.5 on BFCL).
Zero-shot results on Tau-bench and PopQA suggest that the learned invocation policy transfers to new domains with previously unseen skills.
Agents avoid injecting irrelevant context that would disrupt an otherwise correct execution process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prefix-controlled preference construction could be applied to other conditional actions in agent workflows, such as tool calls or memory writes.
Uncertainty-driven selection of decision points offers a practical way to focus preference data collection on moments that matter most.
Co-training the invocation policy together with the underlying skills themselves may further improve results once the skip decision is treated as first-class.

Load-bearing premise

That preference pairs built from shared trajectory prefixes isolate the causal effect of the invocation decision itself without confounding from later steps or trajectory quality.

What would settle it

An ablation that replaces shared-prefix pairs with pairs drawn from trajectories that already differ before the decision point and measures whether the reported gains on success and precision disappear.

Figures

Figures reproduced from arXiv: 2606.00510 by Chishui Chen, Cong Qin, Jiaye Lin, Junxi Wang, Ke Zeng, Lu Pan, Te Sun, Yangen Hu, Yi Yang.

**Figure 1.** Motivation for selective skill invocation. (a) A representative skill-or-skip case illustrates that a relevant skill may still be unnecessary for the current request. (b) Counterfactual analysis shows that beneficial effects of skill access are concentrated in only a small fraction of paired trajectories. (c) Episode-level feedback cannot directly identify the local contribution of each invocation.

**Figure 2.** The overview of SelSkill. We construct episode-level trajectory preferences and entropy-guided decision-point preferences, and jointly optimize the policy for selective skill invocation.

**Figure 3.** Token entropy around invoke/skip points. (x-axis: token position relative to skill call; y-axis: token entropy; series: shared prefix, skip path, skill path.) Excerpt from the surrounding paper text: "… outcome and efficiency, the comparison controls for the pre-branch trajectory and isolates the immediate skill-or-skip choice more directly than pairing independently sampled complete trajectories. To focus this signal on the local invoke/skip decision, we compute the DPO loss only within a local window after the branching point. Specifically, we apply a local los…"

**Figure 4.** Density of gradient peaks by token position relative to the skill call, for episode-level and step-level preferences. Per the surrounding paper text, episode-level preferences produce more dispersed gradient peaks, whereas step-level preferences concentrate them around the skill-call region; this suggests that episode-level preferences provide broad trajectory-level guidance […]
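
The Figure 3 excerpt notes that the DPO loss is computed only within a local window after the branching point. A minimal sketch of that idea follows, operating on per-token log-probabilities that are assumed to be already available from the policy and a frozen reference model. The function names, the window length, and the default `beta` are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def windowed_logratio(policy_lp: Sequence[float], ref_lp: Sequence[float],
                      branch: int, window: int) -> float:
    """Sum of (policy - reference) token log-probs over [branch, branch + window)."""
    end = min(branch + window, len(policy_lp))
    return sum(policy_lp[i] - ref_lp[i] for i in range(branch, end))


def local_dpo_loss(chosen_policy: Sequence[float], chosen_ref: Sequence[float],
                   rejected_policy: Sequence[float], rejected_ref: Sequence[float],
                   branch: int, window: int, beta: float = 0.1) -> float:
    """DPO loss restricted to a token window starting at the branching point.

    Tokens before `branch` (the shared prefix) and tokens past the window
    contribute nothing to the loss.
    """
    margin = (windowed_logratio(chosen_policy, chosen_ref, branch, window)
              - windowed_logratio(rejected_policy, rejected_ref, branch, window))
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference on every token in the window, the margin is zero and the loss is log 2, the usual DPO starting point; the loss falls below that only as the policy shifts probability toward the chosen branch inside the window.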

Original abstract

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SelSkill's dual-granularity DPO for skill-or-skip decisions produces clear benchmark lifts on ALFWorld and BFCL, but the shared-prefix pairs may not isolate the invocation effect cleanly.

The main point is that this paper trains agents to decide at each step whether to call a relevant skill or skip it, using a combination of episode-level outcome preferences and step-level invocation preferences. They pick candidate points with predictive uncertainty and build invoke/skip pairs from identical trajectory prefixes up to the decision. That framing directly targets the problem of unhelpful calls that add noise or derail execution.

The construction is straightforward and the reported results are concrete. On ALFWorld with Qwen3-8B they show +10.9 points task success and +29.1 points execution precision; on BFCL the gains are +5.7 and +29.5. The zero-shot transfer numbers on Tau-bench and PopQA add some evidence that the policy is not overfit to the training environments.

The soft spot is the causal claim behind the preference pairs. Shared prefixes control the history before the choice, but in these environments a skill call typically injects new observations or changes the action space for later steps. If those downstream differences correlate with the final reward, the learned preference could be driven by later trajectory quality rather than the local invoke/skip decision itself. The abstract gives no indication they checked whether post-decision continuations remain matched or that any divergence is orthogonal to the outcome signal.

This is useful for people already working on tool-using agents who need tighter control over when skills fire. The empirical gains are large enough that a practitioner could test the method on their own setup. It deserves a serious referee because the core idea is well-motivated, the experiments are reported with specific numbers, and the potential confound is fixable with additional analysis rather than fatal to the work.

Referee Report

1 major / 2 minor

Summary. The paper proposes SelSkill, a dual-granularity preference-learning framework for selective skill invocation in agentic tasks. It formulates invocation as a skill-or-skip decision, uses predictive uncertainty to prioritize points, constructs invoke-skip pairs from shared trajectory prefixes, and combines episode-level outcome preferences with step-level invocation preferences via DPO. Reported results include +10.9 pp task success and +29.1 pp execution precision on ALFWorld with Qwen3-8B, +5.7 pp and +29.5 pp on BFCL, plus zero-shot transfer to Tau-bench and PopQA.

Significance. If the shared-prefix construction isolates the causal effect of the invocation decision, the work addresses a practically important gap in agent skill use by learning when not to invoke. The scale of the reported gains on two benchmarks would be notable for agent reliability, but the absence of verification that downstream state changes do not confound the preference signal limits the strength of the central claim.

major comments (1)

[Method (preference pair construction)] The shared-prefix preference pair construction (described in the method) assumes the only systematic difference between invoke and skip continuations is the binary invocation choice. In ALFWorld and similar environments, however, a skill call injects new observations and updates the belief state, so post-decision trajectories are unlikely to remain matched. No check is reported that any divergence is orthogonal to the reward signal; therefore the learned policy may attribute downstream trajectory quality to the local decision rather than isolating the invocation effect. This assumption is load-bearing for the claim that dual-granularity DPO yields a clean skill-or-skip policy.

minor comments (2)

[Experiments] Abstract and results sections report point improvements but omit number of runs, standard deviations, statistical tests, and full baseline comparisons; these details are needed to assess whether the gains are robust.
[Method] Notation for the two granularity levels (episode vs. step) and how the combined loss is formed should be stated explicitly with an equation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment raises a valid methodological point about potential confounding in our preference pair construction. We address it below and outline planned revisions.

Referee: [Method (preference pair construction)] The shared-prefix preference pair construction (described in the method) assumes the only systematic difference between invoke and skip continuations is the binary invocation choice. In ALFWorld and similar environments, however, a skill call injects new observations and updates the belief state, so post-decision trajectories are unlikely to remain matched. No check is reported that any divergence is orthogonal to the reward signal; therefore the learned policy may attribute downstream trajectory quality to the local decision rather than isolating the invocation effect. This assumption is load-bearing for the claim that dual-granularity DPO yields a clean skill-or-skip policy.

Authors: We agree that the shared-prefix construction controls the state only up to the decision point and that skill invocation necessarily alters subsequent observations and belief state. The preference signal is derived from comparing final outcomes (episode-level) or immediate step rewards (invocation-level) after the choice, which is intended to attribute quality to the invocation decision itself. However, no explicit verification that post-decision divergences are orthogonal to the reward is reported in the current manuscript. To address this, we will add an analysis section in the revision that quantifies state divergence (e.g., via embedding similarity or observation overlap) between invoke and skip branches and checks its correlation with the preference label independent of the invocation choice. This will strengthen the isolation claim while preserving the dual-granularity DPO framework. revision: yes
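
The divergence analysis proposed in the rebuttal could look roughly like the sketch below. It is only an illustration under stated assumptions: observation overlap is measured with Jaccard distance over sets of observation strings, the preference label is coded as 0/1, and the association is summarized with a plain Pearson correlation. None of these choices comes from the manuscript, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Set


def jaccard_divergence(obs_a: Set[str], obs_b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means the two branches saw identical observations."""
    union = obs_a | obs_b
    if not union:
        return 0.0
    return 1.0 - len(obs_a & obs_b) / len(union)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0.0 or sy == 0.0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def divergence_label_correlation(invoke_obs: Sequence[Set[str]],
                                 skip_obs: Sequence[Set[str]],
                                 invoke_preferred: Sequence[int]) -> float:
    """Correlate per-pair branch divergence with the 0/1 preference label."""
    divergences = [jaccard_divergence(a, b) for a, b in zip(invoke_obs, skip_obs)]
    return pearson(divergences, [float(v) for v in invoke_preferred])
```

A correlation near zero would be consistent with post-decision divergence being unrelated to which branch is preferred. A large magnitude would not prove confounding on its own, but it would mark the pairs that merit closer inspection.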

Circularity Check

0 steps flagged

No circularity; empirical benchmark results independent of fitted inputs or self-referential definitions

The paper reports task success and execution precision gains on ALFWorld and BFCL benchmarks using a dual-granularity preference learning approach that constructs invoke-skip pairs from shared prefixes and combines episode- and step-level DPO. These are external empirical outcomes with no equations, predictions, or first-principles claims that reduce by construction to the paper's own fitted parameters, self-citations, or input definitions. The method is self-contained against held-out benchmarks and zero-shot transfer tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters or invented entities are described.

axioms (2)

Domain assumption: Predictive uncertainty can be used to prioritize decision points where skill invocation matters.
Used to select candidate points for preference pair construction.
Domain assumption: Preference pairs from shared trajectory prefixes isolate the local effect of invocation versus skip.
Central to creating controlled invoke-skip training data.

pith-pipeline@v0.9.1-grok · 5767 in / 1254 out tokens · 35482 ms · 2026-06-28T19:08:34.628950+00:00 · methodology


[10] [10]

Proceedings of the ACM on Web Conference 2025 , pages=

Tool learning in the wild: Empowering language models as automatic tool agents , author=. Proceedings of the ACM on Web Conference 2025 , pages=

2025

[11] [11]

arXiv preprint arXiv:2509.26490 , year=

Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications , author=. arXiv preprint arXiv:2509.26490 , year=

work page arXiv

[12] [12]

arXiv preprint arXiv:2508.04865 , year=

Agnostics: Learning to code in any programming language via reinforcement with a universal learning environment , author=. arXiv preprint arXiv:2508.04865 , year=

work page arXiv

[13] [13]

Advances in Neural Information Processing Systems , volume=

Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=

[14] [14]

Advances in neural information processing systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , volume=

[15] [15]

Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

Agent laboratory: Using llm agents as research assistants , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=. 2025 , publisher=

2025

[16] [16]

Andrew Zhao and Daniel Huang and Quentin Xu and Matthieu Lin and Yong. ExpeL:. Thirty-Eighth. 2024 , url =. doi:10.1609/AAAI.V38I17.29936 , timestamp =

work page doi:10.1609/aaai.v38i17.29936 2024

[17] [17]

Advances in neural information processing systems , volume=

Hindsight credit assignment , author=. Advances in neural information processing systems , volume=

[18] [18]

arXiv preprint arXiv:2511.12159 , year=

Criticsearch: Fine-grained credit assignment for search agents via a retrospective critic , author=. arXiv preprint arXiv:2511.12159 , year=

work page arXiv

[19] [19]

Proceedings of the AAAI conference on artificial intelligence , volume=

The option-critic architecture , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[20] [20]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[21] [21]

The twelfth international conference on learning representations , year=

Let's verify step by step , author=. The twelfth international conference on learning representations , year=

[22] [22]

arXiv preprint arXiv:2511.10395 , year=

Agentevolver: Towards efficient self-evolving agent system , author=. arXiv preprint arXiv:2511.10395 , year=

work page arXiv

[23] [23]

Machine learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine learning , volume=. 1992 , publisher=

1992

[24] [24]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , booktitle =

Mohit Shridhar and Xingdi Yuan and Marc. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , booktitle =. 2021 , url =

2021

[25] [25]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0: Building production-ready ai agents with scalable long-term memory , author=. arXiv preprint arXiv:2504.19413 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

SimpleMem: Efficient Lifelong Memory for LLM Agents

SimpleMem: Efficient Lifelong Memory for LLM Agents , author=. arXiv preprint arXiv:2601.02553 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[30] [30]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang and Delong Li and Haiyu Deng and Baihe Ma and Xu Wang and Qin Wang and Guangsheng Yu , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2602.20867 , eprinttype =. 2602.20867 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.20867 2026

[31] [31]

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang and Zhuoyun Yu and Xin Xie and Wuguannan Yao and Runnan Fang and Shuofei Qiao and Kexin Cao and Guozhou Zheng and Xiang Qi and Peng Zhang and Shumin Deng , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2604.04804 , eprinttype =. 2604.04804 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.04804 2026

[32] [32]

CoRR , volume =

Yanzhao Zheng and ZhenTao Zhang and Chao Ma and YuanQiang Yu and JiHuai Zhu and Yong Wu and Tianze Xu and Baohua Dong and Hangcheng Zhu and Ruohui Huang and Gang Yu , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.22455 , eprinttype =. 2603.22455 , timestamp =

work page doi:10.48550/arxiv.2603.22455 2026

[33] [33]

arXiv preprint arXiv:2603.04448 , year=

Yuan Liang and Ruobin Zhong and Haoming Xu and Chen Jiang and Yi Zhong and Runnan Fang and Jia. SkillNet: Create, Evaluate, and Connect. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.04448 , eprinttype =. 2603.04448 , timestamp =

work page doi:10.48550/arxiv.2603.04448 2026

[34] [34]

arXiv preprint arXiv:2601.21123 , year=

Cua-skill: Develop skills for computer using agent , author=. arXiv preprint arXiv:2601.21123 , year=

work page arXiv

[35] [35]

2026 , eprint=

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents , author=. 2026 , eprint=

2026

[36] [36]

Toolformer: Language Models Can Teach Themselves to Use Tools , booktitle =

Timo Schick and Jane Dwivedi. Toolformer: Language Models Can Teach Themselves to Use Tools , booktitle =. 2023 , url =

2023

[37] [37]

Patil and Tianjun Zhang and Xin Wang and Joseph E

Shishir G. Patil and Tianjun Zhang and Xin Wang and Joseph E. Gonzalez , editor =. Gorilla: Large Language Model Connected with Massive APIs , booktitle =. 2024 , url =

2024

[38] [38]

CoRR , volume =

Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and Cong Lu and Manling Li and Junxian He and Yee Whye Teh , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.00718 , eprinttype =. 2603.00718 , timestamp =

work page doi:10.48550/arxiv.2603.00718 2026

[39] [39]

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Dawei Liu and Zongxia Li and Hongyang Du and Xiyang Wu and Shihang Gui and Yongbei Kuang and Lichao Sun , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2604.05333 , eprinttype =. 2604.05333 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.05333 2026

[40] [40]

CoRR , volume =

Yutao Yang and Junsong Li and Qianjun Pan and Bihao Zhan and Yuxuan Cai and Lin Du and Jie Zhou and Kai Chen and Qin Chen and Xin Li and Bo Zhang and Liang He , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.01145 , eprinttype =. 2603.01145 , timestamp =

work page doi:10.48550/arxiv.2603.01145 2026

[41] [41]

2026 , eprint=

SkillOS: Learning Skill Curation for Self-Evolving Agents , author=. 2026 , eprint=

2026

[42] [42]

Skill Retrieval Augmentation for Agentic AI

Weihang Su and Jianming Long and Qingyao Ai and Yichen Tang and Changyue Wang and Yiteng Tu and Yiqun Liu , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2604.24594 , eprinttype =. 2604.24594 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.24594 2026

[43] [43]

arXiv preprint arXiv:2509.23285 , year=

Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning , author=. arXiv preprint arXiv:2509.23285 , year=

work page arXiv

[44] [44]

When2Call: When (not) to Call Tools , booktitle =

Hayley Ross and Ameya Sunil Mahabaleshwarkar and Yoshi Suhara , editor =. When2Call: When (not) to Call Tools , booktitle =. 2025 , url =. doi:10.18653/V1/2025.NAACL-LONG.174 , timestamp =

work page doi:10.18653/v1/2025.naacl-long.174 2025

[45] [45]

Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation , booktitle =

Yi. Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation , booktitle =. 2025 , url =. doi:10.18653/V1/2025.NAACL-INDUSTRY.9 , timestamp =

work page doi:10.18653/v1/2025.naacl-industry.9 2025

[46] [46]

Alignment for Efficient Tool Calling of Large Language Models , booktitle =

Hongshen Xu and Zihan Wang and Zichen Zhu and Lei Pan and Xingyu Chen and Shuai Fan and Lu Chen and Kai Yu , editor =. Alignment for Efficient Tool Calling of Large Language Models , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.898 , timestamp =

work page doi:10.18653/v1/2025.emnlp-main.898 2025

[47] [47]

2025 , eprint=

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior , author=. 2025 , eprint=

2025

[48] [48]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang and Nan Yang and Xiaolong Huang and Binxing Jiao and Linjun Yang and Daxin Jiang and Rangan Majumder and Furu Wei , title =. CoRR , volume =. 2022 , url =. doi:10.48550/ARXIV.2212.03533 , eprinttype =. 2212.03533 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.03533 2022

[49] [49]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[50] [50]

Manning and Stefano Ermon and Chelsea Finn , editor =

Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn , editor =. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , booktitle =. 2023 , url =

2023

[51] [51]

2026 , eprint=

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning , author=. 2026 , eprint=

2026

[52] [52]

arXiv e-prints , keywords =

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety. arXiv e-prints , keywords =. doi:10.48550/arXiv.2505.20065 , archivePrefix =. 2505.20065 , primaryClass =

work page doi:10.48550/arxiv.2505.20065

[53] [53]

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs. arXiv e-prints , keywords =. doi:10.48550/arXiv.2506.10054 , archivePrefix =. 2506.10054 , primaryClass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.10054

[54] [54]

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications , author=. arXiv preprint arXiv:2605.07358 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[55] [55]

Agentic Reinforced Policy Optimization

Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji. Agentic Reinforced Policy Optimization , journal =. 2025 , url =. doi:10.48550/ARXIV.2507.19849 , eprinttype =. 2507.19849 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.19849 2025

[56] [56]

CoRR , volume =

George Ling and Shanshan Zhong and Richard Huang , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2602.08004 , eprinttype =. 2602.08004 , timestamp =

work page doi:10.48550/arxiv.2602.08004 2026

[57] [57]

arXiv preprint arXiv:2504.06821 , year=

Zora Zhiruo Wang and Apurva Gandhi and Graham Neubig and Daniel Fried , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.06821 , eprinttype =. 2504.06821 , timestamp =

work page doi:10.48550/arxiv.2504.06821 2025

[58] [58]

Agent Workflow Memory , booktitle =

Zora Zhiruo Wang and Jiayuan Mao and Daniel Fried and Graham Neubig , editor =. Agent Workflow Memory , booktitle =. 2025 , url =

2025

[59] [59]

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo and Weizhi Zhang and Ye Yuan and Yusheng Zhao and Junwei Yang and Yiyang Gu and Bohan Wu and Binqi Chen and Ziyue Qiao and Qingqing Long and Rongcheng Tu and Xiao Luo and Wei Ju and Zhiping Xiao and Yifan Wang and Meng Xiao and Chenwu Liu and Jingyang Yuan and Shichang Zhang and Yiqiao Jin and Fan Zhang and Xian Wu and Hanqing Zhao and Dacheng T...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.21460 2025

[60] [60]

Patil and Huanzhi Mao and Fanjia Yan and Charlie Cheng

Shishir G. Patil and Huanzhi Mao and Fanjia Yan and Charlie Cheng. The Berkeley Function Calling Leaderboard. Forty-second International Conference on Machine Learning,. 2025 , url =

2025

[61] [61]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2406.12045 , eprinttype =. 2406.12045 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12045 2024

[62] [62]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen and Akari Asai and Victor Zhong and Rajarshi Das and Daniel Khashabi and Hannaneh Hajishirzi , editor =. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories , booktitle =. 2023 , url =. doi:10.18653/V1/2023.ACL-LONG.546 , timestamp =

work page doi:10.18653/v1/2023.acl-long.546 2023

[63] [63]

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai and Jie Zhou and Qin Chen and Liang He , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2604.20572 , eprinttype =. 2604.20572 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.20572 2026

[64] [64]

2026 , eprint=

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks , author=. 2026 , eprint=

2026