SkillWeaver formalizes compositional skill routing for LLM agents and introduces SAD, which raises step-level decomposition accuracy from 51% to 67.7% on a new 300-query benchmark over 2,209 real MCP skills.
ToolACE: Winning the Points of LLM Function Calling
27 Pith papers cite this work.
representative citing papers
Trajectories from a Bittensor ShoppingBench subnet arena, filtered to retain only agentic tool-calling behavior, enable SFT+GRPO post-training of Qwen3-4B to 42.7% ASR on leak-guarded held-out tests, nearly matching synthetic-data baselines with a fraction of a day's data.
CAI Dataset is presented as the largest described corpus of LLM-driven hacker trajectories, with the claim that operator data concentration in frontier-model providers creates a major security risk best addressed by on-premise specialized LLMs.
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
Q-ARE dataset and metrics reveal that existing API recommendation methods and LLMs degrade sharply on multi-level invocation chains.
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
SAIGuard uses communication-state simulation on the MAS interaction graph to detect and sanitize risky messages via reconstruction deviations, reducing attack success while preserving utility.
SMH-Bench supplies 1,100 stratified tasks in a verifiable smart-home simulator to measure LLM performance on explicit control, scheduling, ambiguity, and personalization as environment complexity grows.
Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.
ProAct uses idle compute to anticipate user needs via dialogue history and memory, achieving 14.8% fewer turns, 11.7% less user effort, and 28.1% fewer hallucinations than reactive baselines on the new ProActEval benchmark.
A learned orchestration policy for LLM agents jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
A two-stage multi-agent LLM pipeline converts structural inputs to JSON and then to platform-specific scripts for ETABS, SAP2000, and OpenSees, achieving over 90% accuracy on 20 frame problems across ten trials.
ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
SenseWalk is an LLM-powered agent-based simulation system for semantic trajectories that combines LLMs with the social force model, supported by a user interface, quantitative evaluation, and a user study with 12 participants.
Interviews with 20 practitioners show MCP supports cross-system collaboration and task decoupling in LLM workflows but is limited by ecosystem fragmentation, coordination issues, and state management problems.
AsyncFC decouples LLM decoding from function execution via symbolic futures, enabling overlap and parallelism to reduce end-to-end latency on function-calling benchmarks while preserving accuracy.
LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.
ALTK supplies reusable middleware components that systematically address failure modes across the full AI agent lifecycle from request to response.
Structured reflection makes error diagnosis and repair an explicit trainable step that improves reliability and reduces redundant calls in tool-using LLM agents.
citing papers explorer
SenseWalk: Agent-Based Semantic Trajectory Simulation Powered by Large Language Models in Zoned Environments