
ToolACE: Winning the Points of LLM Function Calling

Weiwen Liu, Xu Zeng, Jian Jiang, 1 other · 2025 · arXiv 2409.00920

27 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary

background 2 · dataset 1

citation-polarity summary

background 2 · use dataset 1

representative citing papers

Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

SkillWeaver formalizes compositional skill routing for LLM agents and introduces SAD, which raises step-level decomposition accuracy from 51% to 67.7% on a new 300-query benchmark over 2209 real MCP skills.

Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Trajectories from a Bittensor ShoppingBench subnet arena, filtered to retain only agentic tool-calling behavior, enable SFT+GRPO post-training of Qwen3-4B to 42.7% ASR on leak-guarded held-out tests, nearly matching synthetic-data baselines with a fraction of a day's data.

Cybersecurity AI (CAI) Dataset

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

CAI Dataset is presented as the largest described corpus of LLM-driven hacker trajectories, with the claim that operator data concentration in frontier-model providers creates a major security risk best addressed by on-premise specialized LLMs.

RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

cs.LG · 2026-05-10 · unverdicted · novelty 7.0 · 3 refs

RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

Q-ARE: An Evaluation Dataset for Query Based API Recommendation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

Q-ARE dataset and metrics reveal that existing API recommendation methods and LLMs degrade sharply on multi-level invocation chains.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

cs.AI · 2025-10-16 · unverdicted · novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

cs.MA · 2026-06-10 · unverdicted · novelty 6.0

SAIGuard uses communication-state simulation on the MAS interaction graph to detect and sanitize risky messages via reconstruction deviations, reducing attack success while preserving utility.

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SMH-Bench supplies 1,100 stratified tasks in a verifiable smart-home simulator to measure LLM performance on explicit control, scheduling, ambiguity, and personalization as environment complexity grows.

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

ProAct uses idle compute to anticipate user needs via dialogue history and memory, achieving 14.8% fewer turns, 11.7% less user effort, and 28.1% fewer hallucinations than reactive baselines on the new ProActEval benchmark.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.

Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models

cs.SE · 2026-04-10 · unverdicted · novelty 6.0

A two-stage multi-agent LLM converts structural inputs to JSON and then to platform-specific scripts for ETABS, SAP2000, and OpenSees, achieving over 90% accuracy on 20 frame problems across ten trials.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

cs.AI · 2026-04-02 · unverdicted · novelty 6.0 · 2 refs

ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

cs.CL · 2026-01-08 · unverdicted · novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.

ToolRL: Reward is All Tool Learning Needs

cs.LG · 2025-04-16 · conditional · novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

SenseWalk: Agent-Based Semantic Trajectory Simulation Powered by Large Language Models in Zoned Environments

cs.HC · 2026-07-01 · unverdicted · novelty 5.0

SenseWalk is an LLM-powered agent-based simulation system for semantic trajectories that combines LLMs with the social force model, supported by a user interface, quantitative evaluation, and a user study with 12 participants.

Understanding How Enterprises Adopt the Model Context Protocol for LLM-Driven Software Engineering

cs.SE · 2026-06-08 · unverdicted · novelty 5.0

Interviews with 20 practitioners show MCP supports cross-system collaboration and task decoupling in LLM workflows but is limited by ecosystem fragmentation, coordination issues, and state management problems.

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

cs.CL · 2026-05-14 · unverdicted · novelty 5.0

AsyncFC decouples LLM decoding from function execution via symbolic futures, enabling overlap and parallelism to reduce end-to-end latency on function-calling benchmarks while preserving accuracy.

From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

cs.AI · 2026-04-08 · unverdicted · novelty 5.0

LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.

Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents

cs.AI · 2026-03-16 · unverdicted · novelty 5.0

ALTK supplies reusable middleware components that systematically address failure modes across the full AI agent lifecycle from request to response.

Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

cs.CV · 2025-09-23 · unverdicted · novelty 5.0

Structured reflection makes error diagnosis and repair an explicit trainable step that improves reliability and reduces redundant calls in tool-using LLM agents.

