APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

2025 · arXiv 2504.03601

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

Attention analysis shows that LLM tool selection failures occur at the readout/decision stage, not because the model fails to attend to the correct tool definition.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

cs.CL · 2026-06-09 · conditional · novelty 7.0

ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B, which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

cs.IR · 2026-05-11 · unverdicted · novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.

τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment

cs.AI · 2025-06-09 · unverdicted · novelty 7.0

τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar gains on other benchmarks.

Scaling Agentic Capabilities via Grounded Interaction Synthesis

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

GAIS synthesizes diverse, high-fidelity agentic tasks from real-world MCP servers and adversarial planning, outperforming LLM-only baselines on BFCL, τ²-Bench, and ACEBench with greater data efficiency.

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and a 50% relative improvement over baselines.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

cs.CL · 2026-04-03 · conditional · novelty 6.0

ToolWeave synthesizes realistic multi-turn tool-calling dialogues via dependent workflows and parameter provenance tracking, yielding LLMs that score higher on benchmarks, for example 39.75% on BFCL-V3 multi-turn versus 23.50% with prior SOTA data.

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

cs.LG · 2026-03-25 · unverdicted · novelty 6.0

A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

cs.CL · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

The paper proposes an image-bank harness and ODE closed-loop data generation to boost multimodal deep search agents, reporting average score gains from 24.9% to 39.0% on 8 benchmarks for the 8B model and from 30.6% to 41.5% for the 30B model.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

cs.CL · 2025-11-26 · unverdicted · novelty 5.0

Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

cs.AI · 2025-08-10 · unverdicted · novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
