APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay
8 Pith papers cite this work.
citing papers
- Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
  A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
- Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
  COVERT generates verifiable synthetic tool-use environments for RL via validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
- $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
  τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
- CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
  CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and a 50% relative improvement over baselines.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues
  ToolWeave synthesizes realistic multi-turn tool-calling dialogues via dependent workflows and parameter provenance tracking; LLMs trained on its data outperform those trained on prior SOTA data, e.g. 39.75% versus 23.50% on BFCL-V3 multi-turn.
- Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
  A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.
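The "graduated rewards" idea named above can be illustrated with a minimal, generic sketch: partial credit when a tool call is atomically valid (parses and names a known tool), plus a bonus when it is consistent with the expected call order. This is an assumption-laden illustration of the general technique, not the paper's actual reward function; all names and reward values here are hypothetical.

```python
import json

def graduated_reward(call_str, valid_tools, expected_order, step):
    """Graduated reward sketch: partial credit for atomic validity,
    bonus for orchestration consistency. Values are illustrative."""
    try:
        call = json.loads(call_str)
    except json.JSONDecodeError:
        return 0.0                      # malformed call: no credit
    if call.get("tool") not in valid_tools:
        return 0.1                      # parses, but names an unknown tool
    reward = 0.5                        # atomically valid call
    if step < len(expected_order) and call["tool"] == expected_order[step]:
        reward += 0.5                   # consistent with the expected sequence
    return reward
```

The graded levels (0.0 / 0.1 / 0.5 / 1.0) give the policy a learning signal even when the full orchestration is not yet correct, which is the usual motivation for shaping rewards this way.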
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
  Longer action horizons bottleneck LLM agent training by destabilizing learning, but training at reduced horizons stabilizes optimization and generalizes better to longer horizons at evaluation.
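The horizon-length finding above amounts to truncating rollouts during training and evaluating at a longer horizon. A toy sketch of that setup, with a hypothetical environment interface (not the paper's actual training harness):

```python
class CountEnv:
    """Toy environment: +1 reward per step, episode ends after `length` steps."""
    def __init__(self, length):
        self.length = length
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length  # obs, reward, done

def rollout(env, policy, max_steps):
    """Run one episode, truncated at max_steps (the training horizon)."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total
```

Training would call `rollout(env, policy, max_steps=short_horizon)` to keep learning stable, while evaluation uses a larger `max_steps`; the claim is that policies trained this way still generalize to the longer evaluation horizon.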