pith. machine review for the scientific record.

arxiv: 2604.09813 · v1 · submitted 2026-04-10 · 💻 cs.AI

Recognition: unknown

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords tool-use agents · synthetic data generation · reinforcement learning · oracle-preserving augmentation · agentic RL · verifiable environments · data synthesis pipeline

The pith

A two-stage pipeline generates verifiable synthetic environments for RL that improve agent tool-use robustness under ambiguity and noise while preserving ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COVERT to address the mismatch between typical offline synthetic tool-use data and the interactive, reward-checkable needs of reinforcement learning for agents. It first builds reliable base trajectories using self-evolving synthesis and multi-level validation, then applies targeted augmentations that add distractor tools, indirect queries, and unreliable outputs without altering the correct tool calls or final answers. This design supports automatic reward signals through reference matching or lightweight verification, allowing RL to optimize policies for handling real-world messiness. Experiments show accuracy gains on tool-use benchmarks when the method is used for RL, with further improvements when combined with prior supervised fine-tuning and little impact on general capabilities. A reader would care because it offers a concrete way to move beyond static data into online refinement for more reliable agent behavior.
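To make the reward machinery concrete, here is a minimal Python sketch of how reference matching with a judge fallback could be wired; the trajectory fields, function names, and the trigger condition for the judge are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: automatic reward for a rolled-out tool-use trajectory.
# `oracle_calls` / `oracle_answer` stand for the preserved ground truth;
# the field names and the `judge` callable are assumptions for illustration.

def normalize_call(call: dict) -> tuple:
    """Canonicalize a tool call so that argument order does not matter."""
    return (call["name"], tuple(sorted(call.get("arguments", {}).items())))

def reference_match_reward(pred_calls: list[dict], oracle_calls: list[dict]) -> float:
    """Exact reference matching: 1.0 only if the predicted call sequence
    matches the preserved oracle call sequence."""
    if len(pred_calls) != len(oracle_calls):
        return 0.0
    matched = all(normalize_call(p) == normalize_call(o)
                  for p, o in zip(pred_calls, oracle_calls))
    return 1.0 if matched else 0.0

def compute_reward(trajectory: dict, judge=None) -> float:
    """Route standard cases to reference matching and special behaviors
    (e.g., detecting an injected erroneous tool output) to a lightweight judge."""
    if trajectory.get("requires_error_detection") and judge is not None:
        # Judge-assisted verification for behaviors exact matching cannot score.
        return 1.0 if judge(trajectory) else 0.0
    return reference_match_reward(trajectory["pred_calls"], trajectory["oracle_calls"])
```

The trigger for judge-assisted scoring is shown here as a per-sample flag; the paper only says such verification is used for special behaviors like error detection, so this routing rule is a guess.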

Core claim

COVERT is a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity by introducing distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth to enable automatic reward computation via reference matching and lightweight judge-assisted verification for RL optimization of tool-calling policies.

What carries the argument

The oracle-preserving augmentation stage, which adds distractors, ambiguity, and feedback noise while keeping the original correct tool sequence and answer fixed as the reference for reward signals.
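Below is a minimal sketch of what an oracle-preserving augmentation step could look like; the sample fields, the distractor pool, and the two placeholder helpers are hypothetical, chosen only to illustrate the invariant that the oracle calls and final answer pass through untouched.

```python
import random

def rewrite_indirectly(query: str) -> str:
    """Placeholder for an LLM-driven rewrite into an indirect, ambiguous request."""
    return f"I'm not sure how to phrase this, but roughly: {query}"

def corrupt_output(output: str) -> str:
    """Placeholder noise injection: wrap the observation in an error-like envelope."""
    return f"[ERROR 503] upstream tool failed; partial payload: {output[:40]}"

def augment_environment(sample: dict,
                        distractor_pool: list[dict],
                        n_distractors: int = 3,
                        noise_rate: float = 0.2,
                        seed: int = 0) -> dict:
    """Raise surface complexity while copying the oracle through unchanged.

    `sample` is assumed to contain: 'tools', 'query', 'tool_outputs',
    'oracle_calls', 'oracle_answer'. Only the first three are modified.
    """
    rng = random.Random(seed)
    aug = dict(sample)

    # 1. Distractor tools: plausible but irrelevant tools mixed into the tool list.
    k = min(n_distractors, len(distractor_pool))
    aug["tools"] = sample["tools"] + rng.sample(distractor_pool, k)

    # 2. Indirect query: replace the direct request with an ambiguous paraphrase.
    aug["query"] = rewrite_indirectly(sample["query"])

    # 3. Noisy feedback: corrupt a fraction of observed tool outputs, never the oracle.
    aug["tool_outputs"] = [corrupt_output(o) if rng.random() < noise_rate else o
                           for o in sample["tool_outputs"]]

    # The oracle is copied verbatim; this is what keeps reference-matching
    # rewards valid after augmentation.
    aug["oracle_calls"] = sample["oracle_calls"]
    aug["oracle_answer"] = sample["oracle_answer"]
    return aug
```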

If this is right

  • Enables standard RL algorithms to use automatic rewards for most tool calls via exact reference matching.
  • Supports optimization for special cases like error detection through lightweight judge verification.
  • Delivers additive performance gains on tool-use benchmarks when applied after supervised fine-tuning.
  • Maintains general model capabilities with minimal regression while targeting robustness under noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same controllable synthesis pattern could extend to other interactive agent domains where ground-truth actions can be isolated from environmental noise.
  • Preserving an oracle while varying surface complexity offers a route to test whether RL policies generalize better than those trained only on clean data.
  • If the validation steps scale efficiently, this method reduces reliance on costly human-curated interactive traces for agent training.

Load-bearing premise

Multi-level validation produces high-quality base trajectories and the chosen augmentations increase complexity without introducing systematic biases that would make the preserved oracle unreliable as ground truth for reward computation.
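As an illustration of how "multi-level validation" might be stacked, here is a minimal sketch with three hypothetical levels (schema check, sandboxed execution, answer consistency); the abstract does not enumerate the paper's actual validation criteria, so everything below is an assumption.

```python
def schema_valid(call: dict, tool_specs: dict) -> bool:
    """Level 1: the call names a known tool and supplies only declared parameters."""
    spec = tool_specs.get(call.get("name"))
    if spec is None:
        return False
    return set(call.get("arguments", {})) <= set(spec.get("parameters", {}))

def executable(call: dict, executor) -> bool:
    """Level 2: the call runs against a sandboxed tool implementation without raising."""
    try:
        executor(call)
        return True
    except Exception:
        return False

def answer_consistent(trajectory: dict, checker) -> bool:
    """Level 3: the final answer is supported by the tool results (e.g., an LLM check)."""
    return bool(checker(trajectory["tool_outputs"], trajectory["final_answer"]))

def validate_trajectory(trajectory: dict, tool_specs: dict, executor, checker) -> bool:
    """A base trajectory is kept only if every call passes every level."""
    calls = trajectory["calls"]
    return (all(schema_valid(c, tool_specs) for c in calls)
            and all(executable(c, executor) for c in calls)
            and answer_consistent(trajectory, checker))
```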

What would settle it

A decisive negative result: running RL on the generated environments yields no net accuracy gain, or causes regressions, specifically on held-out tasks with ambiguous queries and unreliable tool outputs, compared to the base model after supervised fine-tuning alone.

Figures

Figures reproduced from arXiv: 2604.09813 by Bing Yin, Jianshu Chen, Qingyu Yin, Shiyang Li, Siyuan Xu, Tianyi Liu, Tuo Zhao, Xin Liu, Yixiao Li, Zhan Shi, Zilong Wang, Zixuan Zhang.

Figure 1: Overview of the proposed COVERT pipeline. Stage 1 generates reliable base …
Figure 2: Reliable tool-use trajectory generation pipeline (Stage I) with diverse prompt …
Figure 3: Illustration of oracle-preserving augmentation. Starting from a simple and reliable …
Figure 4: System prompt used for evaluation.
Figure 5: Training reward curves for COVERT-RL initialized from Qwen2.5-Instruct (blue) …
Figure 6: Example 1: Raw tool-calling data example of layered symbolic reasoning.
Figure 7: Example 2: Raw tool-calling data example of parallel tool calling.
Figure 8: Example 3: Raw tool-calling data example of multi-turn tool calling.
Figure 9: Raw case-study conversations (Case 1) comparing base vs. COVERT-RL models.
Figure 10: Raw case-study conversations (Case 2) comparing base vs. COVERT-RL models.
read the original abstract

Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents COVERT, a two-stage pipeline for synthesizing tool-use trajectories suitable for agentic reinforcement learning. The first stage uses self-evolving synthesis with multi-level validation to create reliable base trajectories, while the second stage applies oracle-preserving augmentations—including distractor tools, ambiguous queries, and noisy tool outputs—to increase complexity without altering the ground-truth tool calls and answers. This setup supports automatic reward computation for RL. Experiments demonstrate accuracy improvements on BFCL v3 and ACEBench for the Qwen2.5-Instruct-14B model, with additive benefits when combined with supervised fine-tuning and limited regressions on general benchmarks.

Significance. If the oracle preservation and validation steps hold without introducing systematic biases, this work provides a practical method for generating RL environments that address the limitations of existing SFT-focused synthetic corpora. The concrete performance lifts and the demonstration of complementarity with SFT are positive indicators. The evaluation on external public benchmarks rather than self-referential metrics strengthens the claims.

major comments (2)
  1. [Augmentation stage] The central mechanism of oracle-preserving augmentations for noisy outputs is not sufficiently detailed to confirm that reference matching produces rewards aligned with the agent's observations. Without explicit handling for how erroneous outputs affect the preserved oracle, there is a risk that the RL signal reinforces correct calls despite contradictory feedback, failing to teach noise detection as intended.
  2. [Experimental results] The benchmark results lack reporting of statistical significance, variance across runs, or specific controls for augmentation parameters, which are necessary to substantiate the robustness improvements claimed.
minor comments (1)
  1. [Abstract] Consider adding a brief explanation or citation for 'self-evolving synthesis' to improve accessibility for readers unfamiliar with the term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for clarification and strengthening of the claims. We address each major comment point-by-point below and commit to revisions that will improve the paper without altering its core contributions.

read point-by-point responses
  1. Referee: [Augmentation stage] The central mechanism of oracle-preserving augmentations for noisy outputs is not sufficiently detailed to confirm that reference matching produces rewards aligned with the agent's observations. Without explicit handling for how erroneous outputs affect the preserved oracle, there is a risk that the RL signal reinforces correct calls despite contradictory feedback, failing to teach noise detection as intended.

    Authors: We appreciate the referee's focus on the noisy output augmentation, as this is central to teaching robustness. In the COVERT pipeline, oracle preservation fixes the ground-truth tool calls and final answers, enabling reference matching for reward in standard trajectories. For the specific case of noisy/erroneous tool outputs (which simulate unreliable feedback), the design intentionally uses lightweight judge-assisted verification for special behaviors such as error detection, rather than pure reference matching on the final answer. This ensures the RL signal rewards correct handling of noise (e.g., detecting errors and recovering) without reinforcing calls that ignore contradictory observations. We acknowledge that the manuscript's description of this reward alignment could be more explicit, including how the preserved oracle interacts with observed noisy outputs. We will revise the relevant sections (likely Section 3.2 and the reward computation paragraph) to add detailed examples, pseudocode, and clarification on when judge-assisted verification is triggered versus reference matching. revision: yes

  2. Referee: [Experimental results] The benchmark results lack reporting of statistical significance, variance across runs, or specific controls for augmentation parameters, which are necessary to substantiate the robustness improvements claimed.

    Authors: We agree that the current experimental reporting is insufficient to fully substantiate robustness. The reported gains (e.g., +3.4 on BFCL v3, +6.3 on ACEBench) are from single runs, and we did not include variance or statistical tests. We will revise the experimental section to include: (1) results from multiple independent runs with different random seeds, reporting means and standard deviations; (2) statistical significance tests (e.g., paired t-tests or bootstrap) comparing COVERT-RL against baselines; and (3) controls/ablation studies on key augmentation parameters such as noise injection rate, number of distractor tools, and ambiguity levels. These additions will be placed in the main results table and a new ablation subsection, with details on compute budget for the extra runs. revision: yes
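For readers wanting a picture of the ablation controls promised in response 2, here is a hypothetical sweep over the augmentation knobs; the parameter names, value ranges, and seed counts are illustrative guesses rather than values reported in the paper.

```python
from itertools import product

# Hypothetical ablation grid over augmentation parameters and seeds.
# None of these values come from the paper; they only show how such a sweep
# could be organized so that means and standard deviations can be reported.
ablation_grid = {
    "noise_rate": [0.0, 0.1, 0.3],            # fraction of tool outputs corrupted
    "n_distractors": [0, 3, 6],               # irrelevant tools added per task
    "ambiguity": ["none", "mild", "strong"],  # how indirect the user query is
    "seed": [0, 1, 2],                        # independent runs per configuration
}

configs = [dict(zip(ablation_grid.keys(), values))
           for values in product(*ablation_grid.values())]
print(f"{len(configs)} training runs, e.g. {configs[0]}")
```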

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks

full rationale

The paper describes a two-stage synthesis pipeline (self-evolving base trajectories followed by oracle-preserving augmentations) and reports empirical gains on independent public benchmarks (BFCL v3, ACEBench, general-ability suites). No equations, fitted parameters renamed as predictions, or self-citation chains are invoked to derive the central claims; the evaluation metrics and test sets lie outside the synthesis process itself, so the reported improvements constitute independent evidence rather than a reduction to the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that multi-level validation produces reliable base trajectories, and that the augmentations preserve oracle correctness while adding useful complexity. No new mathematical entities or free parameters are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Multi-level validation can filter trajectories to produce reliable base data for RL
    Invoked in the description of the first stage of the pipeline.
  • domain assumption Augmentations can increase environmental complexity while strictly preserving oracle tool calls and answers
    Central to the second stage and the claim of automatic reward computation.

pith-pipeline@v0.9.0 · 5567 in / 1424 out tokens · 61744 ms · 2026-05-10T17:37:25.250587+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson,...

  2. [2]

    ACEBench: A comprehensive evaluation of LLM tool usage

    Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Yuefeng Huang, Xiangcheng Liu, Xinzhi Wang, and Wu Liu. ACEBench: A comprehensive evaluation of LLM tool usage. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 12970–12998,

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman, Joshua Hilton, Ilya Sutskever, Dario Amodei, and Wojciech Zaremba. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  4. [4]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, Hanwei Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, and Hui Qu. DeepSeek-...

  5. [5]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. arXiv preprint arXiv:2406.13542,

  6. [6]

    Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

    Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-Star: Empowering LLM-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410,

  7. [7]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in LLMs.arXiv preprint arXiv:2504.11536,

  8. [8]

    RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. RLEF: Grounding code LLMs in execution feedback with reinforcement learning.arXiv preprint arXiv:2410.02089,

  9. [9]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset.arXiv preprint arXiv:2103.03874,

  10. [10]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

  11. [11]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516,

  12. [12]

    Webthinker: Empowering large reasoning models with deep research capability,

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025a. Xuefeng Li, Haoyang Zou, and Pengfei Liu. ToRL: Scaling tool-integrated RL.arXiv preprint arXiv:2503.23383, 2025b. Minpeng Liao, Wei L...

  13. [13]

    Toolace: Winning the points of llm function calling

    Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. ToolACE: Winning the points of L...

  14. [14]

    ToolVerifier: Generalization to New Tools via Self-Verification

    Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, and Jane Dwivedi-Yu. ToolVerifier: Generalization to new tools via self-verification. arXiv preprint arXiv:2402.14158,

  15. [15]

    Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay

    Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, and Caiming Xiong. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601,

  16. [16]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958,

  17. [17]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789,

  18. [18]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022,

  19. [19]

    MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

    Saptarshi Sengupta, Harsh Vashistha, Kristal Curtis, Akshay Mallipeddi, Abhinav Mathur, Joseph Ross, and Liang Gou. MAG-V: A multi-agent framework for synthetic data generation and verification. arXiv preprint arXiv:2412.04494,

  20. [20]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  21. [21]

    Learning to Summarize from Human Feedback

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback.arXiv preprint arXiv:2009.01325,

  22. [22]

    Toolalpaca: Generalized tool learning for language models with 3000 simulated cases

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301,

  23. [23]

    Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

    Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Xinyuan Song, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Xin Yi, Zhongwei Wan, Xinhang Yuan, Zijun Wang, Kuan Lu, Menghao Huo, Tang Jingqun, Guangwu Qian, Keqin Li, Qiuwu Chen, and Lewei He. Enhancing code LLMs with reinforcement learning in code generation: A survey.arXiv preprint arXiv:2...

  24. [24]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024b. Zhangchen Xu, Adrian...

  25. [25]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

  26. [26]

    Tool zero: Training tool- augmented LLMs via pure RL from scratch

    Yirong Zeng, Xiao Ding, Yutai Hou, Yuxian Wang, Li Du, Juyi Dai, Qiuyang Ding, Duyu Tang, Dandan Tu, Weiwen Liu, Bing Qin, and Ting Liu. Tool zero: Training tool-augmented LLMs via pure RL from scratch. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 9135–9147,

  27. [27]

    Nemotron-research-tool-n1: Exploring tool-using language models with reinforced reasoning

    Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Zhiwei Liu, Yihao Feng, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, and Caiming Xiong. xLAM: A family of large action models to empower AI agent systems. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the ...
