Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
Pith reviewed 2026-05-08 03:43 UTC · model grok-4.3
The pith
DataPRM improves AI data analysis agents by actively probing execution states to catch silent errors and applying ternary rewards that separate fixable mistakes from fatal ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DataPRM is an environment-aware generative process reward model that autonomously interacts with the execution environment, probing intermediate states to detect silent errors. It pairs this probing with a reflection-aware ternary reward strategy that distinguishes correctable grounding errors from irrecoverable mistakes, thereby providing more effective supervision for agentic data analysis than general-domain PRMs.
What carries the argument
DataPRM, a generative process reward model that acts as an active verifier through environment probing and applies reflection-aware ternary rewards to separate fixable from fatal errors.
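The mechanism the argument rests on can be sketched in a few lines. The ternary labels and the decision rule below are assumptions inferred from the review's description, not the authors' published procedure:

```python
# Illustrative sketch of reflection-aware ternary rewards. The labels and
# the decision rule are assumptions inferred from the review's description,
# not the authors' published interface.

GOOD, RECOVERABLE, FATAL = 1, 0, -1  # ternary reward values

def score_step(raised_exception, is_exploratory_retry, probed_state_ok):
    """Ternary reward: exploratory failures are recoverable, silent
    state corruption is fatal, clean steps are good."""
    if raised_exception:
        # Visible interpreter errors: penalize only non-exploratory ones.
        return RECOVERABLE if is_exploratory_retry else FATAL
    if not probed_state_ok:
        # Silent error: the code ran, but actively probing the execution
        # environment shows the intermediate state is wrong.
        return FATAL
    return GOOD

# A step that ran without error but left a wrong intermediate state
# (e.g. a merge that silently dropped rows) is scored FATAL, while a
# failed schema-probing attempt counts as recoverable exploration.
assert score_step(False, False, probed_state_ok=False) == FATAL
assert score_step(True, True, probed_state_ok=True) == RECOVERABLE
assert score_step(False, False, probed_state_ok=True) == GOOD
```

The key departure from passive PRMs is the third argument: it can only be computed by querying the live environment, not by reading the transcript.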
If this is right
- DataPRM raises policy LLM accuracy by 7.21 percent on ScienceAgentBench and 11.28 percent on DABStep when used for Best-of-N selection.
- Inserting DataPRM into reinforcement learning training produces 78.73 percent on DABench and 64.84 percent on TableBench, exceeding outcome-reward baselines.
- A 4B-parameter DataPRM already beats larger strong baselines and maintains gains across multiple test-time scaling methods.
- The model generalizes to diverse data analysis environments without task-specific retraining.
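The Best-of-N results above can be illustrated with a minimal selection loop. The aggregation rule (mean of per-step rewards) is an assumption, since the review does not state how step scores are combined:

```python
# Minimal Best-of-N selection driven by a process reward model. The
# mean-of-step-scores aggregation is an assumption for illustration;
# the paper's actual aggregation rule is not specified in this review.

def best_of_n(trajectories, step_scores):
    """Pick the candidate whose steps the PRM rates highest on average."""
    def aggregate(traj):
        scores = step_scores(traj)
        return sum(scores) / len(scores)
    return max(trajectories, key=aggregate)

# Toy usage: three candidate analysis transcripts with ternary step scores.
fake_scores = {
    "traj_a": [1, 1, -1],   # ends in a fatal silent error
    "traj_b": [1, 0, 1],    # one recoverable exploratory step
    "traj_c": [0, 0, 0],    # pure trial and error, no progress
}
pick = best_of_n(list(fake_scores), lambda t: fake_scores[t])
assert pick == "traj_b"
```

Note that a binary reward scheme collapsing 0 and -1 would tie traj_a and traj_b here, which is exactly the failure mode the ternary design targets.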
Where Pith is reading between the lines
- Similar environment-probing reward models could be built for other agentic domains such as code generation or experimental design where silent failures are common.
- Making reward models interactive rather than purely passive may become a standard requirement once agents operate in rich, stateful environments.
- The scalable trajectory-generation pipeline could be reused to create process supervision data for additional scientific workflows beyond the four benchmarks tested.
Load-bearing premise
The reported gains on downstream tasks arise from the environment-aware probing and ternary reward design rather than from the size, quality, or diversity of the 8K training instances or from unstated differences in evaluation setups.
What would settle it
Train a control reward model on the identical 8K trajectories but without environment probing or ternary labels, then measure whether the downstream improvements on ScienceAgentBench and DABStep disappear under the same Best-of-N and RL protocols.
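The proposed control amounts to a small factorial grid over the two design choices, holding data and protocol fixed. The configuration names below are hypothetical; only the 2 x 2 design is taken from the text:

```python
# Sketch of the settling experiment: hold the 8K trajectories and the
# evaluation protocol fixed, and toggle the two design choices
# independently. Config field names are hypothetical.

from itertools import product

def ablation_grid():
    probing = [True, False]          # environment probing vs static judging
    rewards = ["ternary", "binary"]  # reflection-aware vs plain labels
    return [
        {"probing": p, "rewards": r, "data": "same-8k", "protocol": "same-BoN"}
        for p, r in product(probing, rewards)
    ]

grid = ablation_grid()
assert len(grid) == 4                        # full 2 x 2 factorial
assert all(c["data"] == "same-8k" for c in grid)
```

If the gains survive only in the probing-plus-ternary cell, the attribution holds; if they persist in the ablated cells, data curation is doing the work.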
Figures
Original abstract
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
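The "silent error" failure mode the abstract centers on is easy to reproduce in miniature. The data and the join below are invented for this example:

```python
# A concrete silent error of the kind the paper targets: the code runs
# without exceptions, yet the result is wrong. An inner join on
# inconsistently formatted keys silently drops a record; only probing
# the intermediate state (row counts) reveals the flaw.

sales = {"US": 100, "DE ": 80}           # note the stray space in "DE "
regions = {"US": "Americas", "DE": "EMEA"}

# Inner-join semantics: keep only keys present in both tables.
joined = {k: (v, regions[k]) for k, v in sales.items() if k in regions}

assert len(sales) == 2
assert len(joined) == 1   # a row vanished; no exception was raised
```

A transcript-level judge sees clean code and plausible output; an active verifier that compares `len(joined)` against `len(sales)` catches the drop.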
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing general-domain process reward models (PRMs) fail to supervise data analysis agents because they cannot detect silent errors (logical flaws without interpreter exceptions) and incorrectly penalize necessary exploratory actions. It introduces DataPRM, an environment-aware generative PRM that autonomously probes execution states to uncover silent errors and uses a reflection-aware ternary reward strategy to distinguish correctable grounding errors from irrecoverable mistakes. A scalable pipeline generates over 8K training instances via diversity-driven trajectories and knowledge-augmented annotation. Experiments report that DataPRM boosts downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep under Best-of-N inference, outperforms strong baselines despite having only 4B parameters, generalizes across test-time scaling methods, and when integrated into RL yields 78.73% on DABench and 64.84% on TableBench, outperforming outcome-reward baselines.
Significance. If the reported gains are causally attributable to the environment-aware probing and ternary reward design rather than data curation alone, the work meaningfully extends process-level supervision from static mathematical reasoning to dynamic, interactive agentic data analysis. It offers a concrete mechanism for active verification in environments with silent failures and provides empirical evidence that process rewards can improve both inference-time selection and RL training of scientific agents.
major comments (2)
- [Experiments] The central empirical claims (7.21% and 11.28% Best-of-N gains, plus RL improvements) rest on comparisons whose attribution to environment-aware probing and the ternary strategy is not isolated. No ablation is described that holds the 8K-instance dataset and training procedure fixed while removing autonomous probing (replacing it with static rewards) or switching from ternary to binary rewards; without such controls it remains possible that differences in trajectory diversity, annotation quality, or evaluation harness explain the deltas.
- [Method] The reflection-aware ternary reward strategy is introduced as distinguishing 'correctable grounding errors' from 'irrecoverable mistakes,' yet the manuscript provides no formal definition, decision procedure, or example annotation guidelines for assigning the three categories during step-level labeling. This makes it impossible to assess whether the strategy is reproducible or whether its benefit is separable from the knowledge-augmented annotation pipeline.
minor comments (2)
- [Abstract] Baseline models are referred to only as 'strong baselines' or 'general-domain PRMs' without explicit names, sizes, or training details in the abstract or early sections, complicating direct replication of the reported margins.
- [Abstract] The abstract states that DataPRM 'exhibits robust generalizability across diverse Test-Time Scaling strategies' but does not list the specific strategies tested or report per-strategy breakdowns.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and outline the revisions we will make to strengthen the empirical attribution and methodological clarity.
Point-by-point responses
Referee: [Experiments] The central empirical claims (7.21% and 11.28% Best-of-N gains, plus RL improvements) rest on comparisons whose attribution to environment-aware probing and the ternary strategy is not isolated. No ablation is described that holds the 8K-instance dataset and training procedure fixed while removing autonomous probing (replacing it with static rewards) or switching from ternary to binary rewards; without such controls it remains possible that differences in trajectory diversity, annotation quality, or evaluation harness explain the deltas.
Authors: We agree that the reported gains would be more convincingly attributed to the environment-aware probing and ternary reward design if controlled ablations were provided that hold the 8K training instances and overall training procedure fixed. The current experiments compare DataPRM against strong baselines (including general-domain PRMs), but do not isolate these two components via static-reward or binary-reward variants on the same data. In the revised manuscript we will add these ablations, reporting performance deltas when probing is replaced by static rewards and when the ternary scheme is replaced by binary rewards, thereby clarifying the specific contributions of each design choice. revision: yes
Referee: [Method] The reflection-aware ternary reward strategy is introduced as distinguishing 'correctable grounding errors' from 'irrecoverable mistakes,' yet the manuscript provides no formal definition, decision procedure, or example annotation guidelines for assigning the three categories during step-level labeling. This makes it impossible to assess whether the strategy is reproducible or whether its benefit is separable from the knowledge-augmented annotation pipeline.
Authors: We acknowledge that the manuscript does not supply formal definitions, a decision procedure, or annotation examples for the three ternary categories. This omission hinders reproducibility assessment and separation from the broader annotation pipeline. In the revision we will insert a dedicated subsection that (i) formally defines each category, (ii) provides an explicit decision tree or rubric for step-level labeling, and (iii) includes several concrete annotation examples drawn from the 8K dataset, thereby enabling readers to reproduce the labeling process and evaluate the ternary strategy independently of the knowledge-augmented pipeline. revision: yes
Circularity Check
No significant circularity; empirical model training and benchmarking
Full rationale
The paper contains no mathematical derivation, equations, or functional forms. It describes an empirical pipeline for generating 8K training instances via diversity-driven trajectories and knowledge-augmented annotation, followed by training DataPRM and evaluating downstream improvements on external benchmarks (ScienceAgentBench, DABStep, DABench, TableBench). Performance deltas are reported as experimental outcomes rather than predictions derived from fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims; the contribution rests on standard supervised learning and ablation-style benchmarking, so its conclusions do not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Process-level reward signals can effectively guide LLM agents in dynamic environments when they include environment interaction and error-type differentiation.
invented entities (1)
- DataPRM (no independent evidence)
Forward citations
Cited by 1 Pith paper
- From Table to Cell: Attention for Better Reasoning with TABALIGN
  TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...