SEAL: Synergistic Co-Evolution of Agents and Learning Environments
Pith reviewed 2026-06-30 13:35 UTC · model grok-4.3
The pith
SEAL co-evolves LLM agents and training environments via shared turn-level failure diagnoses to improve tool-use learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEAL collects on-policy trajectories under executable verification, converts failed rollouts into turn-level failure labels, and uses these labels both to evolve the environment's training interface (clearer tool affordances, constraint information, recovery feedback) and to optimize the policy through diagnosis-guided advantage reweighting, producing consistent gains across backbones in in-distribution and out-of-distribution tool-use evaluations.
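As a rough illustration of the loop described in the core claim, the Python sketch below wires the four stages together for a single round. Every name in it (`seal_round`, `rollout`, `verify`, `diagnose`, `evolve_env`, `update_policy`) is a hypothetical placeholder rather than anything taken from the paper. The abstract does not specify how each stage works, so each one is passed in as a callable.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical types: a trajectory is any object; a diagnosis is a list of
# (turn_index, failure_label) pairs. Neither is specified by the source.
Diagnosis = List[Tuple[int, str]]


def seal_round(
    policy: Any,
    env: Any,
    tasks: List[Any],
    rollout: Callable[[Any, Any, Any], Any],
    verify: Callable[[Any], bool],
    diagnose: Callable[[Any], Diagnosis],
    evolve_env: Callable[[Any, List[Diagnosis]], Any],
    update_policy: Callable[[Any, List[Tuple[Any, bool, Diagnosis]]], Any],
) -> Tuple[Any, Any]:
    """Run one co-evolution round and return (new_policy, new_env)."""
    labelled = []
    for task in tasks:
        traj = rollout(policy, env, task)      # on-policy trajectory
        ok = verify(traj)                      # executable verification
        labels = [] if ok else diagnose(traj)  # only failed rollouts are diagnosed
        labelled.append((traj, ok, labels))

    failures = [labels for _, ok, labels in labelled if not ok]
    new_env = evolve_env(env, failures)           # the diagnoses drive the environment
    new_policy = update_policy(policy, labelled)  # and the same diagnoses drive the policy
    return new_policy, new_env
```

The point of the sketch is only that both updates consume the output of one diagnosis pass, which is the shared-signal property the review treats as central.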
What carries the argument
The shared signal of turn-level failure labels diagnosed from on-policy trajectories, applied simultaneously to environment adaptation and policy optimization.
If this is right
- With only 400 training samples, SEAL produces average-point gains of +8.25 to +26.25 across three backbones.
- The same training run yields positive transfer on out-of-distribution multi-turn tool-use evaluations.
- Environment adaptation consists of exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback.
- Policy optimization uses diagnosis-guided advantage reweighting derived from the same failure labels.
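One way to picture diagnosis-guided advantage reweighting is sketched below in Python. It is an assumption-heavy illustration, not the paper's formula: the group-normalized baseline, the `blame_weight` multiplier, and the `Rollout` fields are all hypothetical choices made for this example, since the abstract does not state how the reweighting is computed.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Set


@dataclass
class Rollout:
    reward: float           # e.g. 1.0 if executable verification passed, else 0.0
    failed_turns: Set[int]  # hypothetical: turn indices blamed by the diagnosis step
    num_turns: int


def turn_advantages(
    group: List[Rollout], blame_weight: float = 2.0, eps: float = 1e-6
) -> List[List[float]]:
    """Return one advantage per turn for every rollout in the group."""
    rewards = [r.reward for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    result = []
    for r in group:
        # Assumed baseline: reward normalized within the sampled group.
        base = (r.reward - mu) / (sigma + eps)
        # Assumed reweighting: turns flagged by the diagnosis get a larger weight.
        result.append(
            [base * (blame_weight if t in r.failed_turns else 1.0)
             for t in range(r.num_turns)]
        )
    return result
```

Under these assumptions a failed rollout pushes hardest against the turn that was diagnosed as the cause, while the remaining turns receive the ordinary trajectory-level signal.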
Where Pith is reading between the lines
- The same shared-label loop might reduce the volume of human-written demonstrations needed for other interactive tasks such as code repair or multi-step planning.
- Dynamic learning interfaces that change with the agent's revealed weaknesses could complement static reward models used in conventional reinforcement learning from human feedback.
- If the failure-diagnosis step itself can be automated reliably, the method points toward fully closed-loop self-improvement without external supervision at each cycle.
Load-bearing premise
Turn-level failure labels extracted from on-policy trajectories form a reliable shared signal that can drive both environment changes and policy updates without adding new biases or inconsistencies.
What would settle it
If the same 400-sample training regime, run on the three backbones with the diagnosed failure labels replaced by random or fixed labels, preserved the average-point gains and the out-of-distribution transfer, the value of the shared-signal mechanism would be falsified.
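A control of this kind needs substitute labels that keep the shape of the diagnosed ones, so that only their information content changes. The Python sketch below shows one way to build them. The function name `control_labels`, the two modes, and the example label names are hypothetical, since the paper's label taxonomy is not given in the abstract.

```python
import random
from typing import List, Sequence


def control_labels(
    diagnosed: List[List[str]],
    mode: str,
    label_set: Sequence[str],
    seed: int = 0,
) -> List[List[str]]:
    """Replace diagnosed per-turn labels with uninformative ones.

    `diagnosed` holds one list of failure labels per failed trajectory.
    The output has the same nesting and lengths, so it can be fed to the
    same training code as the real diagnoses.
    """
    rng = random.Random(seed)  # seeded so the control run is reproducible
    if mode == "random":
        return [[rng.choice(label_set) for _ in traj] for traj in diagnosed]
    if mode == "fixed":
        return [[label_set[0] for _ in traj] for traj in diagnosed]
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `control_labels([["wrong_tool", "bad_argument"]], "fixed", ["wrong_tool", "bad_argument"])` returns `[["wrong_tool", "wrong_tool"]]`.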
Original abstract
Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that most self-evolution methods for LLM agents adapt either the policy or the learning environment in isolation, creating 'Agent-Environment Misalignment.' SEAL addresses this via a closed-loop framework that collects on-policy trajectories, diagnoses failed rollouts into turn-level failure labels, and uses these labels as a shared signal: the environment adapts by exposing clearer tool affordances, constraints, and recovery feedback, while the policy is optimized via diagnosis-guided advantage reweighting. With only 400 training samples, SEAL reports +8.25 to +26.25 average-point gains across three backbones on in-distribution and out-of-distribution multi-turn tool-use evaluations, demonstrating positive OOD transfer.
Significance. If the central claim holds after validation of the diagnosis step, the work would be significant for low-resource agent learning: jointly adapting the learner and its training substrate could yield more robust self-improving agents than isolated policy or environment tuning, with the reported OOD transfer and small-sample gains offering a concrete path toward efficient interactive tool-use systems.
major comments (2)
- [Abstract] The performance gains (+8.25 to +26.25 points) and OOD transfer rest on the claim that turn-level failure labels from on-policy trajectories provide a reliable shared signal for simultaneous environment adaptation and policy optimization. The abstract provides no description of the diagnosis procedure, no inter-annotator agreement or validation metrics, and no ablation on label noise; if the diagnosis is LLM-mediated or heuristic, errors could create self-reinforcing loops that inflate in-distribution scores without true robustness. This is load-bearing for the closed-loop co-evolution claim.
- [Abstract] The methods for executable verification of trajectories and the precise mechanism of diagnosis-guided advantage reweighting are not detailed in the provided abstract; without these, it is impossible to assess whether the reported gains reduce to fitted parameters or introduce new biases in the co-evolution loop.
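The validation metric the first comment asks for could be as simple as chance-corrected agreement between two independent diagnosers (two annotators, or an annotator and the automated diagnoser) on the same failed turns. The Python sketch below computes Cohen's kappa for that purpose. The choice of this statistic and the example label names are assumptions for illustration; the source does not say which validation the paper uses.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences over the same turns."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of turns where both diagnosers agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each diagnoser labelled independently at their own rates.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value near 1 would indicate that the turn-level labels are reproducible; a value near 0 would mean agreement is no better than chance, which is the scenario in which the self-reinforcing loop the referee describes becomes a real risk.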
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential significance of the SEAL framework for low-resource agent learning. The two major comments both concern the level of detail in the abstract regarding the diagnosis procedure, executable verification, and advantage reweighting. The full manuscript elaborates on these elements; we will revise the abstract to incorporate brief descriptions and validation references as part of addressing the major revision request.
Point-by-point responses
-
Referee: [Abstract] The performance gains (+8.25 to +26.25 points) and OOD transfer rest on the claim that turn-level failure labels from on-policy trajectories provide a reliable shared signal for simultaneous environment adaptation and policy optimization. The abstract provides no description of the diagnosis procedure, no inter-annotator agreement or validation metrics, and no ablation on label noise; if the diagnosis is LLM-mediated or heuristic, errors could create self-reinforcing loops that inflate in-distribution scores without true robustness. This is load-bearing for the closed-loop co-evolution claim.
Authors: We agree that the abstract is too concise on the diagnosis procedure and lacks any mention of validation or noise analysis. The full manuscript details the turn-level diagnosis process and includes supporting analyses. To directly address the concern about reliability and potential self-reinforcing loops, we will revise the abstract to add a short clause summarizing the diagnosis validation approach and a reference to the noise ablations. revision: yes
-
Referee: [Abstract] The methods for executable verification of trajectories and the precise mechanism of diagnosis-guided advantage reweighting are not detailed in the provided abstract; without these, it is impossible to assess whether the reported gains reduce to fitted parameters or introduce new biases in the co-evolution loop.
Authors: We acknowledge that the abstract omits specifics on executable verification and the advantage reweighting mechanism. These components are described in the full manuscript. We will revise the abstract to include concise descriptions of both the verification process and the reweighting approach, enabling readers to better evaluate potential sources of gains or bias. revision: yes
Circularity Check
No circularity; empirical results with no derivations or self-referential reductions
Full rationale
The paper describes an empirical framework (SEAL) for co-evolving LLM agents and environments via on-policy trajectories and turn-level failure labels, reporting experimental gains (+8.25 to +26.25 points with 400 samples, OOD transfer) across backbones. No equations, mathematical derivations, fitted parameters, or first-principles claims appear in the provided text. The central results are presented as direct experimental outcomes rather than predictions that reduce to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps. The diagnosis procedure and shared-signal assumption are described at a high level but not derived from prior results within the paper itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: failed rollouts can be reliably diagnosed into accurate turn-level failure labels that serve as a useful shared signal.
invented entities (1)
- Agent-Environment Misalignment: no independent evidence
Reference graph
Works this paper leans on
- [1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022
- [2] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023
- [3] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023
- [4] Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, et al. Omnivideo-r1: Reinforcing audio-visual reasoning with query intention and modality attention. arXiv preprint arXiv:2602.05847, 2026
- [5] Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, et al. Dual latent memory for visual multi-agent system. arXiv preprint arXiv:2602.00471, 2026
- [6] Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025
- [7] Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, et al. Agentic reasoning for large language models. arXiv preprint arXiv:2601.12538, 2026
- [8] Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jian-Guang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 496–507, 2025
- [9] Minghe Gao, Zhongqi Yue, Wenjie Yan, Yihao Hu, Wei Ji, Siliang Tang, Jun Xiao, Tat-Seng Chua, Yueting Zhuang, and Juncheng Li. Counterfactual evolution of multimodal datasets via visual programming. Advances in Neural Information Processing Systems, 38:81947–81976, 2026
- [10] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023
- [11] Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, et al. Agentevolver: Towards efficient self-evolving agent system. arXiv preprint arXiv:2511.10395, 2025
- [12] Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents. arXiv preprint arXiv:2508.02085, 2025
- [13] Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025
- [14] Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, and Zhihao Wen. Atlasva: Self-evolving visual skill memory for teacher-free vlm agents. arXiv preprint arXiv:2605.17933, 2026
- [15] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 3102–3116, 2023
- [16] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
- [17] Kaiming Liu, Xuanyu Lei, Ziyue Wang, Peng Li, and Yang Liu. Agent-environment alignment via automated interface generation. arXiv preprint arXiv:2505.21055, 2025
- [18] Xingkun Yin and Hongyang Du. Glove: Global verifier for llm memory-environment realignment. arXiv preprint arXiv:2601.19249, 2026
- [19] Hanli Peng, Yongsen Zheng, Ziyao Liu, and Kwok-Yan Lam. Tool execution hallucination in llm-based agents: A unified taxonomy with detection, mitigation, and future directions. TechRxiv, 2026
- [20] Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, and Eduard Hovy. Evotool: Self-evolving tool-use policy optimization in llm agents via blame-aware mutation and diversity-aware selection. arXiv preprint arXiv:2603.04900, 2026
- [21] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025
- [22] Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Training language model agents to reflect via iterative self-training. arXiv preprint arXiv:2501.11425, 2025
- [23] Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025
- [24] Zhishang Xiang, Chengyi Yang, Zerui Chen, Zhimin Wei, Yunbo Tang, Zongpei Teng, Zexi Peng, Zongxia Li, Chengsong Huang, Yicheng He, et al. A systematic survey of self-evolving agents: From model-centric to environment-driven co-evolution. TechRxiv, 2026
- [25] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020
- [26] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009
- [27] Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, and Tao Lin. Don’t just fine-tune the agent, tune the environment. arXiv preprint arXiv:2510.10197, 2025
- [28] Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang, and Xiangxiang Chu. Coevolve: Training llm agents via agent-data mutual evolution. arXiv preprint arXiv:2604.15840, 2026
- [29] Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu, Tong Zhao, Maolin Wang, Long Chen, Dong Wang, Yicheng Chen, et al. From failure to mastery: Generating hard samples for tool-use agents. arXiv preprint arXiv:2601.01498, 2026
- [30] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026
- [31] Hongzhuo Yu, Fei Zhu, Guo-Sen Xie, and Ling Shao. Self-consolidation for self-evolving agents. arXiv preprint arXiv:2602.01966, 2026
- [32] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:25...
- [33] Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. arXiv preprint arXiv:2506.10943, 2025
- [34] Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2020
- [35] Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. Automatic curriculum learning for deep RL: A short survey. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 4819–4825, 2020
- [36] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, 2023
- [37] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, 2024
- [38] Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint arXiv:2305.17126, 2023
- [39] Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939, 2023
- [40] Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R Fung, Hao Peng, and Heng Ji. Craft: Customizing llms by creating and retrieving from specialized toolsets. arXiv preprint arXiv:2309.17428, 2023
- [41] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025
- [42] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023
- [43] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023
- [44] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
- [45] Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, et al. Agentgym: Evolving large language model-based agents across diverse environments. arXiv preprint arXiv:2406.04151, 2024
- [46] Jiayi Zhang, Yiran Peng, Fanqi Kong, Cheng Yang, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, et al. Autoenv: Automated environments for measuring cross-environment agent learning. arXiv preprint arXiv:2511.19304, 2025
- [47] Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, and Mengdi Wang. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators. arXiv preprint arXiv:2512.19682, 2025
- [48] Google. Gemini 3 pro preview model documentation. Google AI for Developers documentation, 2025. https://ai.google.dev/gemini-api/docs/models/gemini-3-pro-preview
- [49] Anthropic. Claude sonnet 4.5 system card. System card, 2025. https://www.anthropic.com/claude-sonnet-4-5-system-card
- [50] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, Aaron Ostrow, Aaron Welihinda, Alex Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
- [51] Gheorghe Comanici, David Bieber, Mike Schaekermann, Panupong Pasupat, Nitish Sachdeva, Inderjit Dhillon, Michael Blistein, Omer Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
- [52] Zhipu AI. Glm-4.6 model card. Hugging Face model card, 2025. https://huggingface.co/zai-org/GLM-4.6
- [53] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jinkai Xu, Jing Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Ke...
- [54] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi X...
- [55] Salesforce AI Research. xlam-2-3b-fc-r model card. Hugging Face model card, 2025. https://huggingface.co/Salesforce/xLAM-2-3b-fc-r
- [56] Weiwen Liu, Xingshan Zeng, Keqing He, Yongliang Wang, Zhaoyang Yan, Fanya Wang, Jingsheng Cheng, Runlong Wang, Minpeng Shen, Xin Jiang, Yujie Qian, Qun Liu, and Lifeng Shang. Toolace: Winning the points of llm function calling. In International Conference on Learning Representations, 2025
- [57] BitAgent. Bitagent-8b model card. Hugging Face model card, 2025. https://huggingface.co/BitAgent/BitAgent-8B
- [58] watt-ai. watt-tool-8b model card. Hugging Face model card, 2024. https://huggingface.co/watt-ai/watt-tool-8B
Appendix A Benchmark and Evaluation Details
BFCL V3. Our in-distribution evaluation uses the multi-turn subset of the Berkeley Function-Calling Leaderboard (BFCL) V3. The benchmark evaluates whether an agent can correctly use executable tools ov...