arxiv: 2606.17546 · v1 · pith:NPL33ZPQnew · submitted 2026-06-16 · 💻 cs.AI

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

Pith reviewed 2026-06-27 01:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords self-evolving agents · LLM agents · agent harness · evaluation environment · self-evolution · Terminal-Bench · held-out transfer · replay diagnostics

The pith

SEAGym shows that multiple evaluation views are required to detect whether agent harness updates produce reusable gains or merely overfit recent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SEAGym as an environment that converts static benchmarks into sources supplying train batches, frozen validation sets, held-out ID and OOD tests, replay diagnostics, and cost records. It claims that conventional single-curve or isolated-score evaluations obscure whether harness changes improve generalization, raise costs, or degrade earlier behavior. When the environment is applied to Terminal-Bench 2.0 and HLE, comparisons of ACE, TF-GRPO, and AHE under a common protocol demonstrate that the separate views yield non-redundant information. A sympathetic reader would therefore treat single-metric progress reports as incomplete for self-evolving agents.

Core claim

SEAGym supplies dynamic task sources together with train, validation, test, replay, and cost views so that harness updates can be tracked for reusability, overfitting, cost growth, and retention of prior behavior; experiments on two benchmarks reveal that frequent updates may leave held-out performance unchanged, that intermediate snapshots can later degrade, and that source diversity plus model choice influence observed reliability.

What carries the argument

SEAGym evaluation environment that turns Harbor-compatible benchmarks into multi-view task sources with fixed update-validation, held-out transfer, replay, and snapshot recording.
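
A minimal sketch of what a multi-view task source could look like, assuming a simple random partition of one benchmark into train batches, a frozen validation set, and a held-out ID test set, with OOD tasks drawn from a second benchmark. The names `TaskSource` and `build_source` and every parameter are hypothetical illustrations, not SEAGym's actual interface.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSource:
    """Hypothetical container for the views described above."""
    train_batches: List[List[str]]  # consumed in order during evolution
    validation: List[str]           # frozen: reused unchanged at every snapshot
    test_id: List[str]              # held-out, same benchmark as train
    test_ood: List[str]             # held-out, different benchmark


def build_source(id_tasks, ood_tasks, batch_size, n_val, n_test, seed=0):
    """Partition in-distribution tasks into disjoint views (assumed scheme)."""
    rng = random.Random(seed)
    tasks = list(id_tasks)
    rng.shuffle(tasks)
    validation = tasks[:n_val]
    test_id = tasks[n_val:n_val + n_test]
    train = tasks[n_val + n_test:]
    batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
    return TaskSource(batches, validation, test_id, list(ood_tasks))


source = build_source([f"task-{i}" for i in range(100)], ["ood-0", "ood-1"],
                      batch_size=20, n_val=10, n_test=10)
print(len(source.train_batches))  # 4 batches of 20 tasks each
```

Keeping the validation and test views disjoint from the train batches is what lets a rising train score be checked against views the harness never updated on.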

If this is right

  • Frequent harness updates can fail to raise held-out performance even while training scores rise.
  • Intermediate snapshots that look strong on validation can later lose capability on the same tasks.
  • Harness reliability depends on the diversity of task sources and the choice of model backend.
  • Replay diagnostics are needed to detect whether an update harms earlier learned behavior.
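
The replay point in the list above can be illustrated with a small sketch, assuming each snapshot is re-run on the same replay tasks and yields one pass/fail result per task. The function name `replay_delta` and the task labels are hypothetical; the paper's own replay metrics may be defined differently.

```python
def replay_delta(before, after):
    """Compare per-task success (bool) of two snapshots on shared replay tasks."""
    shared = before.keys() & after.keys()
    gained = sorted(t for t in shared if after[t] and not before[t])
    lost = sorted(t for t in shared if before[t] and not after[t])
    return {"gained": gained, "lost": lost, "net": len(gained) - len(lost)}


# Made-up results for two snapshots on three replay tasks.
before = {"a": True, "b": False, "c": True}
after = {"a": False, "b": True, "c": True}
print(replay_delta(before, after))
# {'gained': ['b'], 'lost': ['a'], 'net': 0}
```

In this toy case the aggregate success rate is unchanged, yet task `a` regressed. A single success-rate curve would hide that loss, which is the kind of signal a per-task replay view is meant to expose.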

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of self-evolving agents may need to optimize harnesses against a vector of views rather than a single scalar reward.
  • Cost records could be used to penalize updates that increase token usage without corresponding generalization gains.
  • The framework suggests that future work should test whether adding OOD transfer views changes which evolution method ranks highest.
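
One way to read the first two extensions above is as a multi-objective selection problem. The sketch below is an assumption-laden illustration, not something the paper proposes: it keeps only snapshots that are not dominated on a hypothetical pair of success rate (higher is better) and cost (lower is better). The function name and all numbers are made up.

```python
def pareto_front(snapshots):
    """snapshots: dict name -> (success_rate, cost). Return non-dominated names."""
    front = []
    for name, (s, c) in snapshots.items():
        dominated = any(
            (s2 >= s and c2 <= c) and (s2 > s or c2 < c)
            for other, (s2, c2) in snapshots.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Illustrative values only: (held-out success rate, relative token cost).
snapshots = {
    "s0": (0.30, 1.0),
    "s1": (0.35, 1.5),
    "s2": (0.35, 2.0),  # same success as s1 at higher cost
    "s3": (0.28, 1.2),  # worse than s0 on both axes
}
print(pareto_front(snapshots))  # ['s0', 's1']
```

A scalar reward would force a fixed exchange rate between cost and success; a Pareto filter defers that choice while still discarding updates that only add cost.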

Load-bearing premise

The selected benchmarks and shared epoch-batch protocol give an unbiased comparison of the three evolution methods without artifacts from task choice or model selection.

What would settle it

An experiment in which all evaluation views produce identical method rankings and identical conclusions about snapshot stability would falsify the claim that the views supply complementary signals.
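
The falsification test above can be sketched as a check on whether every view produces the same method ranking. The helper names and the scores are hypothetical placeholders (the methods are labelled `m1` to `m3` to avoid suggesting real results).

```python
def ranking(scores):
    """Order methods from best to worst; ties broken by name for determinism."""
    return tuple(sorted(scores, key=lambda m: (-scores[m], m)))


def views_agree(view_scores):
    """Return (True/False, per-view rankings) for a dict view -> {method: score}."""
    rankings = {view: ranking(s) for view, s in view_scores.items()}
    return len(set(rankings.values())) == 1, rankings


# Made-up scores for three methods under two views.
views = {
    "validation": {"m1": 0.40, "m2": 0.30, "m3": 0.20},
    "test_ood": {"m1": 0.20, "m2": 0.30, "m3": 0.25},
}
agree, per_view = views_agree(views)
print(agree)                  # False
print(per_view["test_ood"])   # ('m2', 'm3', 'm1')
```

If `views_agree` returned `True` for every pair of views, and snapshot-stability conclusions also matched, the extra views would add no information beyond a single curve.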

Figures

Figures reproduced from arXiv: 2606.17546 by Bin Liang, Changshui Zhang, Chuanyi Xue, Congjie Zheng, Jun Yang.

Figure 1: Overview of SEAGYM. The environment samples train batches, runs task episodes, records trajectories and verifier feedback, lets the self-evolving agent update its own state, and records evaluation points as frozen snapshots. Snapshot quality is measured with frozen update-validation, final held-out tests, replay diagnostics, and cost metrics.
[Section 3.2, Evolution Schedule] SEAGYM represents online, single-task, ba…
Figure 2: Baseline learning curves for AHE, ACE, and …
Figure 3: AHE train replay grid. Green cells are successful trials, gray cells are failed trials, and red cells are rollout …
Figure 4: AHE train replay diagnostics. Left: success rate on the 80 source train tasks. Middle: pairwise delta task …
Figure 5: Cross-model ID and OOD gains. Rows indi…
Figure 7: AHE batch-size success-rate breakdown by …
Figure 9: Baseline learning curves by train batch index.
Figure 10: AHE batch-size learning curves by train batch index. Different batch sizes have different numbers of update points; validation points are plotted at the corresponding epoch-end batch index.
Figure 11: AHE source-diversity learning curves by train batch index. Validation points are plotted at the corresponding epoch-end batch index.
[Appendix D.1, Cross-Model Continuation Results] The cross-model appendix compares AHE training runs using the same Terminal-Bench + HLE source setting and batch-20 schedule, with DeepSeek-V4-Flash, GLM-5.1, and GPT-5.4 as the training backend. The GPT-5.4 row uses the continuation ru…
Figure 12: AHE cross-model learning curves. The left …
Figure 13: AHE cross-model success-rate breakdown by …
Figure 14: Source-group train replay diagnostics for AHE. Each panel reports success rate and …
Original abstract

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SEAGym, an evaluation environment that converts Harbor-compatible benchmarks (Terminal-Bench 2.0 and HLE) into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID/OOD transfer, replay diagnostics, and cost records. It compares three harness-update methods (ACE, TF-GRPO, AHE) under a single shared epoch/batch protocol and claims that the multi-view evaluation yields complementary signals: frequent updates can fail to improve held-out performance, useful intermediate snapshots can later collapse, and source diversity plus model backend can affect reliability.

Significance. If the multi-view diagnostics are shown to be robust and the protocol artifacts are ruled out, SEAGym would provide a concrete advance over single-curve or isolated-task evaluations of self-evolving agents, enabling clearer detection of overfitting, collapse, and cost trade-offs. The work ships an explicit evaluation harness rather than isolated scores, which is a positive contribution to reproducibility in this area.

major comments (2)
  1. [Experimental protocol] Experimental protocol section: the central claim that the views supply complementary signals rests on the shared epoch/batch protocol being neutral across ACE, TF-GRPO, and AHE. No ablation or justification is given for the chosen batch sizes, epoch counts, or task ordering; if these hyperparameters are suboptimal for one method, the observed patterns (e.g., frequent updates failing held-out, snapshot collapse) could be protocol artifacts rather than intrinsic properties of self-evolution.
  2. [Results] Results on Terminal-Bench 2.0 and HLE: the claim that source diversity and model backend affect harness reliability is demonstrated only on the two chosen benchmarks. Without a sensitivity study across additional task distributions or an explicit check that task selection does not systematically favor certain update frequencies, the generality of the complementary-signals conclusion remains under-supported.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction use “Harbor-compatible” without a forward reference or brief definition; a one-sentence gloss would improve accessibility.
  2. [Figures] Figure captions for the multi-view diagrams should explicitly label which panels correspond to train, val, test, replay, and cost records.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our manuscript. We address the major comments point by point below, indicating planned revisions where appropriate to strengthen the claims about complementary evaluation signals in SEAGym.

Point-by-point responses
  1. Referee: [Experimental protocol] Experimental protocol section: the central claim that the views supply complementary signals rests on the shared epoch/batch protocol being neutral across ACE, TF-GRPO, and AHE. No ablation or justification is given for the chosen batch sizes, epoch counts, or task ordering; if these hyperparameters are suboptimal for one method, the observed patterns (e.g., frequent updates failing held-out, snapshot collapse) could be protocol artifacts rather than intrinsic properties of self-evolution.

    Authors: We agree that justifying the shared protocol is important to support the claim of complementary signals. The protocol was designed to hold training conditions constant across methods, allowing us to attribute differences to the update strategies themselves. Specific choices for batch sizes and epochs were made to ensure feasible computation while providing multiple update opportunities within the self-evolution process. Task ordering was randomized per batch to avoid systematic biases. In the revised version, we will expand the experimental protocol section with a dedicated paragraph explaining these choices and their rationale, including references to preliminary tuning on smaller scales. This addresses the concern about potential artifacts. revision: yes

  2. Referee: [Results] Results on Terminal-Bench 2.0 and HLE: the claim that source diversity and model backend affect harness reliability is demonstrated only on the two chosen benchmarks. Without a sensitivity study across additional task distributions or an explicit check that task selection does not systematically favor certain update frequencies, the generality of the complementary-signals conclusion remains under-supported.

    Authors: The two benchmarks were chosen to represent different domains: Terminal-Bench 2.0 for practical terminal interactions and HLE for advanced reasoning tasks, providing initial evidence that source diversity impacts reliability. We acknowledge that this does not constitute a full sensitivity analysis across many distributions. To strengthen the manuscript, we will add a limitations paragraph in the discussion noting the scope of the benchmarks used and the need for future work on broader task sets. Additionally, we will include an explicit statement that task selection was based on availability in Harbor-compatible format rather than optimization for specific methods. revision: partial
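
The shared epoch/batch protocol discussed in the first response can be sketched as a plain loop, assuming that tasks are reshuffled once per epoch, that the harness is updated after every batch, and that the frozen validation set is scored at each epoch end. The function `run_protocol` and its callback parameters are hypothetical; the paper's exact schedule, including how ordering is randomized, may differ.

```python
import random


def run_protocol(train_tasks, batch_size, epochs, run_task, update, validate, seed=0):
    """Return a list of (batch_index, validation_score) recorded at epoch ends."""
    rng = random.Random(seed)
    log = []
    batch_index = 0
    for _ in range(epochs):
        order = list(train_tasks)
        rng.shuffle(order)  # assumed: fresh task order each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            feedback = [run_task(task) for task in batch]
            update(feedback)  # method-specific harness update (ACE, TF-GRPO, AHE, ...)
            batch_index += 1
        log.append((batch_index, validate()))  # frozen validation at epoch end
    return log


# Stub callbacks standing in for real task execution and harness updates.
updates = []
log = run_protocol(list(range(80)), batch_size=20, epochs=2,
                   run_task=lambda task: task % 2 == 0,
                   update=updates.append,
                   validate=lambda: len(updates))
print(log)  # [(4, 4), (8, 8)]
```

Holding `batch_size`, `epochs`, and `seed` fixed across methods is what makes the comparison shared; the referee's concern is that one such fixed setting may still favor some methods over others.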

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation framework with no derivations

Full rationale

The paper describes SEAGym, an evaluation harness for agent evolution, and reports empirical results from running ACE, TF-GRPO, and AHE on Terminal-Bench 2.0 and HLE under a shared protocol. No equations, parameter fits, predictions, or derivation chains exist that could reduce to inputs by construction. Claims about complementary signals rest on direct experimental observations rather than self-definitional mappings, fitted inputs renamed as predictions, or load-bearing self-citations. The work is self-contained as a benchmark and protocol description; the reader's assessment of score 1.0 aligns with the absence of any mathematical or definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all arrays left empty due to lack of technical detail.
