
arXiv: 2606.03980 · v1 · pith:G7RBLVKP · submitted 2026-06-02 · 💻 cs.LG · cs.CL

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Pith reviewed 2026-06-28 10:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reward model · agent skill · heterogeneous evaluation · LLM post-training · reinforcement learning · best-of-N selection · reward benchmarks

The pith

Skill-RM reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill to unify heterogeneous evaluation criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Skill-RM as a way to handle the variety of evaluation signals used by reward models in LLM training. Instead of maintaining separate rule checkers, reference answers, checklists, and rubrics, it casts reward computation as one structured agentic task that a single reusable skill can execute. The skill selects and combines evidence on the fly for each input. Experiments report better results than standard judge models on reward benchmarks and in downstream uses such as best-of-N selection and reinforcement learning. The central idea is that an agentic interface can deliver consistent, transparent scoring across task types.

Core claim

Skill-RM supplies a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. Treating reward computation as a structured agentic task gives a consistent interface for orchestrating heterogeneous resources and dynamically selecting and aggregating evidence tailored to each input, which yields consistency and transparency across diverse tasks.

What carries the argument

The Reward-Evaluation Skill, a reusable agentic module that dynamically selects and aggregates evidence from rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics.
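To make the "select, then aggregate" idea concrete, here is a minimal Python sketch under loudly stated assumptions. Every name below (`EvalInput`, `rule_verifier`, `reference_match`, `checklist_score`, `rubric_score`, `reward_evaluation_skill`) is hypothetical; each evidence source is a cheap deterministic stand-in for what the paper describes as an agentic skill, and the hard rule gate plus unweighted mean are illustrative choices, not the paper's aggregation rule. A source returns `None` when it does not apply to an input, which is how per-input selection is modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EvalInput:
    """Hypothetical container for one (prompt, response) pair plus optional resources."""
    prompt: str
    response: str
    reference: Optional[str] = None                          # ground-truth answer, if any
    checklist: List[str] = field(default_factory=list)       # procedural items that must appear
    rubric: Dict[str, float] = field(default_factory=dict)   # criterion keyword -> weight
    max_words: Optional[int] = None                          # a simple rule-based constraint


# Each source returns a score in [0, 1], or None when it does not apply to this input.
def rule_verifier(x: EvalInput) -> Optional[float]:
    if x.max_words is None:
        return None
    return 1.0 if len(x.response.split()) <= x.max_words else 0.0


def reference_match(x: EvalInput) -> Optional[float]:
    if x.reference is None:
        return None
    return 1.0 if x.reference.strip().lower() in x.response.lower() else 0.0


def checklist_score(x: EvalInput) -> Optional[float]:
    if not x.checklist:
        return None
    hits = sum(item.lower() in x.response.lower() for item in x.checklist)
    return hits / len(x.checklist)


def rubric_score(x: EvalInput) -> Optional[float]:
    total = sum(x.rubric.values())
    if total <= 0:
        return None  # no rubric (or only zero weights): not applicable
    met = sum(w for kw, w in x.rubric.items() if kw.lower() in x.response.lower())
    return met / total


EVIDENCE_SOURCES: Dict[str, Callable[[EvalInput], Optional[float]]] = {
    "rule": rule_verifier,
    "reference": reference_match,
    "checklist": checklist_score,
    "rubric": rubric_score,
}


def reward_evaluation_skill(x: EvalInput) -> Tuple[float, Dict[str, float]]:
    """Select the applicable evidence, aggregate it, and return (reward, trace).

    The trace keeps every per-source score, which is what makes the reward inspectable.
    """
    trace: Dict[str, float] = {}
    for name, source in EVIDENCE_SOURCES.items():
        score = source(x)
        if score is not None:  # dynamic selection: skip sources this input does not support
            trace[name] = score
    if not trace:
        return 0.0, trace
    if trace.get("rule") == 0.0:  # assumed hard gate: a failed rule check zeroes the reward
        return 0.0, trace
    return sum(trace.values()) / len(trace), trace
```

For example, `EvalInput(prompt="Capital of France, in at most five words?", response="Paris is the capital.", reference="Paris", max_words=5)` activates only the rule and reference sources and receives a reward of 1.0; the checklist and rubric sources are skipped because that input supplies neither.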

If this is right

  • Skill-RM delivers higher scores than traditional judge baselines on standard reward benchmarks.
  • The same model improves best-of-N selection quality when used as the ranking signal.
  • Reinforcement learning pipelines obtain stronger training signals from the dynamically orchestrated evidence.
  • Evaluation becomes consistent and transparent across tasks that previously required separate verifiers.
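The bullets above name two downstream uses of the same reward signal. As a hedged sketch (the function names, the toy reward, and the GRPO-style group normalization are assumptions for illustration; this page does not say which RL algorithm the paper uses), best-of-N selection keeps the highest-scoring sample, while a group-based RL update turns the same scores into advantages centered on the group mean:

```python
import statistics
from typing import Callable, List, Sequence, Tuple

RewardFn = Callable[[str, str], float]  # hypothetical signature: (prompt, response) -> reward


def best_of_n(prompt: str, candidates: Sequence[str], reward_fn: RewardFn) -> Tuple[int, str]:
    """Return (index, text) of the highest-reward candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    scores = [reward_fn(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])  # max keeps the first maximum
    return best, candidates[best]


def group_advantages(prompt: str, samples: Sequence[str], reward_fn: RewardFn,
                     eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages (r - mean) / (std + eps), as in GRPO-style training."""
    rewards = [reward_fn(prompt, s) for s in samples]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_reward(prompt: str, response: str) -> float:
    """Stand-in for a Skill-RM score: exact match against a known answer."""
    return 1.0 if response.strip() == "4" else 0.0
```

With this normalization, a group in which every sample gets the same reward yields all-zero advantages, so a reward model that cannot separate samples contributes no learning signal for that prompt.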

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The agentic interface could be applied to other LLM evaluation settings that mix rules and rubrics, such as safety classifiers.
  • A single skill might replace collections of task-specific verifiers in large-scale alignment pipelines.
  • Dynamic evidence selection raises the possibility of measuring which evidence types contribute most to final scores on different domains.
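One way to act on the last point is a leave-one-out ablation: recompute agreement with labels after dropping each evidence type and report the accuracy lost. The sketch below assumes per-item evidence traces (source name to score in [0, 1]), binary human labels, a mean-then-threshold aggregator, and hypothetical function names; none of this is taken from the paper, and leave-one-out captures only marginal contributions, not interactions between sources.

```python
from typing import Dict, List, Optional

Trace = Dict[str, float]  # hypothetical: evidence source name -> score in [0, 1]


def aggregate(trace: Trace, drop: Optional[str] = None) -> float:
    """Mean of the remaining evidence; an item left with no evidence scores 0.0 (an assumption)."""
    kept = [v for k, v in trace.items() if k != drop]
    return sum(kept) / len(kept) if kept else 0.0


def accuracy(traces: List[Trace], labels: List[int], drop: Optional[str] = None,
             threshold: float = 0.5) -> float:
    """Fraction of items where (aggregated score >= threshold) matches the binary label."""
    correct = sum((aggregate(t, drop) >= threshold) == bool(y) for t, y in zip(traces, labels))
    return correct / len(labels)


def evidence_contributions(traces: List[Trace], labels: List[int]) -> Dict[str, float]:
    """Leave-one-out: accuracy lost when each evidence type is removed from every item."""
    full = accuracy(traces, labels)
    sources = sorted({k for t in traces for k in t})
    return {s: full - accuracy(traces, labels, drop=s) for s in sources}
```

A negative contribution would mean the aggregated score agrees with the labels more often without that source, which is exactly the kind of domain-specific signal the bullet suggests measuring.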

Load-bearing premise

Reformulating reward computation as execution of a reusable Reward-Evaluation Skill will integrate heterogeneous evidence types while preserving or improving evaluation quality without introducing new inconsistencies or selection biases.

What would settle it

A controlled test set of new heterogeneous criteria where Skill-RM produces lower agreement with human labels or lower downstream task performance than the strongest single-criterion baseline.
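The falsification test above compares agreement with human labels against the strongest single-criterion baseline. A minimal sketch of that comparison follows, assuming pairwise human preferences, a scalar score from each reward model for both responses, ties counted as disagreement, and a paired percentile bootstrap for the uncertainty of the difference. All names are hypothetical, and the bootstrap is one common choice rather than the paper's stated protocol.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (score for response A, score for response B)


def pair_hits(scores: Sequence[Pair], human_prefers_a: Sequence[bool]) -> List[int]:
    """1 where the model ranks the human-preferred response strictly higher, else 0."""
    return [int(sa > sb) if pa else int(sb > sa)
            for (sa, sb), pa in zip(scores, human_prefers_a)]


def agreement(scores: Sequence[Pair], human_prefers_a: Sequence[bool]) -> float:
    hits = pair_hits(scores, human_prefers_a)
    return sum(hits) / len(hits)


def paired_bootstrap_diff(new: Sequence[Pair], base: Sequence[Pair],
                          human_prefers_a: Sequence[bool], n_boot: int = 2000,
                          seed: int = 0) -> Tuple[float, Tuple[float, float]]:
    """Observed agreement difference (new - base) and a 95% percentile bootstrap interval.

    Items are resampled jointly for both models, so the interval reflects paired variation.
    """
    h_new = pair_hits(new, human_prefers_a)
    h_base = pair_hits(base, human_prefers_a)
    n = len(h_new)
    observed = (sum(h_new) - sum(h_base)) / n
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(h_new[i] - h_base[i] for i in idx) / n)
    diffs.sort()
    return observed, (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
```

Under these assumptions, an interval lying entirely below zero would be the outcome that counts against Skill-RM; an interval straddling zero would leave the comparison unresolved.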

Original abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Skill-RM, a framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. This agentic approach provides a consistent interface for dynamically selecting and aggregating heterogeneous evidence (rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics) to produce reward signals for LLM post-training. The paper claims that this yields superior performance over traditional judge baselines on reward benchmarks and in downstream applications including best-of-N selection and reinforcement learning, with the code released at a public repository.

Significance. If the reported gains are reproducible and not artifacts of post-hoc choices, the work could offer a practical unification of disparate reward evaluation methods, reducing the need for task-specific verifiers in RFT and RL pipelines. The agentic formulation is a conceptual contribution that may generalize beyond the evaluated settings.

major comments (1)
  1. [Abstract] Abstract: the central claim of consistent outperformance on reward benchmarks and downstream tasks is asserted without any reported metrics, baseline names, dataset sizes, ablation results, or statistical significance tests. This absence prevents verification that the gains derive from the proposed orchestration mechanism rather than implementation details or evaluation choices.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the opportunity to clarify the presentation of our results. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of consistent outperformance on reward benchmarks and downstream tasks is asserted without any reported metrics, baseline names, dataset sizes, ablation results, or statistical significance tests. This absence prevents verification that the gains derive from the proposed orchestration mechanism rather than implementation details or evaluation choices.

    Authors: The abstract is written as a high-level summary, consistent with standard practice for concise overviews. The full manuscript (Sections 4 and 5) reports the requested details: concrete metrics on multiple reward benchmarks, comparisons against named traditional judge baselines, dataset sizes and splits, ablation studies isolating the contribution of dynamic evidence orchestration, and statistical significance testing. These results support that the observed gains stem from the agentic formulation rather than implementation artifacts. We are willing to incorporate one or two key quantitative highlights into the abstract in a revision if the editor prefers a more results-oriented abstract. revision: partial

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper proposes Skill-RM as a methodological reformulation of reward modeling into an agentic skill-execution task, with performance claims resting on external benchmark experiments rather than any internal derivation chain. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the abstract or described full text that reduce outputs to inputs by construction. The framework description and experimental results stand as independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract introduces the Reward-Evaluation Skill as a core construct without detailing supporting axioms or free parameters; no invented entities beyond the skill itself are described.

invented entities (1)
  • Reward-Evaluation Skill (no independent evidence)
    purpose: Reusable agent skill that dynamically selects and aggregates heterogeneous evidence for reward computation
    Core new component introduced to unify evaluation criteria; no independent evidence is supplied in the abstract.

