Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Gangwei Jiang; Guanjun Jiang; Jiaqi Guo; Jingwei Ni; Junling Liu; Kai Tang; Mengyu Zhou; Pengyu Cheng; Qinliang Su; Siyuan Huang

arxiv: 2606.03980 · v1 · pith:G7RBLVKPnew · submitted 2026-06-02 · 💻 cs.LG · cs.CL

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

Pith reviewed 2026-06-28 10:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords reward model · agent skill · heterogeneous evaluation · LLM post-training · reinforcement learning · best-of-N selection · reward benchmarks

The pith

Skill-RM reformulates reward modeling as execution of a reusable Reward-Evaluation Skill to unify heterogeneous criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Skill-RM as a way to handle the variety of evaluation signals used in reward models for LLM training. Instead of separate rule checkers, references, checklists, and rubrics, it casts reward computation as one structured agent task that a single reusable skill can run. The skill selects and combines evidence on the fly for each input. Experiments report better results than standard judge models on reward benchmarks plus downstream uses such as best-of-N selection and reinforcement learning. The central idea is that an agentic interface can deliver consistent, transparent scoring across task types.

Core claim

Skill-RM supplies a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. Treating reward computation as a structured agentic task gives a consistent interface for orchestrating heterogeneous resources and dynamically selecting and aggregating evidence tailored to each input, which yields consistency and transparency across diverse tasks.

What carries the argument

The Reward-Evaluation Skill, a reusable agentic module that dynamically selects and aggregates evidence from rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics.
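
As a rough illustration of what "dynamically selects and aggregates evidence" could mean in practice, the sketch below runs a few toy evidence sources and averages only the ones that apply to a given input. This is not the authors' implementation: the names `Sample`, `rule_verifier`, `reference_match`, `checklist_coverage`, and `evaluate` are hypothetical, the equal-weight average is an assumption, and rubric scoring is left out because it would normally require a model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Sample:
    """One response to score, plus whatever evaluation resources exist for it."""
    prompt: str
    response: str
    reference: Optional[str] = None   # ground-truth answer, if available
    max_words: Optional[int] = None   # rule-style length constraint, if any
    checklist: tuple = ()             # phrases the response should contain


# Each evidence source returns a score in [0, 1], or None when it does not apply.
def rule_verifier(s: Sample) -> Optional[float]:
    if s.max_words is None:
        return None
    return 1.0 if len(s.response.split()) <= s.max_words else 0.0


def reference_match(s: Sample) -> Optional[float]:
    if s.reference is None:
        return None
    return 1.0 if s.reference.strip().lower() in s.response.lower() else 0.0


def checklist_coverage(s: Sample) -> Optional[float]:
    if not s.checklist:
        return None
    hits = sum(item.lower() in s.response.lower() for item in s.checklist)
    return hits / len(s.checklist)


# Hypothetical registry of evidence sources (assumed for illustration).
EVIDENCE_SOURCES: Dict[str, Callable[[Sample], Optional[float]]] = {
    "rule_verifier": rule_verifier,
    "reference": reference_match,
    "checklist": checklist_coverage,
}


def evaluate(s: Sample) -> dict:
    """Collect the evidence that applies to this sample and average it."""
    evidence = {}
    for name, source in EVIDENCE_SOURCES.items():
        score = source(s)
        if score is not None:          # selection: skip sources that do not apply
            evidence[name] = score
    # Aggregation: plain mean, an assumption made for this sketch only.
    reward = sum(evidence.values()) / len(evidence) if evidence else 0.0
    return {"reward": reward, "evidence": evidence}
```

Returning the per-source evidence next to the scalar reward is one simple way to keep the score inspectable, which is the kind of transparency the summary above attributes to the approach.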

If this is right

Skill-RM delivers higher scores than traditional judge baselines on standard reward benchmarks.
The same model improves best-of-N selection quality when used as the ranking signal.
Reinforcement learning pipelines obtain stronger training signals from the dynamically orchestrated evidence.
Evaluation becomes consistent and transparent across tasks that previously required separate verifiers.
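
For readers unfamiliar with best-of-N selection, the following minimal sketch shows the general technique: score each of N candidate responses with a reward function and keep the highest-scoring one. The function `best_of_n` and the toy `keyword_reward` are hypothetical stand-ins; in the paper's setting the reward would come from Skill-RM rather than a keyword check.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (first one wins on ties)."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=reward_fn)


def keyword_reward(response: str) -> float:
    """Toy reward used only for this example: 1.0 if 'paris' is mentioned."""
    return 1.0 if "paris" in response.lower() else 0.0


if __name__ == "__main__":
    samples = ["London", "Paris", "Rome"]
    print(best_of_n(samples, keyword_reward))
```

Because selection depends only on the ranking induced by the reward, a reward model can help here even when its absolute scores are poorly calibrated.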

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The agentic interface could be applied to other LLM evaluation settings that mix rules and rubrics, such as safety classifiers.
A single skill might replace collections of task-specific verifiers in large-scale alignment pipelines.
Dynamic evidence selection raises the possibility of measuring which evidence types contribute most to final scores on different domains.

Load-bearing premise

Reformulating reward computation as execution of a reusable Reward-Evaluation Skill will integrate heterogeneous evidence types while preserving or improving evaluation quality without introducing new inconsistencies or selection biases.

What would settle it

A controlled test set of new heterogeneous criteria where Skill-RM produces lower agreement with human labels or lower downstream task performance than the strongest single-criterion baseline.
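
One common way to quantify "agreement with human labels" is pairwise preference accuracy: the fraction of human-labeled (chosen, rejected) pairs for which the reward function scores the chosen response higher. The sketch below assumes that metric; the page does not say which agreement measure the paper uses, and `pairwise_agreement` is a hypothetical helper.

```python
from typing import Callable, Sequence, Tuple


def pairwise_agreement(
    pairs: Sequence[Tuple[str, str]],
    reward_fn: Callable[[str], float],
) -> float:
    """Fraction of (chosen, rejected) pairs where reward_fn prefers `chosen`.

    Ties count as disagreement, which is a conservative choice for this sketch.
    """
    if not pairs:
        raise ValueError("pairwise_agreement needs at least one pair")
    correct = sum(reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)
```

Running the same labeled pairs through Skill-RM and through the strongest single-criterion baseline, then comparing the two accuracies, would be one way to carry out the test described above.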

Original abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Skill-RM frames reward modeling as execution of a reusable agent skill to unify rule-based, reference, and rubric signals, but the abstract supplies no metrics or implementation details to check whether the gains are real.

Skill-RM's core idea is to treat reward computation as the execution of a single reusable agent skill that can pull in whatever evidence is needed for a given input. That is the main novelty the abstract puts forward.

The paper does a good job spelling out the heterogeneity problem in current reward models, where you have to juggle rule verifiers, ground truth, checklists, and rubrics without a single interface. The agentic orchestration is presented as a way to make that dynamic and consistent.

What stands out is the claim that this leads to better results in reward benchmarks and in applications like best-of-N selection and reinforcement learning. The authors say it outperforms traditional judge baselines through strategic evidence handling.

On the downside, none of the performance numbers, baseline comparisons, or experimental setups appear in the abstract. The soundness is hard to judge because we can't see the metrics or check for post-hoc choices. The assumption that the skill execution won't add new biases or inconsistencies is stated but not tested in the provided text.

The citation pattern isn't visible here either, so it's unclear how it positions against prior agentic or reward modeling work.

This kind of work is for researchers focused on improving reward models inside LLM training loops. If the full paper has reproducible experiments and the code delivers, it could be worth citing for the unification angle.

I would recommend sending it for peer review so the details can be examined properly.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Skill-RM, a framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. This agentic approach provides a consistent interface for dynamically selecting and aggregating heterogeneous evidence (rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics) to produce reward signals for LLM post-training. The paper claims that this yields superior performance over traditional judge baselines on reward benchmarks and in downstream applications including best-of-N selection and reinforcement learning, with the code released at a public repository.

Significance. If the reported gains are reproducible and not artifacts of post-hoc choices, the work could offer a practical unification of disparate reward evaluation methods, reducing the need for task-specific verifiers in RFT and RL pipelines. The agentic formulation is a conceptual contribution that may generalize beyond the evaluated settings.

major comments (1)

[Abstract] The central claim of consistent outperformance on reward benchmarks and downstream tasks is asserted without any reported metrics, baseline names, dataset sizes, ablation results, or statistical significance tests. This absence prevents verification that the gains derive from the proposed orchestration mechanism rather than implementation details or evaluation choices.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the opportunity to clarify the presentation of our results. We address the single major comment below.

Referee: [Abstract] The central claim of consistent outperformance on reward benchmarks and downstream tasks is asserted without any reported metrics, baseline names, dataset sizes, ablation results, or statistical significance tests. This absence prevents verification that the gains derive from the proposed orchestration mechanism rather than implementation details or evaluation choices.

Authors: The abstract is written as a high-level summary, consistent with standard practice for concise overviews. The full manuscript (Sections 4 and 5) reports the requested details: concrete metrics on multiple reward benchmarks, comparisons against named traditional judge baselines, dataset sizes and splits, ablation studies isolating the contribution of dynamic evidence orchestration, and statistical significance testing. These results support that the observed gains stem from the agentic formulation rather than implementation artifacts. We are willing to incorporate one or two key quantitative highlights into the abstract in a revision if the editor prefers a more results-oriented abstract.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper proposes Skill-RM as a methodological reformulation of reward modeling into an agentic skill-execution task, with performance claims resting on external benchmark experiments rather than any internal derivation chain. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the abstract or described full text that reduce outputs to inputs by construction. The framework description and experimental results stand as independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract introduces the Reward-Evaluation Skill as a core construct without detailing supporting axioms or free parameters; no invented entities beyond the skill itself are described.

invented entities (1)

Reward-Evaluation Skill (no independent evidence)
purpose: Reusable agent skill that dynamically selects and aggregates heterogeneous evidence for reward computation
Core new component introduced to unify evaluation criteria; no independent evidence supplied in abstract


work page doi:10.18653/v1/2024.findings-emnlp.620 2024

[14] [14]

2026 , url =

Liu, Chris Yuhao and Zeng, Liang and Xiao, Yuzhen and He, Jujie and Liu, Jiacai and Wang, Chaojie and Yan, Rui and Shen, Wei and Zhang, Fuxiang and Xu, Jiacheng and Liu, Yang , booktitle =. 2026 , url =

2026

[15] [15]

2024 , url =

Kim, Seungone and Shin, Jamin and Cho, Yejin and Jang, Joel and Longpre, Shayne and Lee, Hwaran and Yun, Sangdoo and Shin, Seongjin and Kim, Sungdong and Thorne, James and Seo, Minjoon , booktitle =. 2024 , url =

2024

[16] [16]

2024 , address =

Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon , booktitle =. 2024 , address =. doi:10.18653/v1/2024.emnlp-main.248 , url =

work page doi:10.18653/v1/2024.emnlp-main.248 2024

[17] [17]

2024 , eprint =

Generative Reward Models , author =. 2024 , eprint =

2024

[18] [18]

Self-Generated Critiques Boost Reward Modeling for Language Models , author =. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , month = apr, year =. doi:10.18653/v1/2025.naacl-long.573 , url =

work page doi:10.18653/v1/2025.naacl-long.573 2025

[19] [19]

2026 , url =

Chen, Xiusi and Li, Gaotang and Wang, Ziqi and Jin, Bowen and Qian, Cheng and Wang, Yu and Wang, Hongru and Zhang, Yu and Zhang, Denghui and Zhang, Tong and Tong, Hanghang and Ji, Heng , booktitle =. 2026 , url =

2026

[20] [20]

2025 , url =

Hong, Ilgee and Yu, Changlong and Qiu, Liang and Yan, Weixiang and Xu, Zhenghao and Jiang, Haoming and Zhang, Qingru and Lu, Qin and Liu, Xin and Zhang, Chao and Zhao, Tuo , booktitle =. 2025 , url =

2025

[21] [21]

Inference-time scaling for generalist reward modeling,

Inference-Time Scaling for Generalist Reward Modeling , author =. 2025 , eprint =. doi:10.48550/arXiv.2504.02495 , url =

work page doi:10.48550/arxiv.2504.02495 2025

[22] [22]

doi:10.48550/arXiv.2506.03637 , url =

Yu, Zhuohao and Zeng, Jiali and Gu, Weizheng and Wang, Yidong and Wang, Jindong and Meng, Fandong and Zhou, Jie and Zhang, Yue and Zhang, Shikun and Ye, Wei , year =. doi:10.48550/arXiv.2506.03637 , url =. 2506.03637 , archivePrefix =

work page doi:10.48550/arxiv.2506.03637

[23] [23]

Constitutional AI: Harmlessness from AI Feedback

Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan an...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.08073

[24] [24]

2024 , url =

Ye, Seonghyeon and Kim, Doyoung and Kim, Sungdong and Hwang, Hyeonbin and Kim, Seungone and Jo, Yongrae and Thorne, James and Kim, Juho and Seo, Minjoon , booktitle =. 2024 , url =

2024

[25] [25]

2025 , address =

Saad-Falcon, Jon and Vivek, Rajan Pathe and Berrios, William and Naik, Nandita Shankar and Franklin, Matija and Vidgen, Bertie and Singh, Amanpreet and Kiela, Douwe and Mehri, Shikib , booktitle =. 2025 , address =. doi:10.18653/v1/2025.findings-emnlp.176 , url =

work page doi:10.18653/v1/2025.findings-emnlp.176 2025

[26] [26]

Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =

Benchmarking Cognitive Biases in Large Language Models as Evaluators , author =. Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =. doi:10.18653/v1/2024.findings-acl.29 , url =

work page doi:10.18653/v1/2024.findings-acl.29 2024

[27] [27]

Advances in Neural Information Processing Systems , volume =

Checklists Are Better Than Reward Models for Aligning Language Models , author =. Advances in Neural Information Processing Systems , volume =. 2025 , url =

2025

[28] [28]

2026 , eprint =

Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric , author =. 2026 , eprint =

2026

[29] [29]

2023 , eprint =

Instruction-Following Evaluation for Large Language Models , author =. 2023 , eprint =

2023

[30] [30]

Advances in Neural Information Processing Systems , volume =

Generalizing Verifiable Instruction Following , author =. Advances in Neural Information Processing Systems , volume =. 2025 , url =

2025

[31] [31]

He, Yun and Li, Wenzhe and Zhang, Hejia and Li, Songlin and Mandyam, Karishma and Khosla, Sopan and Xiong, Yuanhao and Wang, Nanshu and Peng, Xiaoliang and Li, Beibin and Bi, Shengjie and Patil, Shishir G. and Qi, Qi and Feng, Shengyu and Katz-Samuels, Julian and Pang, Richard Yuanzhe and Gonugondla, Sujan and Lang, Hunter and Yu, Yue and Qian, Yundi and ...

work page doi:10.48550/arxiv.2511.10507

[32] [32]

2025 , address =

Peng, Hao and Qi, Yunjia and Wang, Xiaozhi and Xu, Bin and Hou, Lei and Li, Juanzi , booktitle =. 2025 , address =. doi:10.18653/v1/2025.emnlp-main.1542 , url =

work page doi:10.18653/v1/2025.emnlp-main.1542 2025

[33] [33]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =. doi:10.18653/v1/2025.acl-long.775 , url =

work page doi:10.18653/v1/2025.acl-long.775 2025

[34] [34]

2025 , url =

Liu, Yantao and Yao, Zijun and Min, Rui and Cao, Yixin and Hou, Lei and Li, Juanzi , booktitle =. 2025 , url =

2025

[35] [35]

2025 , url =

Tan, Sijun and Zhuang, Siyuan and Montgomery, Kyle and Tang, William Yuan and Cuadron, Alejandro and Wang, Chenguang and Popa, Raluca and Stoica, Ion , booktitle =. 2025 , url =

2025

[36] [36]

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Wen, Bosi and Niu, Yilin and Wang, Cunxiang and Ling, Xiaoying and Zhang, Ying and Ke, Pei and Wang, Hongning and Huang, Minlie , year =. doi:10.48550/arXiv.2603.04738 , url =. 2603.04738 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.04738

[37] [37]

Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment.CoRR, abs/2510.07743, 2025

Liu, Tianci and Xu, Ran and Yu, Tony and Hong, Ilgee and Yang, Carl and Zhao, Tuo and Wang, Haoyu , year =. doi:10.48550/arXiv.2510.07743 , url =. 2510.07743 , archivePrefix =

work page doi:10.48550/arxiv.2510.07743

[38] [38]

Auto-rubric: Learning to extract generalizable criteria for reward modeling.CoRR, abs/2510.17314, 2025

Xie, Lipeng and Huang, Sen and Zhang, Zhuo and Zou, Anni and Zhai, Yunpeng and Ren, Dingchao and Zhang, Kezun and Hu, Haoyuan and Liu, Boyin and Chen, Haoran and Liu, Zhaoyang and Ding, Bolin , year =. doi:10.48550/arXiv.2510.17314 , url =. 2510.17314 , archivePrefix =

work page doi:10.48550/arxiv.2510.17314

[39] [39]

Incentivizing Agentic Reasoning in

Xu, Ran and Chen, Jingjing and Ye, Jiayu and Wu, Yu and Yan, Jun and Yang, Carl and Yu, Hongkun , booktitle =. Incentivizing Agentic Reasoning in. 2026 , url =

2026

[40] [40]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward , author =. 2026 , eprint =. doi:10.48550/arXiv.2602.12430 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.12430 2026

[41] [41]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Jiang, Yanna and Li, Delong and Deng, Haiyu and Ma, Baihe and Wang, Xu and Wang, Qin and Yu, Guangsheng , year =. doi:10.48550/arXiv.2602.20867 , url =. 2602.20867 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.20867

[42] [42]

2025 , month = oct, howpublished =

Introducing. 2025 , month = oct, howpublished =

2025

[43] [43]

2025 , month = oct, howpublished =

Equipping Agents for the Real World with Agent Skills , author =. 2025 , month = oct, howpublished =

2025

[44] [44]

2025 , month = nov, howpublished =

2025

[45] [45]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Li, Xiangyi and Chen, Wenbo and Liu, Yimin and Zheng, Shenghan and Chen, Xiaokun and He, Yifeng and Li, Yubo and You, Bingran and Shen, Haotian and Sun, Jiankai and Wang, Shuyi and Li, Binxu and Zeng, Qunhong and Wang, Di and Zhao, Xuandong and Wang, Yuanli and Ben Chaim, Roey and Di, Zonglin and Gao, Yipeng and He, Junwei and He, Yizhuo and Jing, Liqiang...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.12670

[46] [46]

Can External Validation Tools Improve Annotation Quality for

Findeis, Arduin and Weers, Floris and Yin, Guoli and Ye, Ke and Pang, Ruoming and Gunter, Tom , booktitle =. Can External Validation Tools Improve Annotation Quality for. 2025 , address =. doi:10.18653/v1/2025.acl-long.779 , url =

work page doi:10.18653/v1/2025.acl-long.779 2025

[47] [47]

Advances in Neural Information Processing Systems , volume =

Reward Reasoning Models , author =. Advances in Neural Information Processing Systems , volume =. 2025 , url =

2025

[48] [48]

2026 , eprint =

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models , author =. 2026 , eprint =. doi:10.48550/arXiv.2602.04649 , url =

work page doi:10.48550/arxiv.2602.04649 2026

[49] [49]

2021 , eprint =

Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and Jiang, Xu and Cobbe, Karl and Eloundou, Tyna and Krueger, Gretchen and Button, Kevin and Knight, Matthew and Chess, Benjamin and Schulman, John , journal =. 2021 , eprint =

2021

[50] [50]

The Twelfth International Conference on Learning Representations , year =

Let's Verify Step by Step , author =. The Twelfth International Conference on Learning Representations , year =

[51] [51]

2024 , address =

Jiang, Yuxin and Wang, Yufei and Zeng, Xingshan and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Shang, Lifeng and Jiang, Xin and Liu, Qun and Wang, Wei , booktitle =. 2024 , address =. doi:10.18653/v1/2024.acl-long.257 , url =

work page doi:10.18653/v1/2024.acl-long.257 2024

[52] [52]

G -eval: NLG evaluation using gpt-4 with better human alignment

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle =. 2023 , address =. doi:10.18653/v1/2023.emnlp-main.153 , url =

work page doi:10.18653/v1/2023.emnlp-main.153 2023

[53] [53]

Evaluating Judges as Evaluators: The

Zhou, Yilun and Xu, Austin and Wang, Peifeng and Xiong, Caiming and Joty, Shafiq , booktitle =. Evaluating Judges as Evaluators: The. 2025 , volume =

2025

[54] [54]

and Yang, Jiangjiang and Le Bras, Ronan and Tafjord, Oyvind and Wilhelm, Christopher and Soldaini, Luca and Smith, Noah A

Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and Huang, Shengyi and Ivison, Hamish and Brahman, Faeze and Miranda, Lester James Validad and Liu, Alisa and Dziri, Nouha and Lyu, Xinxi and Gu, Yuling and Malik, Saumya and Graf, Victoria and Hwang, Jena D. and Yang, Jiangjiang and Le Bras, Ronan and Tafjord, Oyvind and Wilhelm, Christopher and ...

2025

[55] [55]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and Bikel, Dan and Blecher, Lukas and Canton Ferrer, Cristian and Chen, Moya and Cucurull, Guillem and Esiobu, David and Fernandes, Jude and Fu, Jeremy and Fu, Wenyi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288

[56] [56]

2024 , volume =

Lee, Harrison and Phatale, Samrat and Mansoor, Hassan and Mesnard, Thomas and Ferret, Johan and Lu, Kellie Ren and Bishop, Colton and Hall, Ethan and Carbune, Victor and Rastogi, Abhinav and Prakash, Sushant , booktitle =. 2024 , volume =

2024

[57] [57]

2024 , url =

Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Lin, Qingwei and Jiang, Daxin , booktitle =. 2024 , url =

2024

[58] [58]

Advances in Neural Information Processing Systems , volume =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

2022

[59] [59]

2021 , eprint =

Evaluating Large Language Models Trained on Code , author =. 2021 , eprint =

2021

[60] [60]

2024 , url =

Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Hong, Lauren and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and Li, Dahai and Liu, Zhiyuan and Sun, Maosong , booktitle =. 2024 , url =. 2307.16789 , archivePrefix =

Pith/arXiv arXiv 2024

[61] [61]

doi: 10.18653/v1/2023.emnlp-main.741

Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh , booktitle =. 2023 , address =. doi:10.18653/v1/2023.emnlp-main.741 , url =

work page doi:10.18653/v1/2023.emnlp-main.741 2023

[62] [62]

2024 , address =

Luong, Trung Quoc and Zhang, Xinbo and Jie, Zhanming and Sun, Peng and Jin, Xiaoran and Li, Hang , booktitle =. 2024 , address =. doi:10.18653/v1/2024.acl-long.410 , url =

work page doi:10.18653/v1/2024.acl-long.410 2024

[63] [63]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya , year =. doi:10.48550/arXiv.2402.03300 , url =. 2402.03300 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300

[64] [64]

2025 , eprint =

Group Sequence Policy Optimization , author =. 2025 , eprint =

2025

[65] [65]

LLM -Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Hashemi, Helia and Eisner, Jason and Rosset, Corby and. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = aug, year =. doi:10.18653/v1/2024.acl-long.745 , url =

work page doi:10.18653/v1/2024.acl-long.745 2024

[66] [66]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Wang, Guanzhi and Xie, Yuqi and Jiang, Yunfan and Mandlekar, Ajay and Xiao, Chaowei and Zhu, Yuke and Fan, Linxi and Anandkumar, Anima , year =. doi:10.48550/arXiv.2305.16291 , url =. 2305.16291 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.16291

[67] [67]

CoRR , volume =

Ling, George and Zhong, Shanshan and Huang, Richard , year =. Agent Skills: A Data-Driven Analysis of. doi:10.48550/arXiv.2602.08004 , url =. 2602.08004 , archivePrefix =

work page doi:10.48550/arxiv.2602.08004

[68] [68]

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications , author =. 2026 , eprint =. doi:10.48550/arXiv.2605.07358 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.07358 2026

[69] [69]

arXiv preprint arXiv:2603.02176 , year=

Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale , author =. 2026 , eprint =. doi:10.48550/arXiv.2603.02176 , url =

work page doi:10.48550/arxiv.2603.02176 2026

[70] [70]

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills , author =. 2026 , eprint =. doi:10.48550/arXiv.2604.24026 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.24026 2026