Recognition: 1 Lean theorem link
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Pith reviewed 2026-05-12 04:18 UTC · model grok-4.3
The pith
Agents display values distinct from those of their base LLMs; these values form a cross-model Value Tide that is malleable by harnesses and skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that agent values diverge from those of the underlying LLM and manifest as a Value Tide of cross-model homogeneity. This tide bends non-additively when different harnesses are applied, and bends more decisively when skills are deliberately embedded. Together, these results indicate that the practical lever for agent alignment is shifting from model-level and prompt-level methods toward harness alignment and skill steering.
What carries the argument
Agent-ValueBench: a benchmark of 394 executable environments across 16 domains supplying 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions, each task equipped with two pole-aligned golden trajectories scored by a trajectory-level rubric judge.
Load-bearing premise
The end-to-end synthesis pipeline together with psychologist curation produces tasks and golden trajectories that measure the intended 28 value systems and 332 dimensions without systematic artifacts from generation or expert judgment.
What would settle it
Running the same 14 models on a fresh, independently authored collection of executable value-conflict tasks and finding that agent values match the base LLM values exactly while showing no measurable change under harness swaps or skill insertion would falsify the divergence and non-additive bending claims.
read the original abstract
Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent-ValueBench as the first dedicated benchmark for evaluating values in autonomous agents. It constructs 4,335 value-conflict tasks across 394 executable environments in 16 domains, covering 28 value systems and 332 dimensions. Tasks are generated via a purpose-built end-to-end synthesis pipeline and per-instance curated by professional psychologists, each accompanied by two pole-aligned golden trajectories for a trajectory-level rubric-based judge. Benchmarking 14 frontier models across 4 harnesses yields three findings: agent values exhibit a 'Value Tide' of cross-model homogeneity distinct from underlying LLMs; this homogeneity bends non-additively under harness influence; and it is further modulated by embedded skill steering. The work argues that these results shift the agent-alignment focus from model-level to harness- and skill-level interventions.
Significance. If the synthesis pipeline and curation produce artifact-free measures, the benchmark would represent a substantial advance in AI safety and alignment research by extending value evaluation beyond text-only LLM protocols to agentic settings. The empirical demonstration of value divergence, cross-model homogeneity, and non-additive modulation by harnesses and skills provides concrete evidence for new alignment levers and highlights dataset-, evaluation-, and system-level challenges unique to agents. The inclusion of executable environments, golden trajectories, and a rubric-based judge supports reproducibility and falsifiability, strengthening the contribution if methodological validity is established.
major comments (3)
- [Benchmark Construction] Benchmark Construction section: The end-to-end synthesis pipeline and per-instance psychologist curation are described at a high level but provide no details on validation procedures, inter-rater reliability statistics, or controls for systematic biases (e.g., embedding of LLM priors in task generation or selection effects in the 4,335 tasks). This is load-bearing for the central claims of value divergence and the Value Tide, as any artifact in task instantiation of the 28 value systems would render the homogeneity and steering results non-generalizable.
- [Experimental Setup and Results] Experimental Setup and Results section: The reported cross-model homogeneity and non-additive bending under harness pull and skill steering lack explicit statistical controls, significance testing, or ablation details on how dataset-, evaluation-, and system-level factors were isolated. Without these, it is unclear whether the Value Tide findings hold after accounting for potential confounds in the 14-model, 4-harness evaluation.
- [Evaluation Methodology] Evaluation Methodology subsection: The trajectory-level rubric-based judge, anchored by golden trajectories, is presented without reported validation against human judgments or sensitivity analysis to variations in harness implementation. This directly affects the reliability of the value measurements underlying all three main findings.
minor comments (2)
- [Abstract] Abstract: The novel term 'Value Tide' is used without a concise definition or forward reference, reducing immediate clarity for readers.
- [Results] Figure and Table captions: Several result visualizations would benefit from explicit legends distinguishing the 28 value systems and the four harnesses to aid interpretation of the homogeneity patterns.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for major revision. The comments highlight important opportunities to increase methodological transparency. We address each major comment point-by-point below and will incorporate the requested details, statistical analyses, and validation results into the revised manuscript to strengthen the evidence for our findings on agent values.
read point-by-point responses
-
Referee: [Benchmark Construction] Benchmark Construction section: The end-to-end synthesis pipeline and per-instance psychologist curation are described at a high level but provide no details on validation procedures, inter-rater reliability statistics, or controls for systematic biases (e.g., embedding of LLM priors in task generation or selection effects in the 4,335 tasks). This is load-bearing for the central claims of value divergence and the Value Tide, as any artifact in task instantiation of the 28 value systems would render the homogeneity and steering results non-generalizable.
Authors: We agree that expanded details on validation and bias controls are necessary to support the benchmark's claims. The manuscript currently summarizes the pipeline and per-instance psychologist curation at a high level for conciseness. In the revision, we will substantially expand the Benchmark Construction section to report inter-rater reliability statistics (e.g., Fleiss' kappa across psychologists for value dimension assignments and task curation), explicit procedures for mitigating LLM priors (including multi-stage human-only review of generated tasks), and analysis of selection effects to confirm balanced coverage of the 28 value systems. Full curation protocols and bias-control documentation will be added to the supplementary materials. revision: yes
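The inter-rater reliability statistic the authors promise here (Fleiss' kappa across psychologists) can be illustrated with a minimal sketch. The rating matrix below is hypothetical: it assumes each task is independently assigned to one of two value-dimension labels by the same number of raters, which is the standard setting for the statistic.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Marginal proportion of each category across all ratings.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance from the marginals.
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical curation data: 3 tasks, 2 labels, 3 psychologist raters each.
perfect = [[3, 0], [0, 3], [3, 0]]
print(fleiss_kappa(perfect))  # perfect agreement -> 1.0
```

Values near 1 indicate agreement well above chance; values near 0 indicate agreement at chance level, which would undercut the curation's claimed validity.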
-
Referee: [Experimental Setup and Results] Experimental Setup and Results section: The reported cross-model homogeneity and non-additive bending under harness pull and skill steering lack explicit statistical controls, significance testing, or ablation details on how dataset-, evaluation-, and system-level factors were isolated. Without these, it is unclear whether the Value Tide findings hold after accounting for potential confounds in the 14-model, 4-harness evaluation.
Authors: We acknowledge that additional statistical rigor will better isolate the Value Tide from potential confounds. The current results demonstrate consistent patterns across 14 models and 4 harnesses, but we will revise the Experimental Setup and Results section to include formal significance testing (e.g., ANOVA with post-hoc tests and reported p-values/effect sizes for cross-model homogeneity), ablation studies that systematically vary dataset, evaluation, and system factors, and controls for confounds such as model scale or harness-specific artifacts. Updated figures and tables will show that the homogeneity and non-additive bending persist after these adjustments. revision: yes
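As a sketch of the kind of significance testing proposed (ANOVA across harness conditions), the one-way F statistic over per-harness value scores can be computed as below. The score groups are hypothetical stand-ins; a real analysis would also report p-values, post-hoc tests, and effect sizes as the authors describe.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over k groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: how far each harness mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variability: score spread inside each harness condition.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical value scores under two harnesses.
print(one_way_anova_f([[1.0, 2.0], [3.0, 4.0]]))  # -> 8.0
```

A large F relative to the F distribution's critical value would support the claim that harness choice shifts agent values beyond within-condition noise.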
-
Referee: [Evaluation Methodology] Evaluation Methodology subsection: The trajectory-level rubric-based judge, anchored by golden trajectories, is presented without reported validation against human judgments or sensitivity analysis to variations in harness implementation. This directly affects the reliability of the value measurements underlying all three main findings.
Authors: We appreciate this point on evaluation reliability. The golden trajectories anchor the rubric-based judge, but external validation was not detailed in the original submission. In the revised manuscript, we will augment the Evaluation Methodology subsection with a human validation study on a stratified subset of trajectories (reporting agreement metrics such as Cohen's kappa between the automated judge and professional evaluators) and a sensitivity analysis testing variations in harness implementations (e.g., alternative prompt templates and environment configurations). Full results and methodology for these checks will appear in the main text and appendix. revision: yes
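The judge-versus-human agreement metric named here (Cohen's kappa) can be sketched as follows; the label lists are hypothetical stand-ins for the automated judge's and a psychologist's pole verdicts on the same trajectories.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters' labels over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    cats = set(rater_a) | set(rater_b)

    # Observed agreement fraction.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in cats
    )
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical pole labels: automated judge vs. professional evaluator.
judge = ["pole_A", "pole_B", "pole_A", "pole_B"]
human = ["pole_A", "pole_B", "pole_B", "pole_B"]
print(cohen_kappa(judge, human))  # -> 0.5
```

High kappa on the stratified validation subset would support treating the rubric judge's scores as a proxy for expert judgment across all three main findings.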
Circularity Check
No circularity: empirical benchmark construction and observations are independent of inputs
full rationale
The paper introduces Agent-ValueBench through an end-to-end synthesis pipeline and per-instance psychologist curation to generate 4,335 tasks covering 28 value systems, then reports direct empirical results from running 14 models across 4 harnesses. No equations, parameter fitting, derivations, or self-referential predictions appear in the presented claims. The Value Tide homogeneity, non-additive harness effects, and skill steering findings are observational outputs from the benchmark rather than quantities forced by construction or prior self-citations. The construction pipeline is described as purpose-built and externally curated, with no reduction of results to the generation process itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Agent values can be reliably measured and distinguished through choices in executable environments using golden trajectories and rubric-based judging.
- domain assumption: Professional psychologist curation ensures the validity of value-conflict tasks across 28 systems and 332 dimensions.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We close this gap with Agent-ValueBench... 394 executable environments... 4,335 value-conflict tasks... two pole-aligned golden trajectories... rubric-based judge... Value Tide of cross-model homogeneity... bends non-additively under harness pull and skill steering.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.21460 2025
-
[2]
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfo...
work page 2026
-
[3]
Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A comprehensive survey of self-evolving AI agents: A new paradigm bridging foundation models and lifelong agentic systems. CoRR, abs/2508.07407, 2025. doi: 10.48550/ARXIV.2508.07407. UR...
-
[4]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X
work page 2023
-
[6]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024. URL https://openreview.net/forum?id=ehfRiF0R3a
work page 2024
-
[7]
Meta Context Engineering via Agentic Skill Evolution
Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. CoRR, abs/2601.21557, 2026. doi: 10.48550/ARXIV.2601.21557. URL https://doi.org/10.48550/arXiv.2601.21557
-
[9]
OpenClaw-RL: Train Any Agent Simply by Talking
Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. OpenClaw-RL: Train any agent simply by talking. CoRR, abs/2603.10165, 2026. doi: 10.48550/ARXIV.2603.10165. URL https://doi.org/10.48550/arXiv.2603.10165
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.10165 2026
-
[10]
Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. MetaClaw: Just talk - an agent that meta-learns and evolves in the wild. CoRR, abs/2603.17187, 2026. doi: 10.48550/ARXIV.2603.17187. URL https://doi.org/10.48550/arXiv.2603.17187
-
[11]
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, 2026. URL https://arxiv.org/abs/2604.08377
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, and Xingxing Wei. Deceptionbench: A comprehensive benchmark for AI deception behaviors in real-world scenarios. CoRR, abs/2510.15501, 2025. doi: 10.48550/ARXIV.2510.15501. URL https://doi.org/10.48550/arXiv.2510.15501
-
[13]
Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, EunJeong Hwang, Hadas Orgad, P Sam Sahil, Neg...
work page internal anchor Pith review arXiv 2026
-
[14]
Uncovering Security Threats and Architecting Defenses in Autonomous Agents,
Zonghao Ying, Xiao Yang, Siyang Wu, Yumeng Song, Yang Qu, Hainan Li, Tianlin Li, Jiakai Wang, Aishan Liu, and Xianglong Liu. Uncovering security threats and architecting defenses in autonomous agents: A case study of OpenClaw. CoRR, abs/2603.12644, 2026. doi: 10.48550/ARXIV.2603.12644. URL https://doi.org/10.48550/arXiv.2603.12644
-
[15]
Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu. Don’t let the claw grip your hand: A security analysis and defense framework for OpenClaw. CoRR, abs/2603.10387, 2026. doi: 10.48550/ARXIV.2603.10387. URL https://doi.org/10.48550/arXiv.2603.10387
-
[16]
Shalom H Schwartz. An overview of the schwartz theory of basic values.Online readings in Psychology and Culture, 2(1), 2012
work page 2012
-
[17]
Values and behavior: Strength and structure of relations
Anat Bardi and Shalom H Schwartz. Values and behavior: Strength and structure of relations. Personality and social psychology bulletin, 29(10):1207–1220, 2003
work page 2003
-
[18]
Artificial Intelligence, Values and Alignment
Iason Gabriel. Artificial intelligence, values, and alignment. Minds Mach., 30(3):411–437, September 2020. ISSN 0924-6495. doi: 10.1007/s11023-020-09539-2. URL https://doi.org/10.1007/s11023-020-09539-2
-
[19]
The alignment problem from a deep learning perspective
Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=fh8EYKFKns
work page 2024
-
[20]
Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. Valuebench: Towards comprehensively evaluating value orientations and understanding of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Ba...
-
[21]
Value Compass Benchmarks: A Comprehensive, Generative and Self-Evolving Platform for LLMs’ Value Evaluation
Jing Yao, Xiaoyuan Yi, Shitong Duan, Jindong Wang, Yuzhuo Bai, Muhua Huang, Yang Ou, Scarlett Li, Peng Zhang, Tun Lu, Zhicheng Dou, Maosong Sun, James Evans, and Xing Xie. Value compass benchmarks: A comprehensive, generative and self-evolving platform for LLMs’ value evaluation. In Pushkar Mishra, Smaranda Muresan, and Tao Yu, editors, Proceedings of the ... URL https://aclanthology.org/2025.acl-demo.64/
-
[22]
Memory in the Age of AI Agents
Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.13564 2025
-
[23]
AI Agents That Matter
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter. Trans. Mach. Learn. Res., 2025. URL https://openreview.net/forum?id=Zy4uFzMviZ
work page 2025
-
[24]
A survey on large language model based autonomous agents
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents. Frontiers Comput. Sci., 18(6):186345, 2024. doi: 10.1007/S11704-024-40231-1. URL https://doi.org/10.1007/s11704-024-40231-1
-
[25]
Mem-T: Densifying Rewards for Long-Horizon Memory Agents
Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents, 2026. URL https://arxiv.org/abs/2601.23014
work page 2026
-
[26]
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. MasRouter: Learning to route LLMs for multi-agent systems. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna,...
work page 2025
-
[27]
Aurora: Breaking low-rank bottleneck of lora with nonlinear mapping
Haonan Dong, Wenhao Zhu, Guojie Song, and Liang Wang. Aurora: Breaking low-rank bottleneck of lora with nonlinear mapping. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 36929–36961. Curran Associates, Inc., 2025. URL https://proceedings.neurip...
work page 2025
-
[28]
NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang, and Guojie Song. NeuReasoner: Towards explainable, controllable, and unified reasoning via mixture-of-neurons, 2026. URL https://arxiv.org/abs/2604.02972
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models
Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, and Guojie Song. FoE: Forest of errors makes the first solution the best in large reasoning models, 2026. URL https://arxiv.org/abs/2604.02967
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[30]
Hua Shen, Nicholas Clark, and Tanu Mitra. Mind the value-action gap: Do llms act in alignment with their values? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 3097–3118. A...
-
[31]
Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona T. Diab, Daniel Fried, Atoosa Kasirzadeh, and Max Kleiman-Weiner. Generative value conflicts reveal LLM priorities. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RXCRKAcv3B
work page 2026
-
[32]
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...
work page 2024
-
[33]
Appworld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association ...
-
[34]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains. CoRR, abs/2406.12045, 2024. doi: 10.48550/ARXIV.2406.12045. URL https://doi.org/10.48550/arXiv.2406.12045
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12045 2024
-
[35]
AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 74325–7...
-
[36]
Traject-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use
Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, and Benoit Dumoulin. Traject-bench: A trajectory-aware benchmark for evaluating agentic tool use, 2025. URL https://arxiv.org/abs/2510.04550
-
[37]
Measuring human and AI values based on generative psychometrics with large language models
Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, and Guojie Song. Measuring human and AI values based on generative psychometrics with large language models. In Toby Walsh, Julie Shah, and Zico Kolter, editors, Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Sympos...
-
[38]
Generative psycho-lexical approach for constructing value systems in large language models
Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, and Guojie Song. Generative psycho-lexical approach for constructing value systems in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguisti...
work page 2025
-
[39]
ClawSafety: "Safe" LLMs, Unsafe Agents
Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, and Yingqiang Ge. ClawSafety: "Safe" LLMs, unsafe agents, 2026. URL https://arxiv.org/abs/2604.01438
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[40]
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, and Hanchung Lee. ClawsBench: Evaluating capability and safety of LLM productivity agents in simulated workspaces, 2026. URL https://arxiv.org/abs/2604.05172
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[41]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50528–5065...
-
[42]
Meta-Harness: End-to-End Optimization of Model Harnesses
Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/abs/2603.28052
work page internal anchor Pith review arXiv 2026
-
[43]
Natural-Language Agent Harnesses
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Haitao Zheng. Natural-language agent harnesses. CoRR, abs/2603.25723, 2026. doi: 10.48550/ARXIV.2603.25723. URL https://doi.org/10.48550/arXiv.2603.25723
-
[44]
Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992
work page 1992
-
[45]
Paul Conway and Bertram Gawronski. Deontological and utilitarian inclinations in moral decision making: a process dissociation approach. Journal of personality and social psychology, 104(2):216, 2013
work page 2013
-
[46]
Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. Large language model psychometrics: A systematic review of evaluation, validation, and enhancement. CoRR, abs/2505.08245. doi: 10.48550/ARXIV.2505.08245. URL https://doi.org/10.48550/arXiv.2505.08245
-
[48]
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. EnvScaler: Scaling tool-interactive environments for LLM agent via programmatic synthesis. CoRR, abs/2601.05808, 2026. doi: 10.48550/ARXIV.2601.05808. URL https://doi.org/10.48550/arXiv.2601.05808
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.05808 2026
-
[49]
Toolllm: Facilitating large language models to master 16000+ real-world apis
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning...
work page 2024
-
[51]
Toolace: Winning the points of LLM function calling
Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. Toolace: Winning the points of L...
work page 2025
-
[52]
Toolalpaca: Generalized tool learning for language models with 3000 simulated cases
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases, 2023. URL https://arxiv.org/abs/2306.05301
-
[53]
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the safety of LLM agents. CoRR, abs/2412.14470. doi: 10.48550/ARXIV.2412.14470. URL https://doi.org/10.48550/arXiv.2412.14470
work page internal anchor Pith review arXiv doi:10.48550/arxiv.2412.14470
-
[55]
Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2...
work page 2025
[56] Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, and Kun Wang. GDeR: Safeguarding efficiency, balancing, and robustness via prototypical graph pruning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, 2024.
[57] Jie Yang, Kexin Zhang, Guibin Zhang, Philip S. Yu, and Kaize Ding. Glocal information bottleneck for time series imputation. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 104452–104484. Curran Associates, Inc., 2025.
[58] Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric. CoRR, abs/2602.14069, 2026. doi: 10.48550/ARXIV.2602.14069. URL https://doi.org/10.48550/arXiv.2602.14069.
[59] Lilach Sagiv and Sonia Roccas. How do values affect behavior? Let me count the ways. Personality and Social Psychology Review, 25(4):295–316, 2021.
[60] Bas Verplanken and Rob W. Holland. Motivated decision making: Effects of activation and self-centrality of values on choices and behavior. Journal of Personality and Social Psychology, 82(3):434, 2002.
[61] Rick Jacobs, Ditsa Kafry, and Sheldon Zedeck. Expectations of behaviorally anchored rating scales. Personnel Psychology, 33(3):595–640, 1980.
[62] Samuel Messick. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9):741, 1995.
[63] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems, 2023.
[64] Anthropic. Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025. Accessed: 2026-04-30.
[65] Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026. Accessed: 2026-04-30.
[66] Google. Gemini 3 Flash. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, December 2025. Accessed: 2026-04-30.
[67] Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026. Accessed: 2026-04-30.
[68] OpenAI. GPT-5.4 Thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, March 2026. Accessed: 2026-04-30.
[69] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 2026. Accessed: 2026-04-30.
[70]
[71]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models.CoRR, abs/2512.02556, 2025. doi: 10.48550/ARXIV.2512.02556. URLhttps://doi.org/10.4 8550/arXiv.2512.02556
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556 2025
[72] GLM-5 Team. GLM-5: From vibe coding to agentic engineering. CoRR, abs/2602.15763, 2026. doi: 10.48550/ARXIV.2602.15763. URL https://doi.org/10.48550/arXiv.2602.15763.
[73] Kimi Team. Kimi K2.5: Visual agentic intelligence. CoRR, abs/2602.02276, 2026. doi: 10.48550/ARXIV.2602.02276. URL https://doi.org/10.48550/arXiv.2602.02276.
[74] Llama Team. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407.21783. URL https://doi.org/10.48550/arXiv.2407.21783.
[75] MiniMax. MiniMax M2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, March 2026. Accessed: 2026-04-30.
[76] Qwen Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025. doi: 10.48550/ARXIV.2505.09388. URL https://doi.org/10.48550/arXiv.2505.09388.
[77] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
[78] OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May. Accessed: 2026-04-30.
[80] Anthropic. Claude Code by Anthropic. https://www.anthropic.com/product/claude-code, 2026. Accessed: 2026-04-30.
[81] Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, and Esin Durmus. Stress-testing model specs reveals character differences among language models. CoRR, abs/2510.07686, 2025. doi: 10.48550/ARXIV.2510.07686. URL https://doi.org/10.48550/arXiv.2510.07686.