Recognition: 1 Lean theorem link
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Pith reviewed 2026-05-12 04:18 UTC · model grok-4.3
The pith
Agents display values distinct from those of their base LLMs; these values form a cross-model Value Tide that is malleable by harnesses and skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that agent values diverge from those of the underlying LLM and manifest as a Value Tide of cross-model homogeneity. This tide bends non-additively when different harnesses are applied, and bends more decisively when skills are deliberately embedded. Together, these results indicate that the practical lever for agent alignment is shifting from model-level and prompt-level methods toward harness alignment and skill steering.
What carries the argument
Agent-ValueBench: a benchmark of 394 executable environments across 16 domains supplying 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions, each task equipped with two pole-aligned golden trajectories scored by a trajectory-level rubric judge.
Load-bearing premise
The end-to-end synthesis pipeline together with psychologist curation produces tasks and golden trajectories that measure the intended 28 value systems and 332 dimensions without systematic artifacts from generation or expert judgment.
What would settle it
Running the same 14 models on a fresh, independently authored collection of executable value-conflict tasks and finding that agent values match the base LLM values exactly while showing no measurable change under harness swaps or skill insertion would falsify the divergence and non-additive bending claims.
read the original abstract
Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent-ValueBench as the first dedicated benchmark for evaluating values in autonomous agents. It constructs 4,335 value-conflict tasks across 394 executable environments in 16 domains, covering 28 value systems and 332 dimensions. Tasks are generated via a purpose-built end-to-end synthesis pipeline and per-instance curated by professional psychologists, each accompanied by two pole-aligned golden trajectories for a trajectory-level rubric-based judge. Benchmarking 14 frontier models across 4 harnesses yields three findings: agent values exhibit a 'Value Tide' of cross-model homogeneity distinct from underlying LLMs; this homogeneity bends non-additively under harness influence; and it is further modulated by embedded skill steering. The work argues that these results shift the agent-alignment focus from model-level to harness- and skill-level interventions.
Significance. If the synthesis pipeline and curation produce artifact-free measures, the benchmark would represent a substantial advance in AI safety and alignment research by extending value evaluation beyond text-only LLM protocols to agentic settings. The empirical demonstration of value divergence, cross-model homogeneity, and non-additive modulation by harnesses and skills provides concrete evidence for new alignment levers and highlights dataset-, evaluation-, and system-level challenges unique to agents. The inclusion of executable environments, golden trajectories, and a rubric-based judge supports reproducibility and falsifiability, strengthening the contribution if methodological validity is established.
major comments (3)
- [Benchmark Construction] Benchmark Construction section: The end-to-end synthesis pipeline and per-instance psychologist curation are described at a high level but provide no details on validation procedures, inter-rater reliability statistics, or controls for systematic biases (e.g., embedding of LLM priors in task generation or selection effects in the 4,335 tasks). This is load-bearing for the central claims of value divergence and the Value Tide, as any artifact in task instantiation of the 28 value systems would render the homogeneity and steering results non-generalizable.
- [Experimental Setup and Results] Experimental Setup and Results section: The reported cross-model homogeneity and non-additive bending under harness pull and skill steering lack explicit statistical controls, significance testing, or ablation details on how dataset-, evaluation-, and system-level factors were isolated. Without these, it is unclear whether the Value Tide findings hold after accounting for potential confounds in the 14-model, 4-harness evaluation.
- [Evaluation Methodology] Evaluation Methodology subsection: The trajectory-level rubric-based judge, anchored by golden trajectories, is presented without reported validation against human judgments or sensitivity analysis to variations in harness implementation. This directly affects the reliability of the value measurements underlying all three main findings.
minor comments (2)
- [Abstract] Abstract: The novel term 'Value Tide' is used without a concise definition or forward reference, reducing immediate clarity for readers.
- [Results] Figure and Table captions: Several result visualizations would benefit from explicit legends distinguishing the 28 value systems and the four harnesses to aid interpretation of the homogeneity patterns.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for major revision. The comments highlight important opportunities to increase methodological transparency. We address each major comment point-by-point below and will incorporate the requested details, statistical analyses, and validation results into the revised manuscript to strengthen the evidence for our findings on agent values.
read point-by-point responses
-
Referee: [Benchmark Construction] Benchmark Construction section: The end-to-end synthesis pipeline and per-instance psychologist curation are described at a high level but provide no details on validation procedures, inter-rater reliability statistics, or controls for systematic biases (e.g., embedding of LLM priors in task generation or selection effects in the 4,335 tasks). This is load-bearing for the central claims of value divergence and the Value Tide, as any artifact in task instantiation of the 28 value systems would render the homogeneity and steering results non-generalizable.
Authors: We agree that expanded details on validation and bias controls are necessary to support the benchmark's claims. The manuscript currently summarizes the pipeline and per-instance psychologist curation at a high level for conciseness. In the revision, we will substantially expand the Benchmark Construction section to report inter-rater reliability statistics (e.g., Fleiss' kappa across psychologists for value dimension assignments and task curation), explicit procedures for mitigating LLM priors (including multi-stage human-only review of generated tasks), and analysis of selection effects to confirm balanced coverage of the 28 value systems. Full curation protocols and bias-control documentation will be added to the supplementary materials. revision: yes
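The inter-rater reliability statistic the authors promise here (Fleiss' kappa across psychologists) can be illustrated with a minimal sketch. The rating matrix below is hypothetical: it assumes each task is independently assigned to one of two value-dimension labels by the same number of raters, which is the standard setting for the statistic.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Marginal proportion of each category across all ratings.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance from the marginals.
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical curation data: 3 tasks, 2 labels, 3 psychologist raters each.
perfect = [[3, 0], [0, 3], [3, 0]]
print(fleiss_kappa(perfect))  # perfect agreement -> 1.0
```

Values near 1 indicate agreement well above chance; values near 0 indicate agreement at chance level, which would undercut the curation's claimed validity.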
-
Referee: [Experimental Setup and Results] Experimental Setup and Results section: The reported cross-model homogeneity and non-additive bending under harness pull and skill steering lack explicit statistical controls, significance testing, or ablation details on how dataset-, evaluation-, and system-level factors were isolated. Without these, it is unclear whether the Value Tide findings hold after accounting for potential confounds in the 14-model, 4-harness evaluation.
Authors: We acknowledge that additional statistical rigor will better isolate the Value Tide from potential confounds. The current results demonstrate consistent patterns across 14 models and 4 harnesses, but we will revise the Experimental Setup and Results section to include formal significance testing (e.g., ANOVA with post-hoc tests and reported p-values/effect sizes for cross-model homogeneity), ablation studies that systematically vary dataset, evaluation, and system factors, and controls for confounds such as model scale or harness-specific artifacts. Updated figures and tables will show that the homogeneity and non-additive bending persist after these adjustments. revision: yes
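As a sketch of the kind of significance testing proposed (ANOVA across harness conditions), the one-way F statistic over per-harness value scores can be computed as below. The score groups are hypothetical stand-ins; a real analysis would also report p-values, post-hoc tests, and effect sizes as the authors describe.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over k groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: how far each harness mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variability: score spread inside each harness condition.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical value scores under two harnesses.
print(one_way_anova_f([[1.0, 2.0], [3.0, 4.0]]))  # -> 8.0
```

A large F relative to the F distribution's critical value would support the claim that harness choice shifts agent values beyond within-condition noise.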
-
Referee: [Evaluation Methodology] Evaluation Methodology subsection: The trajectory-level rubric-based judge, anchored by golden trajectories, is presented without reported validation against human judgments or sensitivity analysis to variations in harness implementation. This directly affects the reliability of the value measurements underlying all three main findings.
Authors: We appreciate this point on evaluation reliability. The golden trajectories anchor the rubric-based judge, but external validation was not detailed in the original submission. In the revised manuscript, we will augment the Evaluation Methodology subsection with a human validation study on a stratified subset of trajectories (reporting agreement metrics such as Cohen's kappa between the automated judge and professional evaluators) and a sensitivity analysis testing variations in harness implementations (e.g., alternative prompt templates and environment configurations). Full results and methodology for these checks will appear in the main text and appendix. revision: yes
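The judge-versus-human agreement metric named here (Cohen's kappa) can be sketched as follows; the label lists are hypothetical stand-ins for the automated judge's and a psychologist's pole verdicts on the same trajectories.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters' labels over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    cats = set(rater_a) | set(rater_b)

    # Observed agreement fraction.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in cats
    )
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical pole labels: automated judge vs. professional evaluator.
judge = ["pole_A", "pole_B", "pole_A", "pole_B"]
human = ["pole_A", "pole_B", "pole_B", "pole_B"]
print(cohen_kappa(judge, human))  # -> 0.5
```

High kappa on the stratified validation subset would support treating the rubric judge's scores as a proxy for expert judgment across all three main findings.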
Circularity Check
No circularity: empirical benchmark construction and observations are independent of inputs
full rationale
The paper introduces Agent-ValueBench through an end-to-end synthesis pipeline and per-instance psychologist curation to generate 4,335 tasks covering 28 value systems, then reports direct empirical results from running 14 models across 4 harnesses. No equations, parameter fitting, derivations, or self-referential predictions appear in the presented claims. The Value Tide homogeneity, non-additive harness effects, and skill steering findings are observational outputs from the benchmark rather than quantities forced by construction or prior self-citations. The construction pipeline is described as purpose-built and externally curated, with no reduction of results to the generation process itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Agent values can be reliably measured and distinguished through choices in executable environments using golden trajectories and rubric-based judging.
- domain assumption: Professional psychologist curation ensures the validity of value-conflict tasks across 28 systems and 332 dimensions.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We close this gap with Agent-ValueBench... 394 executable environments... 4,335 value-conflict tasks... two pole-aligned golden trajectories... rubric-based judge... Value Tide of cross-model homogeneity... bends non-additively under harness pull and skill steering.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.21460 2025
-
[2]
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfo...
work page 2026
-
[3]
Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A comprehensive survey of self-evolving AI agents: A new paradigm bridging foundation models and lifelong agentic systems. CoRR, abs/2508.07407, 2025. doi: 10.48550/ARXIV.2508.07407. UR...
-
[4]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X
work page 2023
-
[6]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024. URL https://openreview.net/forum?id=ehfRiF0R3a
work page 2024
-
[7]
Meta Context Engineering via Agentic Skill Evolution
Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. CoRR, abs/2601.21557, 2026. doi: 10.48550/ARXIV.2601.21557. URL https://doi.org/10.48550/arXiv.2601.21557
-
[9]
OpenClaw-RL: Train Any Agent Simply by Talking
Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. OpenClaw-RL: Train any agent simply by talking. CoRR, abs/2603.10165, 2026. doi: 10.48550/ARXIV.2603.10165. URL https://doi.org/10.48550/arXiv.2603.10165
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.10165 2026
-
[10]
Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. MetaClaw: Just talk - an agent that meta-learns and evolves in the wild. CoRR, abs/2603.17187, 2026. doi: 10.48550/ARXIV.2603.17187. URL https://doi.org/10.48550/arXiv.2603.17187
-
[11]
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, 2026. URL https://arxiv.org/abs/2604.08377
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, and Xingxing Wei. Deceptionbench: A comprehensive benchmark for AI deception behaviors in real-world scenarios. CoRR, abs/2510.15501, 2025. doi: 10.48550/ARXIV.2510.15501. URL https://doi.org/10.48550/arXiv.2510.15501
-
[13]
Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, EunJeong Hwang, Hadas Orgad, P Sam Sahil, Neg...
work page internal anchor Pith review arXiv 2026
-
[14]
Uncovering Security Threats and Architecting Defenses in Autonomous Agents,
Zonghao Ying, Xiao Yang, Siyang Wu, Yumeng Song, Yang Qu, Hainan Li, Tianlin Li, Jiakai Wang, Aishan Liu, and Xianglong Liu. Uncovering security threats and architecting defenses in autonomous agents: A case study of OpenClaw. CoRR, abs/2603.12644, 2026. doi: 10.48550/ARXIV.2603.12644. URL https://doi.org/10.48550/arXiv.2603.12644
-
[15]
Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu. Don’t let the claw grip your hand: A security analysis and defense framework for OpenClaw. CoRR, abs/2603.10387, 2026. doi: 10.48550/ARXIV.2603.10387. URL https://doi.org/10.48550/arXiv.2603.10387
-
[16]
Shalom H Schwartz. An overview of the schwartz theory of basic values.Online readings in Psychology and Culture, 2(1), 2012
work page 2012
-
[17]
Values and behavior: Strength and structure of relations
Anat Bardi and Shalom H Schwartz. Values and behavior: Strength and structure of relations. Personality and social psychology bulletin, 29(10):1207–1220, 2003
work page 2003
-
[18]
Artificial Intelligence, Values and Alignment
Iason Gabriel. Artificial intelligence, values, and alignment. Minds Mach., 30(3):411–437, September 2020. ISSN 0924-6495. doi: 10.1007/s11023-020-09539-2. URL https://doi.org/10.1007/s11023-020-09539-2
-
[19]
The alignment problem from a deep learning perspective
Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=fh8EYKFKns
work page 2024
-
[20]
Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. Valuebench: Towards comprehensively evaluating value orientations and understanding of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Ba...
-
[21]
Value Compass Benchmarks: A Comprehensive, Generative and Self-Evolving Platform for LLMs’ Value Evaluation
Jing Yao, Xiaoyuan Yi, Shitong Duan, Jindong Wang, Yuzhuo Bai, Muhua Huang, Yang Ou, Scarlett Li, Peng Zhang, Tun Lu, Zhicheng Dou, Maosong Sun, James Evans, and Xing Xie. Value compass benchmarks: A comprehensive, generative and self-evolving platform for LLMs’ value evaluation. In Pushkar Mishra, Smaranda Muresan, and Tao Yu, editors, Proceedings of the ... URL https://aclanthology.org/2025.acl-demo.64/
-
[22]
Memory in the Age of AI Agents
Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.13564 2025
-
[23]
AI Agents That Matter
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter. Trans. Mach. Learn. Res., 2025. URL https://openreview.net/forum?id=Zy4uFzMviZ
work page 2025
-
[24]
A survey on large language model based autonomous agents
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents. Frontiers Comput. Sci., 18(6):186345, 2024. doi: 10.1007/S11704-024-40231-1. URL https://doi.org/10.1007/s11704-024-40231-1
-
[25]
Mem-T: Densifying Rewards for Long-Horizon Memory Agents
Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents, 2026. URL https://arxiv.org/abs/2601.23014
work page 2026
-
[26]
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. MasRouter: Learning to route LLMs for multi-agent systems. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna,...
work page 2025
-
[27]
Aurora: Breaking low-rank bottleneck of lora with nonlinear mapping
Haonan Dong, Wenhao Zhu, Guojie Song, and Liang Wang. Aurora: Breaking low-rank bottleneck of lora with nonlinear mapping. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 36929–36961. Curran Associates, Inc., 2025. URL https://proceedings.neurip...
work page 2025
-
[28]
NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang, and Guojie Song. NeuReasoner: Towards explainable, controllable, and unified reasoning via mixture-of-neurons, 2026. URL https://arxiv.org/abs/2604.02972
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models
Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, and Guojie Song. FoE: Forest of errors makes the first solution the best in large reasoning models, 2026. URL https://arxiv.org/abs/2604.02967
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[30]
Hua Shen, Nicholas Clark, and Tanu Mitra. Mind the value-action gap: Do llms act in alignment with their values? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 3097–3118. A...
-
[31]
Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona T. Diab, Daniel Fried, Atoosa Kasirzadeh, and Max Kleiman-Weiner. Generative value conflicts reveal LLM priorities. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RXCRKAcv3B
work page 2026
-
[32]
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...
work page 2024
-
[33]
Appworld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association ...
-
[34]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains. CoRR, abs/2406.12045, 2024. doi: 10.48550/ARXIV.2406.12045. URL https://doi.org/10.48550/arXiv.2406.12045
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12045 2024
-
[35]
AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 74325–7...
-
[36]
Traject-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use
Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, and Benoit Dumoulin. Traject-bench: A trajectory-aware benchmark for evaluating agentic tool use, 2025. URL https://arxiv.org/abs/2510.04550
-
[37]
Measuring human and AI values based on generative psychometrics with large language models
Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, and Guojie Song. Measuring human and AI values based on generative psychometrics with large language models. In Toby Walsh, Julie Shah, and Zico Kolter, editors, Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Sympos...
-
[38]
Generative psycho-lexical approach for constructing value systems in large language models
Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, and Guojie Song. Generative psycho-lexical approach for constructing value systems in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguisti...
work page 2025
-
[39]
ClawSafety: "Safe" LLMs, Unsafe Agents
Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, and Yingqiang Ge. ClawSafety: "Safe" LLMs, unsafe agents, 2026. URL https://arxiv.org/abs/2604.01438
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[40]
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, and Hanchung Lee. ClawsBench: Evaluating capability and safety of LLM productivity agents in simulated workspaces, 2026. URL https://arxiv.org/abs/2604.05172
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[41]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50528–5065...
-
[42]
Meta-Harness: End-to-End Optimization of Model Harnesses
Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/abs/2603.28052
work page internal anchor Pith review arXiv 2026
-
[43]
Natural-Language Agent Harnesses
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Haitao Zheng. Natural-language agent harnesses. CoRR, abs/2603.25723, 2026. doi: 10.48550/ARXIV.2603.25723. URL https://doi.org/10.48550/arXiv.2603.25723
-
[44]
Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992
work page 1992
-
[45]
Paul Conway and Bertram Gawronski. Deontological and utilitarian inclinations in moral decision making: a process dissociation approach. Journal of personality and social psychology, 104(2):216, 2013
work page 2013
-
[46]
Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. Large language model psychometrics: A systematic review of evaluation, validation, and enhancement. CoRR, abs/2505.08245. doi: 10.48550/ARXIV.2505.08245. URL https://doi.org/10.48550/arXiv.2505.08245
-
[48]
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. EnvScaler: Scaling tool-interactive environments for LLM agent via programmatic synthesis. CoRR, abs/2601.05808, 2026. doi: 10.48550/ARXIV.2601.05808. URL https://doi.org/10.48550/arXiv.2601.05808
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.05808 2026
-
[49]
Toolllm: Facilitating large language models to master 16000+ real-world apis
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning...
work page 2024
-
[51]
Toolace: Winning the points of LLM function calling
Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. Toolace: Winning the points of L...
work page 2025
-
[52]
Toolalpaca: Generalized tool learning for language models with 3000 simulated cases
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases, 2023. URL https://arxiv.org/abs/2306.05301
-
[53]
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the safety of LLM agents. CoRR, abs/2412.14470. doi: 10.48550/ARXIV.2412.14470. URL https://doi.org/10.48550/arXiv.2412.14470
work page internal anchor Pith review arXiv doi:10.48550/arxiv.2412.14470
-
[55]
Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2...
work page 2025
[56] Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, and Kun Wang. GDeR: Safeguarding efficiency, balancing, and robustness via prototypical graph pruning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, 2024.
[57] Jie Yang, Kexin Zhang, Guibin Zhang, Philip S. Yu, and Kaize Ding. Glocal information bottleneck for time series imputation. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 104452–104484. Curran Associates, Inc., 2025.
[58] Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric. CoRR, abs/2602.14069, 2026. doi: 10.48550/ARXIV.2602.14069. URL https://doi.org/10.48550/arXiv.2602.14069.
[59] Lilach Sagiv and Sonia Roccas. How do values affect behavior? Let me count the ways. Personality and Social Psychology Review, 25(4):295–316, 2021.
[60] Bas Verplanken and Rob W. Holland. Motivated decision making: Effects of activation and self-centrality of values on choices and behavior. Journal of Personality and Social Psychology, 82(3):434, 2002.
[61] Rick Jacobs, Ditsa Kafry, and Sheldon Zedeck. Expectations of behaviorally anchored rating scales. Personnel Psychology, 33(3):595–640, 1980.
[62] Samuel Messick. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9):741, 1995.
[63] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems, 2023.
[64] Anthropic. Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025. Accessed: 2026-04-30.
[65] Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026. Accessed: 2026-04-30.
[66] Google. Gemini 3 Flash. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, December 2025. Accessed: 2026-04-30.
[67] Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026. Accessed: 2026-04-30.
[68] OpenAI. GPT-5.4 Thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, March 2026. Accessed: 2026-04-30.
[69] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 2026. Accessed: 2026-04-30.
[70]
[71]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models.CoRR, abs/2512.02556, 2025. doi: 10.48550/ARXIV.2512.02556. URLhttps://doi.org/10.4 8550/arXiv.2512.02556
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556 2025
[72] GLM-5 Team. GLM-5: From vibe coding to agentic engineering. CoRR, abs/2602.15763, 2026. doi: 10.48550/ARXIV.2602.15763. URL https://doi.org/10.48550/arXiv.2602.15763.
[73] Kimi Team. Kimi K2.5: Visual agentic intelligence. CoRR, abs/2602.02276, 2026. doi: 10.48550/ARXIV.2602.02276. URL https://doi.org/10.48550/arXiv.2602.02276.
[74] Llama Team. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407.21783. URL https://doi.org/10.48550/arXiv.2407.21783.
[75] MiniMax. MiniMax M2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, March 2026. Accessed: 2026-04-30.
[76] Qwen Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025. doi: 10.48550/ARXIV.2505.09388. URL https://doi.org/10.48550/arXiv.2505.09388.
[77] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
[78] OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May. Accessed: 2026-04-30.
[80] Anthropic. Claude Code by Anthropic. https://www.anthropic.com/product/claude-code, 2026. Accessed: 2026-04-30.
[81] Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, and Esin Durmus. Stress-testing model specs reveals character differences among language models. CoRR, abs/2510.07686, 2025. doi: 10.48550/ARXIV.2510.07686. URL https://doi.org/10.48550/arXiv.2510.07686.