Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums?

Luyang Zhang; Ramayya Krishnan; Yi-Yun Chu

arxiv: 2606.00603 · v2 · pith:CYQ7PXFCnew · submitted 2026-05-30 · 💻 cs.CY

Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums?

Luyang Zhang , Yi-Yun Chu , Ramayya Krishnan This is my paper

Pith reviewed 2026-06-28 18:19 UTC · model grok-4.3

classification 💻 cs.CY

keywords LLM agentspublic forum moderationopen-weight modelsclosed-weight modelsdeployment choicesresponse variationagent governancedecline rates

0 comments

The pith

Four invisible deployment choices shape LLM-agent responses to challenges in public forums, with open versus closed weights aligning to the visible-hidden decline gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents moderating public forums return inconsistent answers, acknowledgments, repairs, or declines even on identical posts. Four deployment choices that operators usually cannot see each change the response rate: the model version currently served, whether the model is open-weight or closed-weight, which provider handles the call, and the active system-prompt policy. The key pattern is that the tendency to decline more on visible than hidden challenges tracks the open/closed weight boundary across the tested models rather than the access surface. Closed-weight instances decline more on visible challenges in every case; open-weight instances reverse the gap or show none. Because these choices combine to produce substantially different interventions, reliable reproduction or governance of agent behavior requires tracking all four factors instead of the model name alone.

Core claim

Across both open-weight and closed-weight LLMs, the previously reported tendency to decline more on visible than hidden challenges aligns with the open/closed weight boundary in the panel more than with access surface. Every closed-weight cell declines more on visible challenges; every open-weight cell reverses this or shows no gap. The four deployment choices each shift the agent's response rate independently, and their combinations can produce substantially different interventions on the same forum posts. Auditable forum-agent governance therefore requires awareness of model version, weight-release status, provider, and system-prompt policy rather than model name alone.

What carries the argument

The four deployment choices (model version served, open-weight versus closed-weight status, serving provider, and active system-prompt policy) that each independently shift the agent's rate of answering, acknowledging, repairing, or declining on forum posts.

Load-bearing premise

The observed differences in decline rates are driven by the open/closed weight boundary and the other three deployment choices rather than by correlated but unmeasured factors such as model scale, training data, or post-training procedures.

What would settle it

Re-testing the visible-versus-hidden decline gap on additional models while matching or controlling for scale, training data composition, and alignment procedures to determine whether the pattern still tracks the open/closed boundary.

Figures

Figures reproduced from arXiv: 2606.00603 by Luyang Zhang, Ramayya Krishnan, Yi-Yun Chu.

**Figure 2.** Figure 2: Forest plot of ∆ = visible − hidden no-action rate (%) for the nine cells in [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

LLM agents are increasingly used in moderation-relevant public forum workflows, where their choices to answer, acknowledge, repair, or decline are routinely challenged by users, platforms, and regulators. The same agent often returns different responses on identical content, so any defense based on the agent's behavior cannot be reliably reproduced. The variation is structural. Four deployment choices typically invisible to the operator each shift the agent's response rate, and their combinations can produce substantially different interventions on the same forum posts. The four choices are (1) which model version is currently served, which can change between calls without notice; (2) the model's weight-release status (open-weight, with weights publicly downloadable, vs. closed-weight, with weights held by the provider); (3) which provider serves the request; and (4) which system-prompt policy is in force. Across LLMs spanning both open-weight and closed-weight families, we find that the previously reported tendency to decline more on visible than hidden challenges aligns with the open/closed weight boundary in our panel more than with access surface. Every closed-weight cell declines more on visible challenges; every open-weight cell reverses this or shows no gap. Auditable forum-agent governance requires awareness of all four choices, not just the model name, since each independently shifts behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports that decline rates on visible challenges track the open/closed weight boundary in their panel more than API access, but the abstract supplies no controls or panel details to rule out scale and alignment as the real drivers.

read the letter

The main takeaway is that in the models they examined, closed-weight LLMs decline more on visible challenges while open-weight ones show the opposite or no difference, and this pattern follows the weight-release status rather than whether the model is served through an open API.

The authors lay out four deployment variables that shift agent responses on the same forum posts: current model version, open versus closed weights, which provider handles the call, and the active system prompt. They then present the visible-hidden decline gap as aligning with the weight boundary across their panel. That specific mapping is the concrete piece that is not already standard in the literature on output variation.

The limitation is straightforward. The abstract gives no count of models, no description of how they chose the panel, and no indication of any matching or regression that would separate weight status from correlated factors such as parameter count, training data, or post-training alignment strength. Without those steps the observed split could be produced by any of the usual differences between open and closed models. The stress-test note on unmeasured confounders therefore lands directly on the current write-up.

The work is aimed at people who design or regulate AI moderation systems and need to know that naming a model is not enough to predict behavior. A reader focused on governance questions will find the four-variable framing useful even if the causal claim needs more support.

The paper is coherent enough on its own terms and raises a practical, testable point, so it should go to peer review rather than a desk reject. The methods and controls will have to be expanded substantially for the central claim to be convincing.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an empirical panel study of LLM-agent interventions in public forum moderation. It identifies four deployment choices—model version served, weight-release status (open-weight vs. closed-weight), provider, and system-prompt policy—that each shift response rates to user challenges. The central finding is that the previously reported tendency to decline more on visible than hidden challenges aligns with the open/closed weight boundary across the tested LLMs more than with access surface: every closed-weight cell declines more on visible challenges, while every open-weight cell reverses this pattern or shows no gap. The authors conclude that auditable governance requires awareness of all four choices rather than model name alone.

Significance. If the result holds after addressing potential confounders, the work is significant for highlighting structural, often invisible sources of behavioral variability in LLM agents used for moderation-relevant tasks. It provides concrete evidence that deployment decisions beyond model identity affect reproducibility and intervention patterns, with direct implications for governance frameworks in public forums. The cross-family panel offers an initial empirical basis for distinguishing weight-release effects from other factors.

major comments (2)

[Abstract] Abstract: The claim that the visible-vs-hidden decline gap aligns with the open/closed weight boundary 'more than with access surface' rests on an observational contrast. Open- and closed-weight models differ systematically in scale, post-training alignment, and training data; the manuscript provides no indication of regression controls, stratification by scale, or matched-pair designs that would isolate the boundary variable from these correlated factors. This is load-bearing for the causal interpretation of the boundary as the primary driver.
[Abstract] Abstract (and methods section if present): The abstract gives no information on the number of models tested, sample sizes per cell, statistical controls, or how visible vs. hidden challenges were operationalized. Without these details, it is impossible to assess whether the data support the stated alignment with the open/closed boundary or whether the panel is representative.

minor comments (1)

The abstract could specify the size of the panel (number of models and conditions) to allow readers to gauge the scope of the empirical contrast.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the visible-vs-hidden decline gap aligns with the open/closed weight boundary 'more than with access surface' rests on an observational contrast. Open- and closed-weight models differ systematically in scale, post-training alignment, and training data; the manuscript provides no indication of regression controls, stratification by scale, or matched-pair designs that would isolate the boundary variable from these correlated factors. This is load-bearing for the causal interpretation of the boundary as the primary driver.

Authors: We agree the analysis is observational and that weight-release status correlates with other factors such as scale and alignment. The manuscript reports an empirical pattern across the tested panel rather than asserting causality. In revision we will add explicit language in the abstract and a new limitations subsection clarifying the observational nature of the finding and discussing potential confounders including model scale, post-training procedures, and training data. We will also add regression specifications controlling for available covariates (e.g., parameter count) where the data structure permits. revision: partial
Referee: [Abstract] Abstract (and methods section if present): The abstract gives no information on the number of models tested, sample sizes per cell, statistical controls, or how visible vs. hidden challenges were operationalized. Without these details, it is impossible to assess whether the data support the stated alignment with the open/closed boundary or whether the panel is representative.

Authors: We will revise the abstract to state the number of models tested, sample sizes per cell, and a concise description of how visible versus hidden challenges were operationalized. Full details on statistical controls and operationalization will be confirmed or expanded in the methods section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical panel with no derivations or fitted predictions

full rationale

The manuscript is an empirical study reporting observed differences in LLM-agent decline rates on visible vs. hidden challenges across a panel of models. The central finding is stated as a direct contrast: every closed-weight model shows higher decline on visible challenges while every open-weight model reverses or shows no gap. No equations, parameter estimation, predictions derived from fitted inputs, or derivation chains appear. Claims rest on tabulated empirical outcomes rather than any reduction to prior self-citations or self-defined quantities. The study is self-contained against external benchmarks and receives a normal non-finding for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on empirical observations from a panel of LLMs; the abstract introduces no mathematical derivations, free parameters, axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5762 in / 1173 out tokens · 20774 ms · 2026-06-28T18:19:30.432838+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 17 canonical work pages · 14 internal anchors

[1]

Chen, Lingjiao and Zaharia, Matei and Zou, James , journal =. How Is
[2]

Transactions on Machine Learning Research , year =

Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =
[3]

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST) , year =

Generative Agents: Interactive Simulacra of Human Behavior , author =. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST) , year =
[4]

Zhou, Xuhui and Zhu, Hao and Mathur, Leena and Zhang, Ruohong and Yu, Haofei and Qi, Zhengyang and Morency, Louis-Philippe and Bisk, Yonatan and Fried, Daniel and Neubig, Graham and Sap, Maarten , booktitle =
[5]

Decision-Oriented Dialogue for Human-

Lin, Jessy and Tomlin, Nicholas and Andreas, Jacob and Eisner, Jason , booktitle =. Decision-Oriented Dialogue for Human-
[11]

Constitutional

Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron , journal =. Constitutional
[12]

Scaling Monosemanticity: Extracting Interpretable Features from

Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jermyn, Andy , journal =. Scaling Monosemanticity: Extracting Interpretable Features from
[13]

, booktitle =

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. , booktitle =. Judging
[14]

and Maxwell, Tim and Cheng, Newton , journal =

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton , journal =. Sleeper Agents: Training Deceptive
[15]

International Conference on Machine Learning (ICML) , year =

Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models , author =. International Conference on Machine Learning (ICML) , year =
[16]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[19]

AAAI Conference on Artificial Intelligence , year =

A Holistic Approach to Undesired Content Detection in the Real World , author =. AAAI Conference on Artificial Intelligence , year =
[20]

Hartvigsen, Thomas and Gabriel, Saadia and Palangi, Hamid and Sap, Maarten and Ray, Dipankar and Kamar, Ece , booktitle =
[21]

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and Khabsa, Madian , journal =
[23]

Jailbroken: How Does

Wei, Alexander and Haghtalab, Nika and Steinhardt, Jacob , booktitle =. Jailbroken: How Does
[24]

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle =
[27]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =
[28]

Shinn, Noah and Cassano, Federico and Berman, Edward and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu , booktitle =
[29]

and Burger, Doug and Wang, Chi , journal =

Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi , journal =
[30]

Transactions on Machine Learning Research (TMLR) , year =

Voyager: An Open-Ended Embodied Agent with Large Language Models , author =. Transactions on Machine Learning Research (TMLR) , year =
[31]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti , journal =
[32]

, booktitle =

Bowman, Samuel R. , booktitle =. The Dangers of Underclaiming: Reasons for Caution When Reporting How
[33]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Mind the Gap: Assessing Temporal Generalization in Neural Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[34]

and Martinez-Plumed, Fernando and Tenenbaum, Joshua B

Burnell, Ryan and Schellaert, Wout and Burden, John and Ullman, Tomer D. and Martinez-Plumed, Fernando and Tenenbaum, Joshua B. and Rutar, Danaja and Cheke, Lucy G. and Sohl-Dickstein, Jascha and Mitchell, Melanie and Kiela, Douwe and Shanahan, Murray and Voorhees, Ellen M. and Cohn, Anthony G. and Leibo, Joel Z. and Hernandez-Orallo, Jose , journal =. Re...
[36]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, and Tom Henighan. 2022 a . Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, and Cameron McKinnon. 2022 b . Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Samuel R. Bowman. 2022. The dangers of underclaiming: Reasons for caution when reporting how NLP systems fail. In Association for Computational Linguistics (ACL)

2022
[39]

Ullman, Fernando Martinez-Plumed, Joshua B

Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, and Jose Hernandez-Orallo. 2023. Rethink reporting of evaluation results in AI . Science, 380(6641...

2023
[40]

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. How is ChatGPT 's behavior changing over time? arXiv preprint arXiv:2307.09009

work page arXiv 2023
[41]

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, and Kamal Ndousse. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Amelia Glaese, Nat McAleese, Maja Tr e bacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, and Phoebe Thacker. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. ToxiGen : A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Association for Computational Linguistics (ACL)

2022
[44]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, and Newton Cheng. 2024. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard : LLM -based input-output safeguard for human- AI conversations. arXiv preprint arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Deepak Kumar, Yousef Anees AbuHashem, and Zakir Durumeric. 2023. Watch your language: Investigating content moderation with large language models. arXiv preprint arXiv:2309.14517

work page arXiv 2023
[47]

Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, and Sebastian Ruder. 2021. Mind the gap: Assessing temporal generalization in neural language models. In Advances in Neural Information Processing Systems (NeurIPS)

2021
[48]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, and Ananya Kumar. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research

2023
[49]

Jessy Lin, Nicholas Tomlin, Jacob Andreas, and Jason Eisner. 2024. Decision-oriented dialogue for human- AI collaboration. In Transactions of the Association for Computational Linguistics

2024
[50]

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In AAAI Conference on Artificial Intelligence

2023
[51]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning (ICML)

2024
[52]

Meta AI . 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

OpenAI . 2025. gpt-oss-20b : Open-weight reasoning model release. Vendor blog release

2025
[54]

Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, and Alex Ray

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, and Alex Ray. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS)

2022
[55]

Bernstein

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST)

2023
[56]

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, and Saurav Kadavath. 2022. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251

work page internal anchor Pith review Pith/arXiv arXiv 2022
[57]

Qwen Team . 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Qwen Team . 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388. Anthology preprint

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Paul R \"o ttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. XSTest : A test suite for identifying exaggerated safety behaviours in large language models. In NAACL

2024
[60]

Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer : Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, and Scott R. Johnston. 2023. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion : Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS)

2023
[63]

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, and Andy Jermyn. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet . Anthropic Research

2024
[64]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, and Shruti Bhosale. 2023. Llama 2 : Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An open-ended embodied agent with large language models. In Transactions on Machine Learning Research (TMLR)

2024
[66]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS)

2023
[67]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2023. AutoGen : Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct : Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)

2023
[69]

Luyang Zhang, Yi-Yun Chu, Jialu Wang, Beibei Li, and Ramayya Krishnan. 2026. Do agents repair when challenged or just reply? challenge, repair, and public correction in a deployed agent forum. arXiv preprint arXiv:2604.00518

work page arXiv 2026
[70]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, and Eric P. Xing. 2023. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track

2023
[71]

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024 a . Language agent tree search unifies reasoning, acting, and planning in language models. In International Conference on Machine Learning (ICML)

2024
[72]

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. 2024 b . SOTOPIA : Interactive evaluation for social intelligence in language agents. In International Conference on Learning Representations (ICLR)

2024

[1] [1]

Chen, Lingjiao and Zaharia, Matei and Zou, James , journal =. How Is

[2] [2]

Transactions on Machine Learning Research , year =

Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =

[3] [3]

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST) , year =

Generative Agents: Interactive Simulacra of Human Behavior , author =. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST) , year =

[4] [4]

Zhou, Xuhui and Zhu, Hao and Mathur, Leena and Zhang, Ruohong and Yu, Haofei and Qi, Zhengyang and Morency, Louis-Philippe and Bisk, Yonatan and Fried, Daniel and Neubig, Graham and Sap, Maarten , booktitle =

[5] [5]

Decision-Oriented Dialogue for Human-

Lin, Jessy and Tomlin, Nicholas and Andreas, Jacob and Eisner, Jason , booktitle =. Decision-Oriented Dialogue for Human-

[6] [11]

Constitutional

Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron , journal =. Constitutional

[7] [12]

Scaling Monosemanticity: Extracting Interpretable Features from

Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jermyn, Andy , journal =. Scaling Monosemanticity: Extracting Interpretable Features from

[8] [13]

, booktitle =

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. , booktitle =. Judging

[9] [14]

and Maxwell, Tim and Cheng, Newton , journal =

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton , journal =. Sleeper Agents: Training Deceptive

[10] [15]

International Conference on Machine Learning (ICML) , year =

Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models , author =. International Conference on Machine Learning (ICML) , year =

[11] [16]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[12] [19]

AAAI Conference on Artificial Intelligence , year =

A Holistic Approach to Undesired Content Detection in the Real World , author =. AAAI Conference on Artificial Intelligence , year =

[13] [20]

Hartvigsen, Thomas and Gabriel, Saadia and Palangi, Hamid and Sap, Maarten and Ray, Dipankar and Kamar, Ece , booktitle =

[14] [21]

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and Khabsa, Madian , journal =

[15] [23]

Jailbroken: How Does

Wei, Alexander and Haghtalab, Nika and Steinhardt, Jacob , booktitle =. Jailbroken: How Does

[16] [24]

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle =

[17] [27]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =

[18] [28]

Shinn, Noah and Cassano, Federico and Berman, Edward and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu , booktitle =

[19] [29]

and Burger, Doug and Wang, Chi , journal =

Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi , journal =

[20] [30]

Transactions on Machine Learning Research (TMLR) , year =

Voyager: An Open-Ended Embodied Agent with Large Language Models , author =. Transactions on Machine Learning Research (TMLR) , year =

[21] [31]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti , journal =

[22] [32]

, booktitle =

Bowman, Samuel R. , booktitle =. The Dangers of Underclaiming: Reasons for Caution When Reporting How

[23] [33]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Mind the Gap: Assessing Temporal Generalization in Neural Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[24] [34]

and Martinez-Plumed, Fernando and Tenenbaum, Joshua B

Burnell, Ryan and Schellaert, Wout and Burden, John and Ullman, Tomer D. and Martinez-Plumed, Fernando and Tenenbaum, Joshua B. and Rutar, Danaja and Cheke, Lucy G. and Sohl-Dickstein, Jascha and Mitchell, Melanie and Kiela, Douwe and Shanahan, Murray and Voorhees, Ellen M. and Cohn, Anthony G. and Leibo, Joel Z. and Hernandez-Orallo, Jose , journal =. Re...

[25] [36]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, and Tom Henighan. 2022 a . Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [37]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, and Cameron McKinnon. 2022 b . Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [38]

Samuel R. Bowman. 2022. The dangers of underclaiming: Reasons for caution when reporting how NLP systems fail. In Association for Computational Linguistics (ACL)

2022

[28] [39]

Ullman, Fernando Martinez-Plumed, Joshua B

Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, and Jose Hernandez-Orallo. 2023. Rethink reporting of evaluation results in AI . Science, 380(6641...

2023

[29] [40]

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. How is ChatGPT 's behavior changing over time? arXiv preprint arXiv:2307.09009

work page arXiv 2023

[30] [41]

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, and Kamal Ndousse. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858

work page internal anchor Pith review Pith/arXiv arXiv 2022

[31] [42]

Amelia Glaese, Nat McAleese, Maja Tr e bacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, and Phoebe Thacker. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375

work page internal anchor Pith review Pith/arXiv arXiv 2022

[32] [43]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. ToxiGen : A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Association for Computational Linguistics (ACL)

2022

[33] [44]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, and Newton Cheng. 2024. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [45]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard : LLM -based input-output safeguard for human- AI conversations. arXiv preprint arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [46]

Deepak Kumar, Yousef Anees AbuHashem, and Zakir Durumeric. 2023. Watch your language: Investigating content moderation with large language models. arXiv preprint arXiv:2309.14517

work page arXiv 2023

[36] [47]

Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, and Sebastian Ruder. 2021. Mind the gap: Assessing temporal generalization in neural language models. In Advances in Neural Information Processing Systems (NeurIPS)

2021

[37] [48]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, and Ananya Kumar. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research

2023

[38] [49]

Jessy Lin, Nicholas Tomlin, Jacob Andreas, and Jason Eisner. 2024. Decision-oriented dialogue for human- AI collaboration. In Transactions of the Association for Computational Linguistics

2024

[39] [50]

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In AAAI Conference on Artificial Intelligence

2023

[40] [51]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning (ICML)

2024

[41] [52]

Meta AI . 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [53]

OpenAI . 2025. gpt-oss-20b : Open-weight reasoning model release. Vendor blog release

2025

[43] [54]

Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, and Alex Ray

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, and Alex Ray. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS)

2022

[44] [55]

Bernstein

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST)

2023

[45] [56]

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, and Saurav Kadavath. 2022. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251

work page internal anchor Pith review Pith/arXiv arXiv 2022

[46] [57]

Qwen Team . 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [58]

Qwen Team . 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388. Anthology preprint

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [59]

Paul R \"o ttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. XSTest : A test suite for identifying exaggerated safety behaviours in large language models. In NAACL

2024

[49] [60]

Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer : Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [61]

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, and Scott R. Johnston. 2023. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [62]

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion : Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS)

2023

[52] [63]

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, and Andy Jermyn. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet . Anthropic Research

2024

[53] [64]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, and Shruti Bhosale. 2023. Llama 2 : Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [65]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An open-ended embodied agent with large language models. In Transactions on Machine Learning Research (TMLR)

2024

[55] [66]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS)

2023

[56] [67]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2023. AutoGen : Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [68]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct : Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)

2023

[58] [69]

Luyang Zhang, Yi-Yun Chu, Jialu Wang, Beibei Li, and Ramayya Krishnan. 2026. Do agents repair when challenged or just reply? challenge, repair, and public correction in a deployed agent forum. arXiv preprint arXiv:2604.00518

work page arXiv 2026

[59] [70]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, and Eric P. Xing. 2023. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track

2023

[60] [71]

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024 a . Language agent tree search unifies reasoning, acting, and planning in language models. In International Conference on Machine Learning (ICML)

2024

[61] [72]

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. 2024 b . SOTOPIA : Interactive evaluation for social intelligence in language agents. In International Conference on Learning Representations (ICLR)

2024