pith. sign in

arxiv: 2605.24279 · v1 · pith:RQCOQW3Enew · submitted 2026-05-22 · 💻 cs.CL · cs.SE

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

Pith reviewed 2026-06-30 15:15 UTC · model grok-4.3

classification 💻 cs.CL cs.SE
keywords persona driftagentic codinglong contextbenchmarklanguage model evaluationcontext compactionidentity probesfrontier models
0
0 comments X

The pith

Persona drift occurs generally across frontier models in long agentic-coding sessions, resists compaction, and yields to single-shot anchoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ContextEcho, a benchmark designed to track how language models' initial personas change over thousands of turns in realistic coding sessions involving tools. It applies this to sessions of 3,746 to 9,716 turns and tests 23 models, finding that drift appears across different organizations rather than being limited to specific model families. The work also shows that standard in-session compaction does not consistently restore the original persona, whereas inserting a single anchor prompt does, and that the drift alters behavior differently in tool-using versus pure chat modes. This matters because production deployments run in exactly these long sessions, so evaluations on short interactions may miss user-visible changes.

Core claim

ContextEcho shows that a frontier language model's acknowledged helpful programming assistant persona does not survive long agentic-coding sessions. After hours of tool-using debugging, models begin asserting preferences they initially hedged. The benchmark uses a 25-probe identity suite and a snapshot-then-probe protocol on three anonymized Claude Code sessions to establish that persona drift is general across organizations, that compaction does not reliably reset it, and that a single-shot anchor restores the trained register. It further finds mode-dependent downstream effects on tool continuation and formatting.

What carries the argument

The 25-probe identity suite paired with a snapshot-then-probe protocol that forks conversation state to measure drift without perturbing the main session.

Load-bearing premise

The 25-probe identity suite and snapshot-then-probe protocol accurately measure persona drift without the measurement process itself perturbing the session or introducing artifacts that affect the observed drift.

What would settle it

Running the 25-probe suite on models after long sessions and finding no measurable shift from the initial persona, or finding that compaction consistently returns models to their starting register across the tested sessions.

Figures

Figures reproduced from arXiv: 2605.24279 by Bill Zhao, Changwei Liu, Xianzhong Ding, Yangyang Yu.

Figure 1
Figure 1. Figure 1: ContextEcho probe-detected persona drift across a 9,643-turn Claude Code session. (a) Behavioral persona space: 6 deterministic linguistic features extracted from each response . The 4-point LLM-judge label is held out of the PCA features and used only to color points; the cluster separation is therefore identified on signals the judge does not see, reducing the plausibility of judge-circular artefacts. Ve… view at source ↗
Figure 2
Figure 2. Figure 2: The ContextEcho probe suite: 25 probes across 5 categories, with verbatim text and per-category drift gap. ∆ is the per-category mean drift gap (filler − claude judge score, positive = drift) averaged across the 6 cross-organization drifters with full 12-position data. Relational (+0.63) and Coding-Self (+0.61) carry the largest drift; Identity (+0.09, mechanically factual) carries the least. Long probes a… view at source ↗
Figure 3
Figure 3. Figure 3: We make three observations. First, among the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 3
Figure 3. Figure 3: Persona drift across 23 frontier models from 10 organizations. Each row is one model; markers are filler-arm (⃝) and claude-arm (▲) mean judge scores on the 5-coding-self sub-battery with 95% clustered bootstrap CIs. Right-margin ∆ is the drift gap (filler − claude); bold ∆ marks |∆| ≥ 0.30. Yellow shading marks reasoning-tier models; blue shading marks non-reasoning-tier models. Hollow markers indicate th… view at source ↗
Figure 4
Figure 4. Figure 4: A-anchor restores the deployed-Assistant register across all 23 targets. Markers and 95% clustered bootstrap CIs: ⃝ filler-arm; ▲ claude-arm (drift); ■ claude-arm + A-anchor. Rows sorted by drift gap; yellow shading marks reasoning-tier and blue shading marks non-reasoning-tier models; hollow markers mark pilot (npos=1) rows. Q5: Deployment Cost and Mode Dependence. To assess whether unmitigated drift affe… view at source ↗
Figure 5
Figure 5. Figure 5: Drift breaks contracts and inflates tokens; A-anchor recovers both. Left: compliance rate on S2. Right: length ratio vs. filler (log scale). Markers and 95% clustered bootstrap CIs: ⃝ filler; ▲ claude-arm (drift); ■ claude-arm + A-anchor. Right margins quantify drift drop (claude vs. filler) and anchor recovery; for the length ratio, values <1× indicate the anchor response is shorter than the filler-arm re… view at source ↗
Figure 6
Figure 6. Figure 6: [Robustness] Panel-wide drift on the full 25-probe identity battery, all 23 targets. Same forest plot conventions as [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: [Cross-session] Probe-judge trajectories on Sonnet 4.5 across 3 donated sessions. Each panel: 12 measurement positions, 5-coding-self probes per position, paired filler-arm control. The drift gap is annotated per panel; ⋆ marks |gap| ≥ 0.30. The Session 3 result (non-coding domain) rules out a coding-specific register artifact. Target Session 1 Session 2 Session 3 Sonnet 4.6 6.83× 8.67× 7.78× Sonnet 4.5 7.… view at source ↗
Figure 8
Figure 8. Figure 8: accompanies the §3.2 claim that A-anchor immunizes at least 20 subsequent unanchored turns on Sonnet 4.5. 0 100 101 102 N Unanchored Turns Inserted Between A-Anchor and Probe 0 1 2 3 Mean Judge Score (0=Drifted 3=Fully Assistant) filler-arm baseline (1.47) claude-arm drift baseline (0.83) individual probe scores (n = 5 per offset) mean (anchor + N unanchored turns) [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Anchor-size sensitivity at P5, 5 coding-self probes per cell. Bars are mean judge score per (target, size). Small (∼ 30 tokens, V0 identity sentence only) is sufficient to peg the rubric ceiling on Sonnet 4.6, Sonnet 4.5, and Haiku 4.5; reaches 2.80 on Opus 4.1. Medium (the shipped ∼ 75-token V0 + V2 recipe) and large (∼ 200-token V0 + V2 plus 2 extra format demos) yield comparable scores on the Anthropic … view at source ↗
Figure 10
Figure 10. Figure 10: Drift-onset curves: 4 Anthropic targets on the 5-coding-self sub-battery, 8 log-spaced turn positions in the pre-C1 regime, n=25 per cell. Markers and 95% bootstrap CIs: ■ drift gap (filler − claude), log-spaced x-axis. Red dashed line at |∆|=0.30 marks the drift threshold used in the body. Three distinct onset profiles within one family: Sonnet 4.5 shows drift at turn 1 (+0.68, immediate onset); Sonnet 4… view at source ↗
Figure 11
Figure 11. Figure 11: [Substrate steering] Qwen 3 32B dose-response on Lu et al.’s Assistant Axis. As steering dose increases (x-axis), the activation projection toward the Assistant cluster recovers (blue), but the visible probe judge score does not track the recovery (red). Surface re-anchoring (A-anchor) and substrate steering operate on decoupled signals on this target. M Downstream cost: SWE-Bench and TerminalBench detail… view at source ↗
Figure 12
Figure 12. Figure 12: [Cross-judge] Sonnet 4.6 (primary) vs. GPT-5 (audit) on the panel-wide 5-coding￾self battery. n = 190 paired scores at the P5 position across 19 panel targets. Exact agreement 61.1%, within-one 93.7%, Cohen κ=0.42, Spearman ρ=0.75. The panel-wide drift gap is direction￾consistent across judges: +0.32 on Sonnet, +0.27 on GPT-5. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
read the original abstract

A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences ("I don't have preferences") may begin asserting them ("Python - the feedback loop is instant..."), revealing user-visible drift that deployer evaluations may miss. Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns. Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length. Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ContextEcho, a benchmark and open-source harness for measuring persona drift in long agentic-coding sessions. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks state without perturbing the main session, judged and judge-free metrics, and three anonymized Claude Code sessions (3,746–9,716 turns). Applied to 23 frontier models, the work claims that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset drift, that a single-shot anchor restores the trained register, and that drift produces mode-dependent downstream effects on tool continuation, formatting, and output length.

Significance. If the measurement protocol is shown to be non-perturbing, ContextEcho would fill a clear gap between short-dialogue persona studies and the multi-thousand-turn, tool-using regimes actually used in production coding agents. The reusable harness and the finding that drift is organization-general (rather than model-family-specific) would give deployers a concrete auditing tool and could shift evaluation practices away from short-context tests. The open-source release and the reported restoration effect of a single anchor are concrete strengths that would make the benchmark immediately usable by others.

major comments (3)
  1. [Abstract] Abstract and implied Methods: the central claim that the snapshot-then-probe protocol measures drift without perturbing the main session is load-bearing for every reported result (drift generality, compaction failure, anchor restoration). No implementation details are supplied on how forking is achieved across 23 distinct API targets, nor are any control experiments reported that test whether probe turns leak into or alter the 3k–9k-turn trajectories.
  2. [Abstract] Abstract: results are stated across 23 models and three sessions, yet the provided text supplies no data tables, per-model or per-session statistics, exclusion criteria, or inter-rater agreement numbers for the judged metrics. Without these, the quantitative support for the headline claims cannot be evaluated.
  3. [Abstract] Abstract: the assertion that drift is 'general across organizations rather than family-specific' requires explicit cross-family statistical comparison; the current text gives no indication of how family membership was defined or what test was used to support the 'rather than' claim.
minor comments (2)
  1. [Abstract] The abstract refers to 'complementary judged and judge-free measurement surfaces' without defining either surface or how they are combined.
  2. [Abstract] Session lengths are given as ranges (3,746-9,716 turns) but the exact turn counts and compaction points for each of the three sessions are not stated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of ContextEcho as a benchmark. We address each major comment below with specific plans for revision where the manuscript requires strengthening.

read point-by-point responses
  1. Referee: [Abstract] Abstract and implied Methods: the central claim that the snapshot-then-probe protocol measures drift without perturbing the main session is load-bearing for every reported result (drift generality, compaction failure, anchor restoration). No implementation details are supplied on how forking is achieved across 23 distinct API targets, nor are any control experiments reported that test whether probe turns leak into or alter the 3k–9k-turn trajectories.

    Authors: We agree that the abstract is concise and that the current Methods section does not supply sufficient implementation detail on the forking mechanism or control experiments to fully substantiate the non-perturbing claim. In the revised manuscript we will expand Section 3.2 with explicit descriptions of the API-specific forking procedures used across the 23 targets, include pseudocode for the snapshot-then-probe process, and add a dedicated control-experiment subsection reporting quantitative checks (e.g., continuation metrics and token-level divergence) confirming that probe turns do not leak into or alter the main trajectories. revision: yes

  2. Referee: [Abstract] Abstract: results are stated across 23 models and three sessions, yet the provided text supplies no data tables, per-model or per-session statistics, exclusion criteria, or inter-rater agreement numbers for the judged metrics. Without these, the quantitative support for the headline claims cannot be evaluated.

    Authors: The supplementary materials contain per-model and per-session statistics together with exclusion criteria, but these are not referenced or summarized in the main text. We will add a new main-text table (Table 2) summarizing key per-model drift rates, per-session statistics, and exclusion criteria, and we will report inter-rater agreement for the judged metrics (Cohen’s kappa) in the revised Results section. revision: yes

  3. Referee: [Abstract] Abstract: the assertion that drift is 'general across organizations rather than family-specific' requires explicit cross-family statistical comparison; the current text gives no indication of how family membership was defined or what test was used to support the 'rather than' claim.

    Authors: Family membership was defined by the developing organization. The current text does not present the requested statistical comparison. In the revision we will add an explicit mixed-effects model analysis (Section 4.2) that tests family as a factor, reports the associated p-values and variance components, and thereby supports or qualifies the “rather than family-specific” phrasing. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain or fitted quantities

full rationale

This is a measurement benchmark paper introducing ContextEcho with a 25-probe suite and snapshot-then-probe protocol. The provided text contains no equations, no fitted parameters, no predictions derived from inputs, and no self-citations used to justify core claims. Results are presented as direct empirical observations across 23 models on anonymized sessions. The non-perturbation assumption for the protocol is a methodological claim about validity, not a reduction of any result to its own inputs by construction. No patterns of self-definitional, fitted-input, or self-citation circularity apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper; no free parameters, mathematical axioms, or invented physical entities are introduced.

pith-pipeline@v0.9.1-grok · 5808 in / 1017 out tokens · 25699 ms · 2026-06-30T15:15:26.984616+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Measuring What Persists: Conditioning Mechanisms and a Geometric Framework for AI Agent Identity

    cs.AI 2026-06 unverdicted novelty 4.0

    Presents a geometric framework for measuring AI agent identity via √JSD spaces and magnitude homology, identifies two conditioning mechanisms, and attributes apparent drift to padding artifacts rather than context length.

Reference graph

Works this paper leans on

100 extracted references · 60 canonical work pages · cited by 1 Pith paper · 25 internal anchors

  1. [1]

    Abdulhai, R

    M. Abdulhai, R. Cheng, D. Clay, T. Althoff, and S. Levine. Consistently simulating human personas with multi-turn reinforcement learning.arXiv preprint arXiv:2511.00222, 2025. 9

  2. [2]

    Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

    C. Ackerman and N. Panickssery. Inspection and control of self-generated-text recognition ability in Llama3-8b-Instruct.arXiv preprint arXiv:2410.02064, 2024

  3. [3]

    Many-shot jailbreaking.Anthropic technical report, 2024

    Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, et al. Many-shot jailbreaking.Anthropic technical report, 2024

  4. [4]

    Findings from a pilot anthropic-OpenAI alignment evaluation exercise

    Anthropic and OpenAI. Findings from a pilot anthropic-OpenAI alignment evaluation exercise. alignment.anthropic.com/2025/openai-findings, 2025

  5. [5]

    Detecting and preventing distillation attacks

    Anthropic Trust and Safety. Detecting and preventing distillation attacks. Anthropic blog post / news disclosure, 2026. Discloses 16M Claude API queries from suspected distillation campaigns by DeepSeek, Moonshot, MiniMax

  6. [6]

    Refusal in Language Models Is Mediated by a Single Direction

    A. Arditi, O. Obeso, A. Syed, D. Paleka, and N. Panickssery. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

  7. [7]

    Y . Bai, X. Lv, J. Zhang, et al. LongBench: A bilingual, multitask benchmark for long context understanding.arXiv preprint arXiv:2308.14508, 2023

  8. [9]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  9. [10]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

  10. [11]

    Betley, X

    J. Betley, X. Bao, M. Soto, A. Sztyber-Betley, and J. Chua. Tell me about yourself: LLMs are aware of their learned behaviors.arXiv preprint arXiv:2501.11120, 2025

  11. [12]

    F. J. Binder, J. Chua, T. Korbak, H. Sleight, and J. Hughes. Looking inward: Language models can learn about themselves by introspection.arXiv preprint arXiv:2410.13787, 2024

  12. [13]

    Discovering latent knowledge in language models without supervision.ICLR, 2023

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision.ICLR, 2023

  13. [14]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. InIEEE S&P, 2022

  14. [15]

    J. Chen, X. Wang, R. Xu, et al. From persona to personalization: A survey on role-playing language agents.arXiv preprint arXiv:2404.18231, 2024

  15. [16]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  16. [17]

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Runjin Chen, Andy Arditi, Henry Sleight, et al. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

  17. [18]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    W. Chiang, L. Zheng, Y . Sheng, et al. Chatbot arena: An open platform for evaluating LLMs by human preference.arXiv preprint arXiv:2403.04132, 2024

  18. [20]

    Examining identity drift in conversations of llm agents.arXiv preprint arXiv:2412.00804, 2024

    Junhyuk Choi, Yeseon Hong, Minju Kim, and Bugeun Kim. Examining identity drift in conversations of llm agents.arXiv preprint arXiv:2412.00804, 2024. 10

  19. [21]

    PaLM: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, et al. PaLM: Scaling language modeling with pathways. 2023

  20. [22]

    J. Chua, E. Rees, H. Batra, et al. Bias-augmented consistency training reduces biased reasoning in chain-of-thought.arXiv preprint arXiv:2403.05518, 2024

  21. [23]

    Dongre, R

    V . Dongre, R. A. Rossi, V . D. Lai, D. S. Yoon, D. Hakkani-Tür, and T. Bui. Drift no more? context equilibria in multi-turn llm interactions.arXiv preprint arXiv:2510.07777, 2025

  22. [24]

    Dunefsky, P

    J. Dunefsky, P. Chlenski, and N. Nanda. Transcoders find interpretable LLM feature circuits. arXiv preprint arXiv:2406.11944, 2024

  23. [25]

    Toy models of superposition.Transformer Circuits Thread, 2022

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.Transformer Circuits Thread, 2022

  24. [26]

    Fanous et al

    A. Fanous et al. SycEval: Evaluating LLM sycophancy.arXiv preprint arXiv:2502.08177, 2025

  25. [27]

    Insights into llm long-context failures: when transformers know but don’t tell

    Muhan Gao, TaiMing Lu, Kuai Yu, Adam Byerly, and Daniel Khashabi. Insights into llm long-context failures: when transformers know but don’t tell. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 7611–7625, 2024

  26. [28]

    Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

  27. [29]

    Ghandeharioun, A

    A. Ghandeharioun, A. Yuan, M. Guerard, E. Reif, M. A. Lepori, and L. Dixon. Who’s asking? user personas and the mechanics of latent misalignment.NeurIPS 2024 / arXiv preprint arXiv:2406.12094, 2024

  28. [30]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, et al. A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

  29. [31]

    When attention sink emerges in language models: An empirical view

    Xiangming Gu, Tianyu Pang, Chao Du, et al. When attention sink emerges in language models: An empirical view. InInternational Conference on Learning Representations (ICLR), 2025

  30. [32]

    C. Han, Q. Wang, H. Peng, W. Xiong, and Y . Chen. LM-Infinite: Zero-shot extreme length generalization for large language models.arXiv preprint arXiv:2308.16137, 2023

  31. [33]

    Context rot: How increasing input tokens impacts LLM performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Chroma Research Technical Report, 2025

  32. [34]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    C. Hsieh, S. Sun, S. Kriman, et al. RULER: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  33. [35]

    LLMLingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

  34. [36]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024

  35. [37]

    ACON: Optimizing Context Compression for Long-horizon LLM Agents

    Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

  36. [38]

    Ai agents that matter

    Sayash Kapoor, Benedikt Stroebl, Zachary Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter. InarXiv preprint arXiv:2407.01502, 2024

  37. [39]

    AgentBench: Evaluating llms as agents

    Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AgentBench: Evaluating llms as agents. InICLR, 2024. 11

  38. [40]

    H. R. Kirk, A. Whitefield, P. Röttger, et al. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models.arXiv preprint arXiv:2404.16019, 2024

  39. [41]

    Large language models are zero-shot reasoners.Advances in Neural Information Processing Systems, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in Neural Information Processing Systems, 2022

  40. [42]

    LLMs Get Lost In Multi-Turn Conversation

    P. Laban, H. Hayashi, Y . Zhou, and J. Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025

  41. [43]

    M. Levy, A. Jacoby, and Y . Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models.arXiv preprint arXiv:2402.14848, 2024

  42. [44]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 2020

  43. [45]

    Measuring and controlling instruction (in)stability in language model dialogs

    Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Measuring and controlling instruction (in)stability in language model dialogs. InFirst Conference on Language Modeling, 2024

  44. [46]

    Measuring and controlling instruction (in)stability in language model dialogs.arXiv preprint arXiv:2402.10962, 2024

    Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Measuring and controlling instruction (in)stability in language model dialogs.arXiv preprint arXiv:2402.10962, 2024

  45. [47]

    The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025

    Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025

  46. [48]

    J. Lindsey. Emergent introspective awareness in large language models.arXiv preprint arXiv:2601.01828, 2026

  47. [49]

    Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  48. [50]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12, 2024

  49. [51]

    Jailbreaking black box large language models in twenty queries

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Jailbreaking black box large language models in twenty queries. InNeurIPS, 2024

  50. [52]

    Agenteval: Holistic evaluation of llm agents.arXiv preprint arXiv:2403.16965, 2024

    Xinran Liu, Yifan Wang, and Wei Chen. Agenteval: Holistic evaluation of llm agents.arXiv preprint arXiv:2403.16965, 2024

  51. [53]

    Gpteval: Nlg evaluation using gpt-4 with better human alignment.EMNLP, 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment.EMNLP, 2023

  52. [54]

    The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

    Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

  53. [55]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

  54. [56]

    Rossi, Se- unghyun Yoon, and Hinrich Schütze

    Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Se- unghyun Yoon, and Hinrich Schütze. NoLiMa: Long-context evaluation beyond literal matching.arXiv preprint arXiv:2502.05167, 2025

  55. [57]

    N. Mu, J. Lu, M. Lavery, and D. Wagner. A closer look at system prompt robustness.arXiv preprint arXiv:2502.12197, 2025. 12

  56. [58]

    Progress measures for grokking via mechanistic interpretability.ICLR, 2023

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability.ICLR, 2023

  57. [59]

    Zoom in: An introduction to circuits.Distill, 2020

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits.Distill, 2020

  58. [60]

    Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  59. [61]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560, 2023

  60. [62]

    Llm evaluators recognize and favor their own generations.NeurIPS, 2024

    Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations.NeurIPS, 2024

  61. [63]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.arXiv preprint arXiv:2305.15334, 2023

  62. [64]

    Discovering Language Model Behaviors with Model-Written Evaluations

    E. Perez, S. Ringer, K. Lukoši¯ut˙e, et al. Discovering language model behaviors with model- written evaluations.arXiv preprint arXiv:2212.09251, 2022

  63. [65]

    Steering llama 2 via contrastive activation addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan MacDiarmid, Thomas Maxwell, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. InACL, 2024

  64. [66]

    Can you trust LLM judgments? reliability of LLM-as-a-judge.arXiv preprint arXiv:2412.12509, 2024

    Kayla Schroeder and Zach Wood-Doughty. Can you trust LLM judgments? reliability of LLM-as-a-judge.arXiv preprint arXiv:2412.12509, 2024

  65. [67]

    Persona-driven sycophancy in large language models.arXiv preprint arXiv:2402.08471, 2024

    Nikhil Shah, Alexander Wei, and Aaryan Bhattacharya. Persona-driven sycophancy in large language models.arXiv preprint arXiv:2402.08471, 2024

  66. [68]

    Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

    Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

  67. [69]

    Shanahan, K

    M. Shanahan, K. McDonell, and L. Reynolds. Role-play with large language models.arXiv preprint arXiv:2305.16367, 2023

  68. [70]

    Y . Shao, L. Li, J. Dai, and X. Qiu. Character-LLM: A trainable agent for role-playing.arXiv preprint arXiv:2310.10158, 2023

  69. [71]

    Towards controllable biases in language generation.Findings of the EMNLP, 2020

    Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. Towards controllable biases in language generation.Findings of the EMNLP, 2020

  70. [72]

    L. Shi, C. Ma, W. Liang, X. Diao, and W. Ma. Judging the judges: A systematic study of position bias in LLM-as-a-Judge.arXiv preprint arXiv:2406.07791, 2024

  71. [73]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

  72. [74]

    Terminal-bench: A benchmark for ai agents in terminal environments.https://www.tbench.ai/, 2025

    Stanford NLP Group and Laude Institute. Terminal-bench: A benchmark for ai agents in terminal environments.https://www.tbench.ai/, 2025. Accessed 2026-05-07

  73. [75]

    V . K. Suresh. Two-faced social agents: Context collapse in role-conditioned large language models.arXiv preprint arXiv:2511.15573, 2025

  74. [76]

    UL2: Unifying language learning paradigms

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Huaijin Zheng, et al. UL2: Unifying language learning paradigms. InICLR, 2023. 13

  75. [77]

    Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.Anthropic technical report, 2024

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Adam Jermyn, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.Anthropic technical report, 2024

  76. [78]

    Memorization without overfitting: Analyzing the training dynamics of large language models

    Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. InNeurIPS, 2022

  77. [79]

    Persistent instability in LLM’s personality measurements: Effects of scale, reasoning, and conversation history

    Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, et al. Persistent instability in LLM’s personality measurements: Effects of scale, reasoning, and conversation history. arXiv preprint arXiv:2508.04826, 2025

  78. [80]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. InarXiv preprint arXiv:2308.10248, 2023

  79. [81]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023

  80. [82]

    Role-play with large language models.Nature, 2024

    Yuxin Wang, Akari Mishra, et al. Role-play with large language models.Nature, 2024

Showing first 80 references.