pith. machine review for the scientific record.

arxiv: 2604.21131 · v1 · submitted 2026-04-22 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

Recognition: unknown

Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:40 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.LG
keywords cross-session threats · AI agents · security benchmark · coreset memory reader · guardrails · kill-chain · information bottleneck · recall preservation

The pith

AI-agent guardrails miss attacks spread across sessions because they judge messages in isolation, but a bounded coreset memory reader at K=50 retains recall on both dilution and cross-session shards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI-agent guardrails treat each session independently, so an adversary who distributes attack steps across many sessions can evade detection even when the aggregate payload would violate policy. The paper introduces CSTM-Bench, a dataset of 26 executable attack taxonomies tied to seven identity anchors and released as two shards: dilution (compositional) and cross_session, the latter produced by a closed-loop rewriter. Evaluation on 54 scenarios per shard shows that both session-bound judges and full-log correlators lose roughly half their attack recall when moving to the cross_session shard. A bounded-memory Coreset Memory Reader that retains only the 50 highest-signal fragments is the sole reader whose recall holds steady across both shards. The work also defines the composite CSTM metric, which fuses detection F1 with a new CSR_prefix measure of ordered prefix stability to account for KV-cache efficiency.
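The paper's reference admitter (its §C, driven by text-embedding-3-small per Figure 8) is not reproduced here. Below is a minimal sketch of the bounded-coreset idea, assuming an external ranker supplies a scalar signal score per fragment; the class and method names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Fragment:
    score: float                      # signal score from an assumed external ranker
    text: str = field(compare=False)  # fragment content, excluded from ordering

class CoresetMemoryReader:
    """Bounded-memory sketch: keep only the K highest-signal fragments
    seen so far (the paper fixes K=50)."""

    def __init__(self, k: int = 50):
        self.k = k
        self._heap: list[Fragment] = []  # min-heap; root is the weakest kept fragment

    def observe(self, text: str, score: float) -> None:
        frag = Fragment(score, text)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, frag)
        elif score > self._heap[0].score:
            heapq.heapreplace(self._heap, frag)  # evict the lowest-signal fragment

    def coreset(self) -> list[Fragment]:
        """Fragments ordered strongest-first, as a downstream correlator
        prompt would consume them."""
        return sorted(self._heap, key=lambda f: -f.score)
```

The actual admitter must also manage merge policy and ordering stability (the CSR_prefix concern below); the sketch ignores both.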

Core claim

Cross-session threats evade per-session AI guardrails because only the aggregate across sessions carries the policy violation. Framing the problem as an information bottleneck to a downstream correlator, the paper shows that a Coreset Memory Reader keeping the K=50 highest-signal fragments is the only reader whose recall survives both the dilution and cross_session shards of CSTM-Bench, while session-bound judges and full-log readers each lose approximately half their recall. The result holds under fixed evaluation conditions with one correlator family and no prompt optimisation.

What carries the argument

The Coreset Memory Reader that retains only the highest-signal fragments at a fixed bound of K=50, combined with the CSR_prefix metric for ordered prefix stability and the fused CSTM score that weights detection F1 against serving stability.
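The fused score itself is simple arithmetic once the components are measured. A minimal rendering of the abstract's formula, treating CSDA@action as the recall-style input to F1 (an assumption about its role; the metric's full definition lives in the paper):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def cstm(csda_at_action: float, precision: float, csr_prefix: float) -> float:
    """CSTM = 0.7 * F1(CSDA@action, precision) + 0.3 * CSR_prefix,
    with the fixed 0.7/0.3 weights stated in the abstract."""
    return 0.7 * f1(precision, csda_at_action) + 0.3 * csr_prefix
```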

If this is right

  • Guardrails for AI agents can achieve cross-session robustness without requiring unbounded context or full history retention.
  • Detection systems must be evaluated on isolation-invisible cross-session scenarios in addition to simpler dilution attacks.
  • Any practical reader must balance recall against prefix stability because ranker reshuffles break KV-cache reuse (see the CSR_prefix sketch after this list).
  • Benchmarks should include matched benign-pristine and benign-hard confounders to measure false-positive behavior under realistic conditions.
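The paper promotes CSR_prefix as an LLM-free measure of ordered prefix stability, but its exact formula is not quoted above. The sketch below assumes one natural reading: the fraction of the previous scan's ordered coreset that survives as an unchanged prefix of the next scan, since a reshuffle at position i invalidates every KV-cache block from i onward.

```python
def csr_prefix(prev_order: list[str], curr_order: list[str]) -> float:
    """Assumed definition (not quoted from the paper): fraction of the
    previous scan's ordered output that is reusable as a KV-cache prefix
    in the current scan."""
    if not prev_order:
        return 1.0  # nothing to preserve on the first scan
    stable = 0
    for prev_frag, curr_frag in zip(prev_order, curr_order):
        if prev_frag != curr_frag:
            break  # first mismatch ends the reusable prefix
        stable += 1
    return stable / len(prev_order)
```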

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Long-running AI agents will likely need some form of persistent, signal-selective memory rather than stateless per-turn checks.
  • The coreset approach could be tested for generalization by applying the same reader design to multi-turn threats in non-agent chat systems.
  • Future datasets should expand beyond the single correlator family used here to check whether the recall preservation holds across providers.

Load-bearing premise

The closed-loop rewriter that generates the cross_session shard preserves essential cross-session artefacts while only softening surface phrasing.

What would settle it

An experiment in which the K=50 coreset reader shows a large recall drop on the cross_session shard comparable to the other readers, or in which the rewriter is shown to remove the cross-session payload.

Figures

Figures reproduced from arXiv: 2604.21131 by Ari Azarafrooz.

Figure 1. Frontier LLM context windows have grown roughly … [caption truncated; image not reproduced]
Figure 2. (a) Session-bound guardrails see each message in isolation and pass all five turns—every … [caption truncated; image not reproduced]
Figure 3. (a) The Coreset Memory Reader processes messages online against a bounded buffer … [caption truncated; image not reproduced]
Figure 4. Seed instructions from the attack catalog feed through the Session Planner (kill-chain … [caption truncated; image not reproduced]
Figure 5. Inflation strategies embed the attack signal (red) in realistic content. Signal ratio (attack … [caption truncated; image not reproduced]
Figure 6. Three views of the released shards (panels (a) and (b) computed on the dilution shard; … [caption truncated; image not reproduced]
Figure 7. KV-cache layout of two consecutive Reader scans. Fixed blocks (system+instructions, … [caption truncated; image not reproduced]
Figure 8. Empirical CSR_prefix trajectories on the dilution shard (54 scenarios, K=50), produced by the reference coreset admitter of §C driven by text-embedding-3-small. Thin coloured lines are per-taxonomy means over attack scenarios (T01…T26); orange and green are the two benign classes. The bold black line is the overall mean across all 54 scenarios; the grey band is the interquartile range. Unlike the carto… [caption truncated; image not reproduced]
Original abstract

AI-agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips past every session-bound detector because only the aggregate carries the payload. We make three contributions to cross-session threat detection. (1) Dataset. CSTM-Bench is 26 executable attack taxonomies classified by kill-chain stage and cross-session operation (accumulate, compose, launder, inject_on_reader), each bound to one of seven identity anchors that ground-truth "violation" as a policy predicate, plus matched Benign-pristine and Benign-hard confounders. Released on Hugging Face as intrinsec-ai/cstm-bench with two 54-scenario splits: dilution (compositional) and cross_session (12 isolation-invisible scenarios produced by a closed-loop rewriter that softens surface phrasing while preserving cross-session artefacts). (2) Measurement. Framing cross-session detection as an information bottleneck to a downstream correlator LLM, we find that a session-bound judge and a Full-Log Correlator concatenating every prompt into one long-context call both lose roughly half their attack recall moving from dilution to cross_session, well inside any frontier context window. Scope: 54 scenarios per shard, one correlator family (Anthropic Claude), no prompt optimisation; we release it to motivate larger, multi-provider datasets. (3) Algorithm and metric. A bounded-memory Coreset Memory Reader retaining highest-signal fragments at $K=50$ is the only reader whose recall survives both shards. Because ranker reshuffles break KV-cache prefix reuse, we promote $\mathrm{CSR\_prefix}$ (ordered prefix stability, LLM-free) to a first-class metric and fuse it with detection into $\mathrm{CSTM} = 0.7 F_1(\mathrm{CSDA@action}, \mathrm{precision}) + 0.3 \mathrm{CSR\_prefix}$, benchmarking rankers on a single Pareto of recall versus serving stability.
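The repo id (intrinsec-ai/cstm-bench) and the two shard names come straight from the abstract; whether the shards are exposed as top-level splits, and the per-scenario schema, are assumptions to verify against the dataset card. A loading sketch:

```python
from datasets import load_dataset  # pip install datasets

# Repo id and shard names are from the abstract; exposing the shards as
# top-level splits is an assumption, as are the field names below.
bench = load_dataset("intrinsec-ai/cstm-bench")
print(bench)  # inspect the actual split names and schema first

for scenario in bench["cross_session"]:
    # Hypothetical schema: per-session transcripts plus the identity-anchor
    # and taxonomy labels that ground-truth the policy violation.
    sessions = scenario.get("sessions", [])
```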

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CSTM-Bench, a dataset of 26 executable attack taxonomies for cross-session threats in AI agents, grounded in kill-chain stages and identity anchors, with matched benign confounders. It releases two 54-scenario shards (dilution and cross_session) on Hugging Face. It evaluates session-bound judges and full-log correlators on these shards using one LLM family, finding substantial recall loss on the cross_session shard. It proposes a bounded-memory Coreset Memory Reader (K=50) as the only reader whose recall survives both shards and defines the CSTM metric as a weighted fusion of detection F1 and CSR_prefix stability to benchmark rankers.

Significance. If the cross_session shard validly captures isolation-invisible threats, the work identifies a practical gap in memoryless AI-agent guardrails and offers a bounded-memory alternative plus a stability-aware metric. The public release of the benchmark and code for reproducible evaluation is a clear strength that enables follow-on work.

major comments (2)
  1. [Abstract / cross_session shard] Abstract and § on cross_session shard construction: the central empirical claim that the Coreset Memory Reader (K=50) is the only reader whose recall survives both shards rests on the cross_session shard being a valid test of aggregate, isolation-invisible threats. The shard is generated by a closed-loop rewriter that 'softens surface phrasing while preserving cross-session artefacts,' yet the manuscript provides no independent verification (human audit, dependency-graph check, or single-session solvability test) that the multi-session payload remains intact and undetectable within any individual session. If rewriting inadvertently introduces single-session cues or dilutes kill-chain dependencies, the observed recall gap is not evidence of cross-session robustness.
  2. [CSTM metric] CSTM metric definition: the fusion weights in CSTM = 0.7 F_1(CSDA@action, precision) + 0.3 CSR_prefix are presented as fixed values with no ablation study, sensitivity analysis, or derivation justifying the 0.7/0.3 split. Because these weights directly determine the Pareto ranking of rankers, the benchmarking results are sensitive to an arbitrary choice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our benchmark construction and metric design, and we commit to targeted revisions that strengthen the empirical claims without altering the core contributions.

point-by-point responses
  1. Referee: [Abstract / cross_session shard] Abstract and § on cross_session shard construction: the central empirical claim that the Coreset Memory Reader (K=50) is the only reader whose recall survives both shards rests on the cross_session shard being a valid test of aggregate, isolation-invisible threats. The shard is generated by a closed-loop rewriter that 'softens surface phrasing while preserving cross-session artefacts,' yet the manuscript provides no independent verification (human audit, dependency-graph check, or single-session solvability test) that the multi-session payload remains intact and undetectable within any individual session. If rewriting inadvertently introduces single-session cues or dilutes kill-chain dependencies, the observed recall gap is not evidence of cross-session robustness.

    Authors: We agree that explicit validation of the isolation-invisible property strengthens the central claim. The closed-loop rewriter was engineered to preserve kill-chain dependencies and identity anchors across sessions while ensuring each individual session contains no complete policy violation (by construction, as each session is a partial fragment of the taxonomy). To directly address the concern, we will revise the manuscript to include: (1) a human audit protocol and results on a random sample of 20 cross_session scenarios confirming the absence of detectable single-session cues, and (2) single-session solvability tests showing that session-bound detectors achieve near-zero recall on isolated sessions but recover recall only when the full cross-session context is provided (see the first sketch after these responses). These additions will be placed in the shard construction section and will not change the reported recall numbers. revision: yes

  2. Referee: [CSTM metric] CSTM metric definition: the fusion weights in CSTM = 0.7 F_1(CSDA@action, precision) + 0.3 CSR_prefix are presented as fixed values with no ablation study, sensitivity analysis, or derivation justifying the 0.7/0.3 split. Because these weights directly determine the Pareto ranking of rankers, the benchmarking results are sensitive to an arbitrary choice.

    Authors: The 0.7/0.3 split was chosen to emphasize detection F1 as the primary goal (reflecting the security objective) while allocating non-trivial weight to CSR_prefix to penalize rankers that destabilize serving. We acknowledge the lack of sensitivity analysis in the original submission. In revision we will add an ablation table varying the detection weight from 0.5 to 0.9 in 0.1 increments (see the second sketch after these responses); this will show that the Coreset Memory Reader (K=50) remains the highest-ranked method across the majority of weightings, with only extreme emphasis on stability (below 0.2 on detection) altering the ordering. The revised text will also briefly derive the weights from the dual requirements of threat recall and production serving constraints. revision: yes
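The single-session solvability test promised in response 1 is easy to state mechanically. A sketch under assumed interfaces (a judge is any callable flagging one transcript; neither name comes from the paper):

```python
from typing import Callable

def isolation_invisible(sessions: list[str],
                        judge: Callable[[str], bool]) -> bool:
    """A scenario passes the test iff no individual session, judged alone,
    triggers the session-bound detector, while the concatenated
    cross-session transcript does. Interfaces here are hypothetical."""
    flagged_alone = any(judge(session) for session in sessions)
    flagged_joint = judge("\n\n".join(sessions))
    return (not flagged_alone) and flagged_joint
```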
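Likewise, the weight ablation promised in response 2 is a one-loop sweep over the detection weight. A sketch that reranks readers from measured (recall, precision, CSR_prefix) triples; no numbers from the paper are reproduced, and the function name is ours:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def weight_sweep(readers: dict[str, tuple[float, float, float]],
                 weights=(0.5, 0.6, 0.7, 0.8, 0.9)) -> None:
    """Rerank readers as the detection weight varies; `readers` maps a
    reader name to its measured (recall, precision, csr_prefix) on a shard."""
    for w in weights:
        ranked = sorted(
            readers,
            key=lambda name: w * f1(readers[name][1], readers[name][0])
                             + (1 - w) * readers[name][2],
            reverse=True,
        )
        print(f"w_detect={w:.1f}: " + " > ".join(ranked))
```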

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central contributions are the creation of an empirical benchmark (CSTM-Bench with dilution and cross_session shards generated via a described closed-loop rewriter) and experimental comparisons of readers on it, plus a custom metric CSTM defined explicitly as a weighted sum of F1 and CSR_prefix. No load-bearing step reduces by construction to its inputs: there are no equations where a claimed prediction equals a fitted parameter, no self-definitional loops, no uniqueness theorems imported from self-citations, and no ansatz smuggled via prior work. The statement that the Coreset Memory Reader at K=50 is the only survivor is an empirical observation on the released splits, not a derivation. The rewriter's preservation property is an assumption underlying benchmark validity but is not presented as a derived result that loops back to itself. The work is self-contained as an empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on the representativeness of the 54 scenarios and the rewriter preserving artefacts; one free parameter in metric definition.

free parameters (1)
  • CSTM fusion weights = 0.7 and 0.3
    Arbitrarily selected 0.7 for F1 and 0.3 for CSR_prefix with no sensitivity analysis or derivation provided.
axioms (1)
  • domain assumption: Closed-loop rewriter preserves cross-session artefacts
    Invoked when generating the cross_session shard from dilution scenarios.

pith-pipeline@v0.9.0 · 5676 in / 1350 out tokens · 67434 ms · 2026-05-09T23:40:47.473334+00:00 · methodology



    Surface the anchor and arc-level ground truth so the harness can scoreCSDA@action, precision, and CSRprefix end-to-end. Running the upgraded system against CSTM-Bench is left to future work; Table 15 is the low-hanging-fruit number without any of (1)–(3) enabled. C.5 Note on Production Ranker Evaluation A separately-evaluated production HTTP coreset ranke...