Trace: Unmasking AI Attack Agents Through Terminal Behavior Fingerprinting
Pith reviewed 2026-05-09 15:14 UTC · model grok-4.3
The pith
Terminal command sequences can identify the AI model family behind autonomous attack agents and guide targeted extraction of their system prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trace shows that different frontier model families produce distinguishable patterns in the terminal command sequences they generate during autonomous attacks. By using these patterns for attribution, the framework then crafts and deploys defensive prompt injection payloads that recover system prompts from the attacker agent at rates substantially higher than blind deployment, including full recovery in some black-box cases.
What carries the argument
Fingerprinting of model families from terminal command sequence patterns, which then directs model-specific defensive prompt injection to extract system prompts.
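To make the attribution step concrete, here is a minimal sketch of command-sequence fingerprinting. The TF-IDF-over-n-grams features and linear classifier are illustrative assumptions; the paper's actual feature pipeline and classifier are not specified on this page.

```python
# Minimal sketch: fingerprint model families from terminal command sequences.
# TF-IDF over shell-token n-grams and logistic regression are assumed
# stand-ins, not the paper's documented pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One string per attack session: the concatenated command sequence.
sessions = [
    "nmap -sV 10.0.0.5 && cat /etc/passwd",
    "ls -la; find / -perm -4000 2>/dev/null",
    "nmap -p- 10.0.0.5; id",
    "find / -name 'flag*' 2>/dev/null; sudo -l",
]
families = ["family_a", "family_b", "family_a", "family_b"]

# Unigrams and bigrams of shell tokens capture command/flag habits;
# the token pattern splits on whitespace and shell separators.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"[^\s;|&]+"),
    LogisticRegression(max_iter=1000),
)
clf.fit(sessions, families)
print(clf.predict(["nmap -sV 10.0.0.9 && id"]))  # -> predicted family
```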
If this is right
- Model family fingerprinting reaches a macro F1 score of 0.981 overall and 0.815 when tested on unseen scaffolds.
- Guided defensive prompt injection recovers system prompts from 81.9 percent of non-Claude sessions on average (up to 98.3 percent), at 0.736 Sentence-BERT fidelity, 1.88 times that of blind deployment.
- The same attribution works in black-box proprietary scaffolds, correctly identifying Gemini and Claude Opus families at 78 percent average accuracy and fully exposing the Gemini system prompt.
- The method supplies a concrete first step for forensic analysis of AI-driven attack agents in compromised networks.
Where Pith is reading between the lines
- Defenders could maintain libraries of observed command-sequence signatures for rapid model identification during live incidents.
- Observed differences in how model families respond to the same injection payload suggest that security tools could exploit family-specific prompt-handling traits more broadly.
- The technique might extend to other command-line or API interfaces used by AI agents, or to mixed human-plus-AI attack sessions.
Load-bearing premise
Command sequences produced by different AI model families stay sufficiently distinctive and stable across scaffolds, environments, and black-box conditions.
What would settle it
If command sequences from two different model families become statistically indistinguishable when run through identical scaffolds in controlled tests, the attribution step would lose reliability.
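One hedged way to run that controlled test is a classifier two-sample check: if a cross-validated classifier cannot separate two families' sessions from the same scaffold at above-chance accuracy, the sequences are statistically indistinguishable for attribution purposes. The sketch below is not the paper's protocol and assumes enough labeled sessions per family for cross-validation.

```python
# Hypothetical classifier two-sample test for the falsification condition
# above: near-chance cross-validated accuracy on separating two families'
# sessions (identical scaffold) would mean the fingerprint signal is gone.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def families_distinguishable(sessions_a, sessions_b, margin=0.10):
    """Return mean CV accuracy and whether it clears chance by `margin`."""
    X = list(sessions_a) + list(sessions_b)
    y = np.array([0] * len(sessions_a) + [1] * len(sessions_b))
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    acc = cross_val_score(clf, X, y, cv=5).mean()
    return acc, acc > 0.5 + margin  # accuracy near 0.5 => indistinguishable
```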
Original abstract
AI-driven penetration testing agents are now capable of autonomously executing attacks within compromised networks. Identifying the model family that controls the active sessions of such agents provides valuable information towards understanding the intent of the attack and further developing attack countermeasures. In this paper, we introduce Trace, a novel multi-stage attribution and forensic framework for AI attack agents using terminal command sequences. Once Trace identifies a model family for the attacker agents, it guides a defensive prompt injection (DPI) strategy to the attacker model via a crafted payload. This is with the aim to exfiltrate system prompts from an attacker model, thus, revealing valuable information to understand the attacker intent and facilitate further forensic investigation. We have implemented our approach revolving around a Linux capture-the-flag (CTF) box. The attacker agents are bolstered via three distinct scaffolds and seven frontier model families. Our evaluation reveals that Trace achieves a macro F1 score of 0.981 in accurately fingerprinting the attacker model family (0.815 when generalizing to unseen scaffolds). Besides, the fingerprinting guides the DPI via a crafted payload to certain model families, resulting in system prompt extraction from 81.9% of non-Claude sessions on average (up to 98.3%) at 0.736 Sentence-BERT fidelity -- 1.88x higher than blind deployment. Finally, to validate the robustness of Trace, we evaluate it with a blackbox and proprietary scaffold employing multiple model families (Gemini and Claude Opus). Our evaluation identified the model family with an average 78% accuracy. Moreover, for the Gemini model family, the DPI employed by Trace revealed the entire system prompt and this has been confirmed by the developers. Trace therefore provides a fundamental first step towards attacker agent forensics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Trace, a multi-stage attribution and forensic framework that fingerprints AI attack agents to their model families using terminal command sequences observed in a Linux CTF environment. Once the model family is identified, Trace selects and deploys a tailored defensive prompt injection (DPI) payload to exfiltrate the attacker's system prompt. Evaluation across seven frontier model families, three scaffolds, and a black-box proprietary scaffold reports macro F1 of 0.981 for family fingerprinting (dropping to 0.815 on unseen scaffolds), 81.9% average prompt extraction (up to 98.3%) at 0.736 Sentence-BERT similarity (1.88x over blind), and 78% family identification accuracy in black-box settings, with full Gemini prompt recovery confirmed by developers.
Significance. If the terminal sequences prove to be stable, model-family-intrinsic signals rather than scaffold artifacts, Trace offers a practical first step toward forensic attribution and active information extraction from autonomous AI agents. The black-box validation and external developer confirmation are notable strengths that increase the work's credibility and potential impact in AI security and penetration-testing research.
Major comments (3)
- [Evaluation (unseen scaffolds and black-box)] The 0.166 F1 drop from 0.981 to 0.815 on unseen scaffolds, together with the 78% black-box accuracy, suggests the classifier may be capturing scaffold-specific command styles or error-handling patterns rather than stable model-family traits. Without feature-importance analysis or ablation on command subsequences, it is unclear whether the central fingerprinting claim generalizes to novel or proprietary scaffolds, which directly affects the reliability of the downstream DPI-guided extraction rates.
- [Methods / Experimental Setup] The abstract and results report strong quantitative claims (F1 0.981, 81.9% extraction) but give no description of the feature engineering, classifier architecture, train/test splits, cross-validation, or controls for scaffold-specific artifacts. Without these details it is impossible to assess whether the performance is reproducible or load-bearing for the end-to-end forensic pipeline.
- [DPI results] The reported 81.9% extraction success (and 1.88x lift) is conditioned on correct family fingerprinting. Given the scaffold-generalization gap, the manuscript should quantify how often mis-attribution leads to ineffective DPI payloads and whether the extraction benefit holds when the scaffold is unknown a priori.
Minor comments (2)
- [Abstract] The phrase 'non-Claude sessions' is undefined and should be clarified in the main text.
- [Evaluation] Notation: 'Sentence-BERT fidelity' is used without a reference or precise definition of the similarity metric.
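One plausible reading of the flagged metric, assuming "Sentence-BERT fidelity" means cosine similarity between SBERT embeddings of the extracted and ground-truth system prompts; the checkpoint name below is an assumption, not taken from the paper.

```python
# Plausible definition of "Sentence-BERT fidelity": cosine similarity
# between SBERT embeddings of extracted vs. ground-truth system prompts.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint


def prompt_fidelity(extracted: str, ground_truth: str) -> float:
    emb = sbert.encode([extracted, ground_truth], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Under this reading, the abstract's 0.736 average fidelity would be the
# mean of prompt_fidelity over successful extraction sessions.
```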
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing that additional methodological details and analyses will improve the manuscript. We will incorporate all suggested revisions in the next version.
Point-by-point responses
- Referee: The 0.166 F1 drop from 0.981 to 0.815 on unseen scaffolds, together with the 78% black-box accuracy, suggests the classifier may be capturing scaffold-specific command styles or error-handling patterns rather than stable model-family traits. Without feature-importance analysis or ablation on command subsequences, it is unclear whether the central fingerprinting claim generalizes to novel or proprietary scaffolds, which directly affects the reliability of the downstream DPI-guided extraction rates.
Authors: We acknowledge the performance drop on unseen scaffolds and agree that this warrants further investigation to distinguish model-family signals from scaffold artifacts. The 78% black-box accuracy on a proprietary scaffold and the external developer confirmation of full Gemini prompt recovery provide supporting evidence that family-level traits persist beyond the training scaffolds. In the revision, we will add feature-importance analysis (via permutation importance and SHAP values on command n-grams) and subsequence ablation studies to identify stable, model-intrinsic patterns. We will also expand the discussion of generalization limits and their implications for DPI reliability. revision: yes
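A minimal sketch of the promised permutation-importance analysis, using toy sessions and a generic TF-IDF pipeline (both illustrative); mapping high-importance n-grams back to vocabulary entries would show whether the attribution signal is a scaffold idiom or a model-family trait.

```python
# Sketch: permutation importance over command n-grams. Sessions and labels
# are placeholders, not the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

sessions = [
    "nmap -sV host && id", "find / -perm -4000", "nmap -p- host; id",
    "ls -la /root", "sudo -l; cat /etc/shadow", "find / -name 'flag*'",
]
families = ["a", "b", "a", "b", "a", "b"]

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(sessions).toarray()  # dense for permutation_importance
clf = LogisticRegression(max_iter=1000).fit(X, families)

# Permuting one feature column at a time measures how much the family
# classifier relies on each command n-gram.
imp = permutation_importance(clf, X, families, n_repeats=10, random_state=0)
names = vec.get_feature_names_out()
for i in imp.importances_mean.argsort()[::-1][:5]:
    print(names[i], round(float(imp.importances_mean[i]), 3))
```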
- Referee: The abstract and results report strong quantitative claims (F1 0.981, 81.9% extraction) but give no description of the feature engineering, classifier architecture, train/test splits, cross-validation, or controls for scaffold-specific artifacts. Without these details it is impossible to assess whether the performance is reproducible or load-bearing for the end-to-end forensic pipeline.
Authors: We apologize for the omission of these details in the original submission. The revised manuscript will include a substantially expanded Methods section that fully describes the feature engineering pipeline (command tokenization, n-gram extraction, TF-IDF or embedding vectorization), the classifier architecture and hyperparameters, the train/test split methodology (including scaffold-aware partitioning), cross-validation procedures, and any explicit controls for scaffold-specific artifacts. These additions will enable reproducibility assessment and clarify the pipeline's robustness. revision: yes
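Scaffold-aware partitioning can be sketched with scikit-learn's GroupKFold, treating scaffold identity as the group label so every test fold contains only scaffolds never seen in training; the data below is placeholder.

```python
# Scaffold-aware splitting: GroupKFold with scaffold identity as the group,
# so each test fold holds sessions from scaffolds absent from training.
from sklearn.model_selection import GroupKFold

sessions  = ["seq 1", "seq 2", "seq 3", "seq 4", "seq 5", "seq 6"]
families  = ["a", "b", "a", "b", "a", "b"]
scaffolds = ["s1", "s1", "s2", "s2", "s3", "s3"]

for train_idx, test_idx in GroupKFold(n_splits=3).split(
        sessions, families, groups=scaffolds):
    # No scaffold appears on both sides of the split, so reported scores
    # reflect generalization rather than scaffold-specific idioms.
    print("train:", train_idx, "test (held-out scaffold):", test_idx)
```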
- Referee: The reported 81.9% extraction success (and 1.88x lift) is conditioned on correct family fingerprinting. Given the scaffold-generalization gap, the manuscript should quantify how often mis-attribution leads to ineffective DPI payloads and whether the extraction benefit holds when the scaffold is unknown a priori.
Authors: We agree that end-to-end evaluation accounting for attribution errors is necessary. In the revision, we will add a dedicated analysis quantifying DPI success rates when using the predicted family (including mis-attribution cases) versus ground-truth family labels. We will also report full-pipeline extraction performance in a blind setting where both model family and scaffold are unknown a priori, comparing against blind baselines to demonstrate the net benefit of the Trace-guided approach even under realistic attribution uncertainty. revision: yes
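The requested end-to-end analysis reduces to conditioning extraction fidelity on attribution correctness. A toy sketch with hypothetical session records:

```python
# Toy sketch: extraction fidelity conditioned on attribution correctness,
# plus an oracle using true-family payloads. All values are hypothetical.
from statistics import mean

records = [
    {"true": "a", "pred": "a", "fid_pred": 0.81, "fid_true": 0.81},
    {"true": "b", "pred": "b", "fid_pred": 0.78, "fid_true": 0.78},
    {"true": "b", "pred": "a", "fid_pred": 0.12, "fid_true": 0.74},
]

hit  = [r["fid_pred"] for r in records if r["pred"] == r["true"]]
miss = [r["fid_pred"] for r in records if r["pred"] != r["true"]]
print("fidelity | correct attribution:", mean(hit))
print("fidelity | mis-attribution:   ", mean(miss))
print("oracle (true-family payloads):", mean(r["fid_true"] for r in records))
```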
Circularity Check
No circularity: purely empirical framework and evaluation
Full rationale
The paper describes an implemented multi-stage forensic framework (Trace) for fingerprinting AI agents via terminal command sequences, followed by DPI-guided prompt extraction. All key results (macro F1 of 0.981/0.815, 81.9% extraction rate, 78% black-box accuracy) are reported as direct experimental measurements on a Linux CTF setup across seven model families and three scaffolds. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The derivation chain consists solely of observable performance metrics on held-out and black-box data, rendering the work self-contained with no reduction of outputs to inputs by construction.