Trace: Unmasking AI Attack Agents Through Terminal Behavior Fingerprinting
Pith reviewed 2026-05-09 15:14 UTC · model grok-4.3
The pith
Terminal command sequences can identify the AI model family behind autonomous attack agents and guide targeted extraction of their system prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trace shows that different frontier model families produce distinguishable patterns in the terminal command sequences they generate during autonomous attacks. By using these patterns for attribution, the framework then crafts and deploys defensive prompt injection payloads that recover system prompts from the attacker agent at rates substantially higher than blind deployment, including full recovery in some black-box cases.
What carries the argument
Fingerprinting of model families from terminal command sequence patterns, which then directs model-specific defensive prompt injection to extract system prompts.
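To make the attribution step concrete, here is a minimal sketch of command-sequence fingerprinting. The TF-IDF-over-n-grams features and linear classifier are illustrative assumptions; the paper's actual feature pipeline and classifier are not specified on this page.

```python
# Minimal sketch: fingerprint model families from terminal command sequences.
# TF-IDF over shell-token n-grams and logistic regression are assumed
# stand-ins, not the paper's documented pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One string per attack session: the concatenated command sequence.
sessions = [
    "nmap -sV 10.0.0.5 && cat /etc/passwd",
    "ls -la; find / -perm -4000 2>/dev/null",
    "nmap -p- 10.0.0.5; id",
    "find / -name 'flag*' 2>/dev/null; sudo -l",
]
families = ["family_a", "family_b", "family_a", "family_b"]

# Unigrams and bigrams of shell tokens capture command/flag habits;
# the token pattern splits on whitespace and shell separators.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"[^\s;|&]+"),
    LogisticRegression(max_iter=1000),
)
clf.fit(sessions, families)
print(clf.predict(["nmap -sV 10.0.0.9 && id"]))  # -> predicted family
```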
If this is right
- Model family fingerprinting reaches a macro F1 score of 0.981 overall and 0.815 when tested on unseen scaffolds.
- Guided defensive prompt injection recovers system prompts from 81.9 percent of non-Claude sessions on average (up to 98.3 percent), at 0.736 Sentence-BERT fidelity, 1.88 times that of blind deployment.
- The same attribution works in black-box proprietary scaffolds, correctly identifying Gemini and Claude Opus families at 78 percent average accuracy and fully exposing the Gemini system prompt.
- The method supplies a concrete first step for forensic analysis of AI-driven attack agents in compromised networks.
Where Pith is reading between the lines
- Defenders could maintain libraries of observed command-sequence signatures for rapid model identification during live incidents.
- Observed differences in how model families respond to the same injection payload suggest that security tools could exploit family-specific prompt-handling traits more broadly.
- The technique might extend to other command-line or API interfaces used by AI agents, or to mixed human-plus-AI attack sessions.
Load-bearing premise
Command sequences produced by different AI model families stay sufficiently distinctive and stable across scaffolds, environments, and black-box conditions.
What would settle it
If command sequences from two different model families become statistically indistinguishable when run through identical scaffolds in controlled tests, the attribution step would lose reliability.
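One hedged way to run that controlled test is a classifier two-sample check: if a cross-validated classifier cannot separate two families' sessions from the same scaffold at above-chance accuracy, the sequences are statistically indistinguishable for attribution purposes. The sketch below is not the paper's protocol and assumes enough labeled sessions per family for cross-validation.

```python
# Hypothetical classifier two-sample test for the falsification condition
# above: near-chance cross-validated accuracy on separating two families'
# sessions (identical scaffold) would mean the fingerprint signal is gone.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def families_distinguishable(sessions_a, sessions_b, margin=0.10):
    """Return mean CV accuracy and whether it clears chance by `margin`."""
    X = list(sessions_a) + list(sessions_b)
    y = np.array([0] * len(sessions_a) + [1] * len(sessions_b))
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    acc = cross_val_score(clf, X, y, cv=5).mean()
    return acc, acc > 0.5 + margin  # accuracy near 0.5 => indistinguishable
```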
Original abstract
AI-driven penetration testing agents are now capable of autonomously executing attacks within compromised networks. Identifying the model family that controls the active sessions of such agents provides valuable information towards understanding the intent of the attack and further developing attack countermeasures. In this paper, we introduce Trace, a novel multi-stage attribution and forensic framework for AI attack agents using terminal command sequences. Once Trace identifies a model family for the attacker agents, it guides a defensive prompt injection (DPI) strategy to the attacker model via a crafted payload. This is with the aim to exfiltrate system prompts from an attacker model, thus, revealing valuable information to understand the attacker intent and facilitate further forensic investigation. We have implemented our approach revolving around a Linux capture-the-flag (CTF) box. The attacker agents are bolstered via three distinct scaffolds and seven frontier model families. Our evaluation reveals that Trace achieves a macro F1 score of 0.981 in accurately fingerprinting the attacker model family (0.815 when generalizing to unseen scaffolds). Besides, the fingerprinting guides the DPI via a crafted payload to certain model families, resulting in system prompt extraction from 81.9% of non-Claude sessions on average (up to 98.3%) at 0.736 Sentence-BERT fidelity -- 1.88x higher than blind deployment. Finally, to validate the robustness of Trace, we evaluate it with a blackbox and proprietary scaffold employing multiple model families (Gemini and Claude Opus). Our evaluation identified the model family with an average 78% accuracy. Moreover, for the Gemini model family, the DPI employed by Trace revealed the entire system prompt and this has been confirmed by the developers. Trace therefore provides a fundamental first step towards attacker agent forensics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Trace, a multi-stage attribution and forensic framework that fingerprints AI attack agents to their model families using terminal command sequences observed in a Linux CTF environment. Once the model family is identified, Trace selects and deploys a tailored defensive prompt injection (DPI) payload to exfiltrate the attacker's system prompt. Evaluation across seven frontier model families, three scaffolds, and a black-box proprietary scaffold reports macro F1 of 0.981 for family fingerprinting (dropping to 0.815 on unseen scaffolds), 81.9% average prompt extraction (up to 98.3%) at 0.736 Sentence-BERT similarity (1.88x over blind), and 78% family identification accuracy in black-box settings, with full Gemini prompt recovery confirmed by developers.
Significance. If the terminal sequences prove to be stable, model-family-intrinsic signals rather than scaffold artifacts, Trace offers a practical first step toward forensic attribution and active information extraction from autonomous AI agents. The black-box validation and external developer confirmation are notable strengths that increase the work's credibility and potential impact in AI security and penetration-testing research.
Major comments (3)
- [Evaluation (unseen scaffolds and black-box)] The 0.166 F1 drop from 0.981 to 0.815 on unseen scaffolds, together with the 78% black-box accuracy, suggests the classifier may be capturing scaffold-specific command styles or error-handling patterns rather than stable model-family traits. Without feature-importance analysis or ablation on command subsequences, it is unclear whether the central fingerprinting claim generalizes to novel or proprietary scaffolds, which directly affects the reliability of the downstream DPI-guided extraction rates.
- [Methods / Experimental Setup] The abstract and results report strong quantitative claims (F1 0.981, 81.9% extraction) but give no description of the feature engineering, classifier architecture, train/test splits, cross-validation, or controls for scaffold-specific artifacts. Without these details it is impossible to assess whether the performance is reproducible or load-bearing for the end-to-end forensic pipeline.
- [DPI results] The reported 81.9% extraction success (and 1.88x lift) is conditioned on correct family fingerprinting. Given the scaffold-generalization gap, the manuscript should quantify how often mis-attribution leads to ineffective DPI payloads and whether the extraction benefit holds when the scaffold is unknown a priori.
Minor comments (2)
- [Abstract] The phrase 'non-Claude sessions' is undefined and should be clarified in the main text.
- [Evaluation] Notation: 'Sentence-BERT fidelity' is used without a reference or precise definition of the similarity metric.
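One plausible reading of the flagged metric, assuming "Sentence-BERT fidelity" means cosine similarity between SBERT embeddings of the extracted and ground-truth system prompts; the checkpoint name below is an assumption, not taken from the paper.

```python
# Plausible definition of "Sentence-BERT fidelity": cosine similarity
# between SBERT embeddings of extracted vs. ground-truth system prompts.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint


def prompt_fidelity(extracted: str, ground_truth: str) -> float:
    emb = sbert.encode([extracted, ground_truth], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Under this reading, the abstract's 0.736 average fidelity would be the
# mean of prompt_fidelity over successful extraction sessions.
```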
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing that additional methodological details and analyses will improve the manuscript. We will incorporate all suggested revisions in the next version.
Point-by-point responses
- Referee: The 0.166 F1 drop from 0.981 to 0.815 on unseen scaffolds, together with the 78% black-box accuracy, suggests the classifier may be capturing scaffold-specific command styles or error-handling patterns rather than stable model-family traits. Without feature-importance analysis or ablation on command subsequences, it is unclear whether the central fingerprinting claim generalizes to novel or proprietary scaffolds, which directly affects the reliability of the downstream DPI-guided extraction rates.
Authors: We acknowledge the performance drop on unseen scaffolds and agree that this warrants further investigation to distinguish model-family signals from scaffold artifacts. The 78% black-box accuracy on a proprietary scaffold and the external developer confirmation of full Gemini prompt recovery provide supporting evidence that family-level traits persist beyond the training scaffolds. In the revision, we will add feature-importance analysis (via permutation importance and SHAP values on command n-grams) and subsequence ablation studies to identify stable, model-intrinsic patterns. We will also expand the discussion of generalization limits and their implications for DPI reliability. revision: yes
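A minimal sketch of the promised permutation-importance analysis, using toy sessions and a generic TF-IDF pipeline (both illustrative); mapping high-importance n-grams back to vocabulary entries would show whether the attribution signal is a scaffold idiom or a model-family trait.

```python
# Sketch: permutation importance over command n-grams. Sessions and labels
# are placeholders, not the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

sessions = [
    "nmap -sV host && id", "find / -perm -4000", "nmap -p- host; id",
    "ls -la /root", "sudo -l; cat /etc/shadow", "find / -name 'flag*'",
]
families = ["a", "b", "a", "b", "a", "b"]

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(sessions).toarray()  # dense for permutation_importance
clf = LogisticRegression(max_iter=1000).fit(X, families)

# Permuting one feature column at a time measures how much the family
# classifier relies on each command n-gram.
imp = permutation_importance(clf, X, families, n_repeats=10, random_state=0)
names = vec.get_feature_names_out()
for i in imp.importances_mean.argsort()[::-1][:5]:
    print(names[i], round(float(imp.importances_mean[i]), 3))
```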
- Referee: The abstract and results report strong quantitative claims (F1 0.981, 81.9% extraction) but give no description of the feature engineering, classifier architecture, train/test splits, cross-validation, or controls for scaffold-specific artifacts. Without these details it is impossible to assess whether the performance is reproducible or load-bearing for the end-to-end forensic pipeline.
Authors: We apologize for the omission of these details in the original submission. The revised manuscript will include a substantially expanded Methods section that fully describes the feature engineering pipeline (command tokenization, n-gram extraction, TF-IDF or embedding vectorization), the classifier architecture and hyperparameters, the train/test split methodology (including scaffold-aware partitioning), cross-validation procedures, and any explicit controls for scaffold-specific artifacts. These additions will enable reproducibility assessment and clarify the pipeline's robustness. revision: yes
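Scaffold-aware partitioning can be sketched with scikit-learn's GroupKFold, treating scaffold identity as the group label so every test fold contains only scaffolds never seen in training; the data below is placeholder.

```python
# Scaffold-aware splitting: GroupKFold with scaffold identity as the group,
# so each test fold holds sessions from scaffolds absent from training.
from sklearn.model_selection import GroupKFold

sessions  = ["seq 1", "seq 2", "seq 3", "seq 4", "seq 5", "seq 6"]
families  = ["a", "b", "a", "b", "a", "b"]
scaffolds = ["s1", "s1", "s2", "s2", "s3", "s3"]

for train_idx, test_idx in GroupKFold(n_splits=3).split(
        sessions, families, groups=scaffolds):
    # No scaffold appears on both sides of the split, so reported scores
    # reflect generalization rather than scaffold-specific idioms.
    print("train:", train_idx, "test (held-out scaffold):", test_idx)
```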
- Referee: The reported 81.9% extraction success (and 1.88x lift) is conditioned on correct family fingerprinting. Given the scaffold-generalization gap, the manuscript should quantify how often mis-attribution leads to ineffective DPI payloads and whether the extraction benefit holds when the scaffold is unknown a priori.
Authors: We agree that end-to-end evaluation accounting for attribution errors is necessary. In the revision, we will add a dedicated analysis quantifying DPI success rates when using the predicted family (including mis-attribution cases) versus ground-truth family labels. We will also report full-pipeline extraction performance in a blind setting where both model family and scaffold are unknown a priori, comparing against blind baselines to demonstrate the net benefit of the Trace-guided approach even under realistic attribution uncertainty. revision: yes
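The requested end-to-end analysis reduces to conditioning extraction fidelity on attribution correctness. A toy sketch with hypothetical session records:

```python
# Toy sketch: extraction fidelity conditioned on attribution correctness,
# plus an oracle using true-family payloads. All values are hypothetical.
from statistics import mean

records = [
    {"true": "a", "pred": "a", "fid_pred": 0.81, "fid_true": 0.81},
    {"true": "b", "pred": "b", "fid_pred": 0.78, "fid_true": 0.78},
    {"true": "b", "pred": "a", "fid_pred": 0.12, "fid_true": 0.74},
]

hit  = [r["fid_pred"] for r in records if r["pred"] == r["true"]]
miss = [r["fid_pred"] for r in records if r["pred"] != r["true"]]
print("fidelity | correct attribution:", mean(hit))
print("fidelity | mis-attribution:   ", mean(miss))
print("oracle (true-family payloads):", mean(r["fid_true"] for r in records))
```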
Circularity Check
No circularity: purely empirical framework and evaluation
Full rationale
The paper describes an implemented multi-stage forensic framework (Trace) for fingerprinting AI agents via terminal command sequences, followed by DPI-guided prompt extraction. All key results (macro F1 of 0.981/0.815, 81.9% extraction rate, 78% black-box accuracy) are reported as direct experimental measurements on a Linux CTF setup across seven model families and three scaffolds. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The derivation chain consists solely of observable performance metrics on held-out and black-box data, rendering the work self-contained with no reduction of outputs to inputs by construction.