CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Chia-Mu Yu; Po-Han Cheng; Wei-Bin Lee; Ying-Dar Lin; Yu-Sung Wu

arxiv: 2606.19235 · v1 · pith:3M3BJQFRnew · submitted 2026-06-17 · 💻 cs.CR

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Po-Han Cheng , Chia-Mu Yu , Ying-Dar Lin , Yu-Sung Wu , Wei-Bin Lee This is my paper

Pith reviewed 2026-06-26 20:13 UTC · model grok-4.3

classification 💻 cs.CR

keywords indirect prompt injectioncode LLMsTree-sitterCST nodesinference-time defenseprompt sanitizationcode securityadversarial detection

0 comments

The pith

CodeSentinel uses a three-layer sanitizer to detect and neutralize indirect prompt injections in code contexts for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code large language models retrieve external code from repositories, documentation, and issue threads, creating opportunities for attackers to hide instructions in comments, strings, identifiers, or decoy code. The paper presents CodeSentinel as an inference-time defense that first extracts high-risk nodes from the code syntax tree. It then runs syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis to find triggers. Nodes flagged as adversarial or semantic triggers are removed or neutralized before the code reaches the model. This targets attacks that blend into natural-looking code across multiple known families.

Core claim

We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis to detect adversarial and natural-looking semantic triggers. Detected nodes are removed or neutralized before reaching the downstream Code LLM.

What carries the argument

Three-layer inference-time sanitizer that extracts high-risk CST nodes with Tree-sitter and applies syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis to identify and sanitize triggers.

If this is right

High-risk nodes can be isolated and removed without discarding the entire code context.
The approach covers attacks hidden in comments, strings, identifiers, and decoy code from various retrieval sources.
Node-level intervention leaves most of the original code available to the model.
The defense operates at inference time without requiring retraining of the downstream LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The syntax-tree focus could be extended to other structured data like configuration files or markup that LLMs might retrieve.
Future attacks designed to mimic benign code structure more closely would require updates to the scoring or perturbation steps.
The method's precision on individual nodes suggests it could pair with existing static analysis tools for layered security in coding environments.

Load-bearing premise

The combination of Tree-sitter node extraction with the three detection methods will catch both obvious adversarial and subtle natural-looking triggers without too many false positives or misses.

What would settle it

A collection of new attack examples that embed instructions while evading all three layers, or a large body of clean code that triggers frequent false detections and removals.

Figures

Figures reproduced from arXiv: 2606.19235 by Chia-Mu Yu, Po-Han Cheng, Wei-Bin Lee, Ying-Dar Lin, Yu-Sung Wu.

**Figure 1.** Figure 1: CodeSentinel workflow. The final detection set is Sˆ = Sˆ 1 ∪ Sˆ 2 ∪ Sˆ 3. (5) Nodes flagged by earlier layers are not evaluated by later, more expensive layers. Sanitization is applied once to Sˆ, which avoids score instability from repeatedly modifying the context during detection. This design makes CodeSentinel usable as a preAPI sanitizer for both open-weight and black-box victim models. 3.3 Layer 1:… view at source ↗

**Figure 3.** Figure 3: Cumulative performance of the three-layer [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 2.** Figure 2: Node-level ROC curve of CodeSentinel. The AUROC is 0.82. literals, identifiers are renamed through scopeconsistent renaming, and decoy blocks are removed only when reachability analysis certifies that they are unreachable. After rewriting, CodeSentinel reparses the sanitized context and rejects edits that break syntax [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: Cross-surrogate generalization when different [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Sample-level ASR on open-weight victim models before and after applying CodeSentinel. Victim Model Before After Impact Claude-3.5-Haiku 27.11 7.72 -19.39 GPT-5.1-Codex-mini 18.32 5.81 -12.51 Gemini-3.1-Flash-lite 24.14 8.28 -15.86 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Execution time per code sample. CodeSentinel keeps preprocessing latency low. code-generation or code-assistance workflows. Table 6 compares defenses along deployment-relevant dimensions, including black-box compatibility, static analysis, node-level perturbation, likelihoodbased detection, and whether model training or runtime redesign is required. C Runtime Analysis [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗

**Figure 7.** Figure 7: Alternative CodeSentinel workflow [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Illustrative example of CodeSentinel on an adversarial C prompt. The adversarial identifier biases the model, while neutralization removes the misleading lexical cue. CWE-476 denotes NULL pointer dereference. Attack Original Adaptive Impact Decoy Injection 0.82 0.74 -0.08 Copy Trigger 0.82 0.62 -0.20 Contextual Attack 0.82 0.66 -0.16 [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Standalone classification performance and inference latency of each defense layer. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Transferability effect on sample-level ASR before and after defense. [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

**Figure 11.** Figure 11: Cross-tokenizer robustness when attack payloads and defense detection use different tokenizer spaces. [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

read the original abstract

Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-injection surface where attackers hide instructions in comments, strings, identifiers, or decoy code. We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K\% scoring, and node perturbation analysis to detect adversarial and natural-looking semantic triggers. Detected nodes are removed or neutralized before reaching the downstream Code LLM. Across six recent attack families, \CodeSentinel achieves 0.80 average node-level F1, outperforming CodeGarrison, DePA, and KillBadCode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CodeSentinel names a three-layer pipeline for indirect prompt injection in code LLMs but supplies no evaluation details, so the 0.80 F1 claim cannot be checked.

read the letter

The paper's main contribution is a named three-layer sanitizer that extracts high-risk CST nodes with Tree-sitter, applies syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis, then removes or neutralizes the nodes before they reach a code LLM. It targets a practical surface: external code pulled from repos, docs, or issues that can carry hidden instructions in comments, strings, or identifiers.

The architecture itself is straightforward and directly addresses the threat model the authors describe. Framing the problem around node-level detection rather than whole-prompt filtering is a reasonable choice for code contexts.

The soft spot is the complete absence of supporting evidence for the central claim. The text lists six attack families and states an average node-level F1 of 0.80 that beats CodeGarrison, DePA, and KillBadCode, but it gives no description of those families, no labeling protocol, no measurement of false positives on clean code, no exact Min-K% formulation, no perturbation procedure, and no ablation that shows what each layer adds. Without those, the performance number is not usable.

The assumption that the combination will catch both adversarial and natural-looking triggers without excessive false positives is stated but not tested in any visible way. This is not a minor omission; it is the load-bearing part of the paper.

The work is aimed at people who build or secure code-generation tools that retrieve external context. A reader looking for a high-level defense sketch might note the pipeline, but the missing experimental details mean there is nothing solid to cite or replicate yet.

I would not send this to peer review until the authors supply the datasets, attack definitions, metric definitions, and ablations. The current version does not give referees enough to evaluate.

Referee Report

2 major / 0 minor

Summary. The paper proposes CodeSentinel, a three-layer inference-time sanitizer for defending Code LLMs against indirect prompt injection attacks hidden in external code contexts (comments, strings, identifiers). The layers consist of Tree-sitter extraction of high-risk CST nodes, syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis; detected nodes are removed or neutralized. The central empirical claim is that this pipeline achieves 0.80 average node-level F1 across six recent attack families and outperforms the baselines CodeGarrison, DePA, and KillBadCode.

Significance. If the evaluation protocol and results can be substantiated, the work addresses a timely and practically relevant security threat to code-generating LLMs that retrieve untrusted context. The combination of static CST analysis with dynamic scoring and perturbation offers a concrete, deployable defense strategy at inference time.

major comments (2)

[Abstract] Abstract: The central performance claim (0.80 average node-level F1 across six attack families, outperforming named baselines) is presented without any accompanying dataset description, attack-family definitions, node-labeling protocol, false-positive measurement on clean code, exact Dynamic Min-K% formulation, perturbation procedure, or ablation isolating each layer. This absence renders the empirical result impossible to assess or reproduce.
No section, table, or appendix supplies the experimental setup required to support the weakest assumption (that the three-layer pipeline reliably separates adversarial triggers from natural semantic code without excessive false positives). Without these elements the soundness of the 0.80 F1 figure cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the experimental protocol. We agree that the current manuscript does not provide sufficient detail on datasets, attack definitions, labeling, false-positive evaluation, scoring formulation, perturbation, and ablations to allow full assessment or reproduction of the 0.80 F1 result. We will perform a major revision to supply these elements.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claim (0.80 average node-level F1 across six attack families, outperforming named baselines) is presented without any accompanying dataset description, attack-family definitions, node-labeling protocol, false-positive measurement on clean code, exact Dynamic Min-K% formulation, perturbation procedure, or ablation isolating each layer. This absence renders the empirical result impossible to assess or reproduce.

Authors: We accept this criticism. The abstract is intentionally concise and therefore omits these details. In the revised manuscript we will expand the abstract with a brief parenthetical reference to the evaluation protocol and will ensure the main text and a new appendix contain the requested information (dataset sources and statistics, attack-family definitions with examples, node-labeling rules, clean-code false-positive rates, the precise Dynamic Min-K% formula, perturbation steps, and layer-wise ablations). revision: yes
Referee: [—] No section, table, or appendix supplies the experimental setup required to support the weakest assumption (that the three-layer pipeline reliably separates adversarial triggers from natural semantic code without excessive false positives). Without these elements the soundness of the 0.80 F1 figure cannot be evaluated.

Authors: We agree that the current version does not contain a self-contained experimental-setup section or appendix with the listed elements. The revision will add a dedicated Experimental Setup section (with subsections on data collection, attack families, labeling protocol, and metrics) plus an appendix containing the exact Dynamic Min-K% formulation, perturbation procedure, clean-code false-positive measurements, and ablation tables. These additions will directly address the concern about false positives on natural code. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claim with no derivations or self-referential reductions

full rationale

The paper presents CodeSentinel as a three-layer inference-time sanitizer using Tree-sitter CST extraction, syntax pre-filtering, Dynamic Min-K% scoring, and perturbation analysis, then reports an empirical result (0.80 average node-level F1 across six attack families). No equations, parameter-fitting steps, predictions derived from fitted inputs, uniqueness theorems, or self-citations appear in the provided abstract or described structure. The central claim is a direct experimental outcome rather than a derivation that reduces to its own inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5664 in / 1168 out tokens · 26501 ms · 2026-06-26T20:13:55.920248+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 2 linked inside Pith

[1]

Journal of Systems and Software , volume =

Poisoned source code detection in code models , author =. Journal of Systems and Software , volume =
[2]

Tsai, Chi-Chien and Yu, Chia-Mu and Lin, Ying-Dar and Wu, Yu-Sung and Lee, Wei-Bin , year =. Beyond
[3]

arXiv preprint arXiv:2108.07732 , year=

Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

Pith/arXiv arXiv
[4]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , url =

Liu, Tianyang and Xu, Canwen and McAuley, Julian , booktitle =. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , url =
[5]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

Pith/arXiv arXiv
[6]

Iterative

Huang, Li and Sun, Weifeng and Yan, Meng , year =. Iterative. 2025

2025
[7]

Štorek, Adam and Gupta, Mukur and Bhatt, Noopur and Gupta, Aditya and Kim, Janie and Srivastava, Prashast and Jana, Suman , year =
[8]

Jenko, Slobodan and Mündler, Niels and He, Jingxuan and Vero, Mark and Vechev, Martin , year =. Black-
[9]

Sun, Weisong and Chen, Yuchen and Yuan, Mengzhe and Fang, Chunrong and Chen, Zhenpeng and Wang, Chong and Liu, Yang and Xu, Baowen and Chen, Zhenyu , year =. Show
[10]

Liu, Yue and Zhao, Yanjie and Lyu, Yunbo and Zhang, Ting and Wang, Haoyu and Lo, David , year =. "
[11]

and Yu, Tianjiao and Diwan, Nirav and Wang, Gang and Hakkani-Tür, Dilek and Lourentzou, Ismini , year =

Wahed, Muntasir and Zhou, Xiaona and Nguyen, Kiet A. and Yu, Tianjiao and Diwan, Nirav and Wang, Gang and Hakkani-Tür, Dilek and Lourentzou, Ismini , year =
[12]

Yang, Yuchen and Li, Yiming and Yao, Hongwei and Yang, Bingrun and He, Yiling and Zhang, Tianwei and Tao, Dacheng and Qin, Zhan , year =
[13]

2026 , eprint=

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives , author=. 2026 , eprint=

2026
[14]

I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024

2024
[15]

ACM SIGKDD Conference on Knowledge Discovery and Data Mining , author =

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , url =. ACM SIGKDD Conference on Knowledge Discovery and Data Mining , author =
[16]

Defending against Indirect Prompt Injection by Instruction Detection , url =

Wen, Tongyu and Wang, Chenglong and Yang, Xiyuan and Tang, Haoyu and Xie, Yueqi and Lyu, Lingjuan and Dou, Zhicheng and Wu, Fangzhao , year =. Defending against Indirect Prompt Injection by Instruction Detection , url =
[17]

Chen, Yulin and Li, Haoran and Sui, Yuan and He, Yufei and Liu, Yue and Song, Yangqiu and Hooi, Bryan , year =. Can. Annual Meeting of the Association for Computational Linguistics , url =
[18]

Liu, Mickel and Jiang, Liwei and Liang, Yancheng and Du, Simon Shaolei and Choi, Yejin and Althoff, Tim and Jaques, Natasha , year =. Chasing
[19]

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models , url =

Lin, Huawei and Lao, Yingjie and Geng, Tong and Yu, Tan and Zhao, Weijie , year =. UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models , url =
[20]

Smoke and

Ouyang, Sheng and Qin, Yihao and Lin, Bo and Chen, Liqian and Mao, Xiaoguang and Wang, Shangwen , year =. Smoke and
[21]

Jiao, Yang and Wang, Xiaodong and Yang, Kai , year =
[22]

Li, Haoyang and Gao, Huan and Zhao, Zhiyuan and Lin, Zhiyu and Gao, Junyu and Li, Xuelong , year =
[23]

Li, Haoyang and Li, Mingjin and Zuo, Jinxin and Li, Siqi and Li, Xiao and Wu, Hao and Lu, Yueming and He, Xiaochuan , year =
[24]

ACM Conference on Computer and Communications Security (CCS) , author =
[25]

Chen, Sizhe and Piet, Julien and Sitawarin, Chawin and Wagner, David , year =
[26]

Defeating

Debenedetti, Edoardo and Shumailov, Ilia and Fan, Tianqi and Hayes, Jamie and Carlini, Nicholas and Fabian, Daniel and Kern, Christoph and Shi, Chongyang and Terzis, Andreas and Tramèr, Florian , year =. Defeating

[1] [1]

Journal of Systems and Software , volume =

Poisoned source code detection in code models , author =. Journal of Systems and Software , volume =

[2] [2]

Tsai, Chi-Chien and Yu, Chia-Mu and Lin, Ying-Dar and Wu, Yu-Sung and Lee, Wei-Bin , year =. Beyond

[3] [3]

arXiv preprint arXiv:2108.07732 , year=

Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

Pith/arXiv arXiv

[4] [4]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , url =

Liu, Tianyang and Xu, Canwen and McAuley, Julian , booktitle =. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , url =

[5] [5]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

Pith/arXiv arXiv

[6] [6]

Iterative

Huang, Li and Sun, Weifeng and Yan, Meng , year =. Iterative. 2025

2025

[7] [7]

Štorek, Adam and Gupta, Mukur and Bhatt, Noopur and Gupta, Aditya and Kim, Janie and Srivastava, Prashast and Jana, Suman , year =

[8] [8]

Jenko, Slobodan and Mündler, Niels and He, Jingxuan and Vero, Mark and Vechev, Martin , year =. Black-

[9] [9]

Sun, Weisong and Chen, Yuchen and Yuan, Mengzhe and Fang, Chunrong and Chen, Zhenpeng and Wang, Chong and Liu, Yang and Xu, Baowen and Chen, Zhenyu , year =. Show

[10] [10]

Liu, Yue and Zhao, Yanjie and Lyu, Yunbo and Zhang, Ting and Wang, Haoyu and Lo, David , year =. "

[11] [11]

and Yu, Tianjiao and Diwan, Nirav and Wang, Gang and Hakkani-Tür, Dilek and Lourentzou, Ismini , year =

Wahed, Muntasir and Zhou, Xiaona and Nguyen, Kiet A. and Yu, Tianjiao and Diwan, Nirav and Wang, Gang and Hakkani-Tür, Dilek and Lourentzou, Ismini , year =

[12] [12]

Yang, Yuchen and Li, Yiming and Yao, Hongwei and Yang, Bingrun and He, Yiling and Zhang, Tianwei and Tao, Dacheng and Qin, Zhan , year =

[13] [13]

2026 , eprint=

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives , author=. 2026 , eprint=

2026

[14] [14]

I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024

2024

[15] [15]

ACM SIGKDD Conference on Knowledge Discovery and Data Mining , author =

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , url =. ACM SIGKDD Conference on Knowledge Discovery and Data Mining , author =

[16] [16]

Defending against Indirect Prompt Injection by Instruction Detection , url =

Wen, Tongyu and Wang, Chenglong and Yang, Xiyuan and Tang, Haoyu and Xie, Yueqi and Lyu, Lingjuan and Dou, Zhicheng and Wu, Fangzhao , year =. Defending against Indirect Prompt Injection by Instruction Detection , url =

[17] [17]

Chen, Yulin and Li, Haoran and Sui, Yuan and He, Yufei and Liu, Yue and Song, Yangqiu and Hooi, Bryan , year =. Can. Annual Meeting of the Association for Computational Linguistics , url =

[18] [18]

Liu, Mickel and Jiang, Liwei and Liang, Yancheng and Du, Simon Shaolei and Choi, Yejin and Althoff, Tim and Jaques, Natasha , year =. Chasing

[19] [19]

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models , url =

Lin, Huawei and Lao, Yingjie and Geng, Tong and Yu, Tan and Zhao, Weijie , year =. UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models , url =

[20] [20]

Smoke and

Ouyang, Sheng and Qin, Yihao and Lin, Bo and Chen, Liqian and Mao, Xiaoguang and Wang, Shangwen , year =. Smoke and

[21] [21]

Jiao, Yang and Wang, Xiaodong and Yang, Kai , year =

[22] [22]

Li, Haoyang and Gao, Huan and Zhao, Zhiyuan and Lin, Zhiyu and Gao, Junyu and Li, Xuelong , year =

[23] [23]

Li, Haoyang and Li, Mingjin and Zuo, Jinxin and Li, Siqi and Li, Xiao and Wu, Hao and Lu, Yueming and He, Xiaochuan , year =

[24] [24]

ACM Conference on Computer and Communications Security (CCS) , author =

[25] [25]

Chen, Sizhe and Piet, Julien and Sitawarin, Chawin and Wagner, David , year =

[26] [26]

Defeating

Debenedetti, Edoardo and Shumailov, Ilia and Fan, Tianqi and Hayes, Jamie and Carlini, Nicholas and Fabian, Daniel and Kern, Christoph and Shi, Chongyang and Terzis, Andreas and Tramèr, Florian , year =. Defeating