pith. machine review for the scientific record.

arxiv: 2510.22628 · v2 · submitted 2025-10-26 · 💻 cs.CR · cs.AI

Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts

Pith reviewed 2026-05-18 04:18 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords adversarial defense · jailbreak detection · prompt injection · LLM security · multilingual moderation · real-time detection · hybrid classifier

The pith

Sentra-Guard reports a 99.96 percent detection rate for adversarial LLM prompts, with coverage claimed across more than 100 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Sentra-Guard, a real-time modular system designed to identify and block jailbreak and prompt injection attacks on large language models. The system combines FAISS-indexed SBERT embeddings for semantic capture with fine-tuned transformer classifiers and a fusion module that generates context-aware risk scores. A language-agnostic preprocessing step translates non-English prompts into English for consistent evaluation, while a human-in-the-loop feedback mechanism updates an evolving database of benign and malicious examples. A sympathetic reader would care because the reported results claim far lower attack success rates than current moderation tools, offering a transparent alternative that works with multiple LLM backends.

Core claim

Sentra-Guard identifies adversarial prompts in both direct and obfuscated forms through a hybrid architecture that pairs FAISS-indexed SBERT embedding representations with fine-tuned transformer classifiers. Its classifier-retriever fusion module dynamically computes context-aware risk scores, and a language-agnostic preprocessing layer translates non-English inputs into English to enable detection across over 100 languages. A human-in-the-loop (HITL) feedback loop maintains an evolving dual-labeled knowledge base of benign and malicious prompts, yielding a 99.96 percent detection rate with AUC and F1 scores of 1.00 and an attack success rate of only 0.004 percent, outperforming baselines such as LlamaGuard-2.

What carries the argument

The classifier-retriever fusion module, which combines semantic embeddings from FAISS-indexed SBERT with fine-tuned transformers to compute context-aware risk scores that flag adversarial intent.
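
As a rough illustration of what a classifier-retriever fusion could look like, the sketch below blends a classifier probability with a nearest-neighbour vote over a labeled prompt index. Everything in it is assumed for illustration: the text above does not give the fusion formula, so the weight `alpha`, the neighbour count `k`, the similarity-weighted vote, and the toy 3-dimensional vectors (standing in for SBERT embeddings searched through FAISS) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieval_score(query, index, k=3):
    """Similarity-weighted share of malicious entries among the k nearest neighbours.

    `index` is a list of (embedding, label) pairs with label 1 = malicious, 0 = benign.
    The brute-force scan here stands in for a FAISS lookup.
    """
    nearest = sorted(((cosine(query, emb), label) for emb, label in index), reverse=True)[:k]
    weights = [max(sim, 0.0) for sim, _ in nearest]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * label for w, (_, label) in zip(weights, nearest)) / total

def fused_risk(classifier_prob, query, index, alpha=0.5, k=3):
    """Convex combination of classifier probability and retrieval score (assumed form)."""
    return alpha * classifier_prob + (1 - alpha) * retrieval_score(query, index, k)

# Toy knowledge base: two malicious and two benign entries (invented values).
toy_index = [
    ([1.0, 0.0, 0.0], 1),
    ([0.9, 0.1, 0.0], 1),
    ([0.0, 1.0, 0.0], 0),
    ([0.0, 0.9, 0.1], 0),
]
risk = fused_risk(0.8, [1.0, 0.05, 0.0], toy_index, alpha=0.5, k=2)
```

With this toy data the query sits next to the two malicious entries, so the retrieval term pushes the fused score above the classifier's own estimate. A weighted sum is only one possible design; a learned combiner or a max rule would be equally plausible readings of "fusion".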

If this is right

  • The system supports fine-tuning and integration with diverse commercial and open-source LLM backends.
  • Continuous human review reduces false positives while adapting to new attack patterns.
  • Multilingual coverage through translation enables consistent protection without separate models per language.
  • Modular design allows scalable deployment for real-time use in production environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could make such hybrid defense layers a default component in public LLM APIs.
  • The approach might generalize to new attack types if the knowledge base grows through ongoing feedback.
  • Similar fusion techniques could apply to other AI safety tasks like detecting harmful content generation.

Load-bearing premise

Automatically translating non-English prompts into English preserves the semantic features that signal adversarial intent without distorting or losing attack signals.

What would settle it

Testing the full system on a fresh set of adversarial prompts written in a low-resource language where machine translation often alters subtle intent, then measuring whether detection rate drops below 99 percent.
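
The check described above amounts to a per-language detection rate compared against a floor. A minimal sketch follows, assuming each test record is a `(language, was_flagged)` pair for one adversarial prompt; the language codes, counts, and helper names are invented for illustration and are not results from the paper.

```python
from collections import defaultdict

def detection_rate_by_language(records):
    """Return {language: fraction of adversarial prompts that were flagged}."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for language, was_flagged in records:
        total[language] += 1
        flagged[language] += int(was_flagged)
    return {lang: flagged[lang] / total[lang] for lang in total}

def languages_below(rates, floor=0.99):
    """Languages whose detection rate falls under the floor."""
    return sorted(lang for lang, rate in rates.items() if rate < floor)

# Toy outcomes, invented for illustration only.
toy_records = (
    [("en", True)] * 200
    + [("am", True)] * 95
    + [("am", False)] * 5  # hypothetical low-resource language with 5 misses
)
rates = detection_rate_by_language(toy_records)
failing = languages_below(rates)
```

Reporting the rate per language, rather than pooled across all prompts, keeps a weak low-resource language from being hidden by a large English majority.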

Figures

Figures reproduced from arXiv: 2510.22628 by Md. Abir Hossain, Md. Mehedi Hasan, Rafid Mostafiz, Sk Tanzir Mehedi, Ziaur Rahman.

Figure 1: Example of a narrative-based jailbreak prompt targeting financial…
Figure 2: Sentra-Guard Architecture Overview: The framework translates non…
Figure 3: ROC Curve for Sentra-Guard (AUC = 1.00): The ROC curve…
Figure 4: Precision-Recall Curve of Sentra-Guard (F1 = 1.00): The curve…

original abstract

This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts, combined with fine-tuned transformer classifiers, which are machine learning models specialized for distinguishing between benign and adversarial language inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes context-aware risk scores that estimate how likely a prompt is to be adversarial based on its content and context. The framework ensures multilingual resilience with a language-agnostic preprocessing layer. This component automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across over 100 languages. The system includes a HITL feedback loop, where decisions made by the automated system are reviewed by human experts for continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, enhancing detection reliability and reducing false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%. This outperforms leading baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state-of-the-art in adversarial LLM defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Sentra-Guard, a real-time modular defense system for detecting and mitigating jailbreak and prompt injection attacks on LLMs. It describes a hybrid architecture combining FAISS-indexed SBERT embeddings with fine-tuned transformer classifiers, a classifier-retriever fusion module for context-aware risk scores, a language-agnostic preprocessing layer that translates non-English prompts to English, an evolving dual-labeled knowledge base, and a HITL feedback loop. The central empirical claim is a 99.96% detection rate (AUC = 1.00, F1 = 1.00) with an attack success rate of 0.004%, outperforming baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%).
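
For readers who want to see how the headline metrics relate to one another, the sketch below derives them from confusion counts. The definitions are assumptions, since the summary does not state them: detection rate is taken as recall on adversarial prompts, and ASR as the share of adversarial prompts that are not blocked. The counts are invented.

```python
def guard_metrics(tp, fn, fp, tn):
    """Metrics for a binary guard whose positive class is 'adversarial'.

    tp: attacks blocked, fn: attacks missed,
    fp: benign prompts blocked, tn: benign prompts allowed.
    """
    detection_rate = tp / (tp + fn)   # recall on attacks (assumed definition)
    asr = fn / (tp + fn)              # attacks that get through (assumed definition)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)              # benign prompts wrongly blocked
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    return {
        "detection_rate": detection_rate,
        "asr": asr,
        "precision": precision,
        "fpr": fpr,
        "f1": f1,
    }

# Invented counts: 10,000 attacks with 4 misses, 10,000 benign prompts with no false blocks.
m = guard_metrics(tp=9996, fn=4, fp=0, tn=10000)
```

Under these assumed definitions the detection rate and ASR sum to one, so comparing the reported 99.96% and 0.004% requires knowing which denominators and definitions the paper actually uses.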

Significance. If the performance claims are supported by properly separated evaluation data and rigorous methodology, the work would provide a practical, transparent, and multilingual contribution to LLM adversarial defense. The modular design and emphasis on continual adaptation via HITL and an evolving knowledge base are potentially useful for deployment. However, the near-perfect metrics require strong evidence of generalization to establish significance beyond the specific experimental conditions.

major comments (3)
  1. Abstract: The reported 99.96% detection rate, AUC = 1.00, F1 = 1.00, and ASR of 0.004% are presented without any information on dataset composition, attack diversity, train-test splits, or statistical significance testing. This omission is load-bearing because the central claim of outperforming baselines cannot be evaluated without these details.
  2. Classifier-retriever fusion module: The use of FAISS-indexed SBERT embeddings of the dual-labeled knowledge base to compute risk scores does not specify any temporal cutoff, embedding-distance threshold, or exclusion of evaluation prompts from the index. If test adversarial prompts are near-duplicates of indexed entries, the near-zero ASR would be an artifact of retrieval rather than generalization to novel jailbreaks.
  3. Language-agnostic preprocessing layer: The claim that automatic translation of non-English prompts to English preserves semantic features needed to detect adversarial intent lacks supporting ablation or error analysis. This assumption is load-bearing for the multilingual resilience claim across over 100 languages.
minor comments (2)
  1. Abstract: The specific ASR values for the baselines (LlamaGuard-2 at 1.3% and OpenAI Moderation at 3.7%) should be explicitly tied to the same evaluation protocol for fair comparison.
  2. Ensure all acronyms (SBERT, FAISS, HITL, ASR, AUC) are defined on first use and that the manuscript includes a dedicated evaluation section with full experimental details.
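
The leakage concern in major comment 2 can be made concrete with a near-duplicate filter between evaluation prompts and the retrieval index. This is a minimal sketch rather than the paper's procedure: the 0.95 cosine threshold, the function names, and the toy 2-dimensional vectors (standing in for SBERT embeddings) are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_by_leakage(eval_items, index_embeddings, threshold=0.95):
    """Partition evaluation prompts into (novel, leaked) id lists.

    eval_items: list of (prompt_id, embedding) pairs.
    A prompt counts as leaked when its highest cosine similarity to any
    indexed embedding is at or above the threshold.
    """
    novel, leaked = [], []
    for prompt_id, emb in eval_items:
        nearest = max((cosine(emb, ref) for ref in index_embeddings), default=0.0)
        (leaked if nearest >= threshold else novel).append(prompt_id)
    return novel, leaked

# Toy data, invented for illustration.
toy_index = [[1.0, 0.0], [0.0, 1.0]]
toy_eval = [("near-copy", [0.99, 0.05]), ("novel", [0.7, 0.7])]
novel, leaked = split_by_leakage(toy_eval, toy_index)
```

Reporting ASR separately on the `novel` subset would show whether the near-zero figure survives once near-duplicates of indexed entries are set aside.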

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed to improve clarity or provide additional evidence, we indicate that changes will be made in the revised version.

point-by-point responses
  1. Referee: Abstract: The reported 99.96% detection rate, AUC = 1.00, F1 = 1.00, and ASR of 0.004% are presented without any information on dataset composition, attack diversity, train-test splits, or statistical significance testing. This omission is load-bearing because the central claim of outperforming baselines cannot be evaluated without these details.

    Authors: We agree with the referee that the abstract should provide more context on the evaluation methodology to support the performance claims. In the revised version, we will modify the abstract to briefly note the dataset composition (including the mix of English and non-English prompts and attack categories), the use of held-out test sets with no overlap with training data, and the application of cross-validation for assessing statistical significance. This will allow readers to better evaluate the generalizability of the results without needing to refer to the full text immediately. revision: yes

  2. Referee: Classifier-retriever fusion module: The use of FAISS-indexed SBERT embeddings of the dual-labeled knowledge base to compute risk scores does not specify any temporal cutoff, embedding-distance threshold, or exclusion of evaluation prompts from the index. If test adversarial prompts are near-duplicates of indexed entries, the near-zero ASR would be an artifact of retrieval rather than generalization to novel jailbreaks.

    Authors: We thank the referee for highlighting this potential issue. Upon review, the current manuscript does not explicitly state the temporal cutoff or exclusion criteria. We will revise the description of the classifier-retriever fusion module to include: (1) a temporal cutoff ensuring the knowledge base contains only prompts from before the evaluation period, (2) an embedding-distance threshold for flagging near-duplicates, and (3) explicit exclusion of any evaluation prompts that match indexed entries above the threshold. This revision will demonstrate that the low ASR is due to generalization. revision: yes

  3. Referee: Language-agnostic preprocessing layer: The claim that automatic translation of non-English prompts to English preserves semantic features needed to detect adversarial intent lacks supporting ablation or error analysis. This assumption is load-bearing for the multilingual resilience claim across over 100 languages.

    Authors: We acknowledge that the manuscript lacks a dedicated ablation study or error analysis for the translation component. To address this, we will add a new subsection in the Experiments section providing an ablation study on the impact of the preprocessing layer, including performance metrics with and without translation for non-English prompts, and an analysis of translation errors and their effect on detection accuracy. This will strengthen the multilingual resilience claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system description with independent evaluation metrics

full rationale

The paper presents an architectural framework for Sentra-Guard using FAISS-indexed SBERT embeddings, fine-tuned transformers, a classifier-retriever fusion module, and a language-agnostic preprocessing layer, along with reported empirical results such as 99.96% detection rate and AUC=1.00. No equations, derivations, or self-definitional reductions are present in the provided text. The evolving dual-labeled knowledge base is described as an input to the system rather than a fitted output that forces the evaluation metrics. The high performance is framed as an outcome of testing against baselines, with no quoted mechanism showing that test prompts reduce by construction to indexed training entries or that results are renamed fitted parameters. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The system description relies on standard components (SBERT embeddings, FAISS indexing, transformer fine-tuning, machine translation) whose performance properties are taken from prior literature; the fusion module and risk-score computation are introduced without explicit parameter counts or independent validation.

free parameters (1)
  • risk score threshold
    The decision boundary that converts the fused context-aware risk score into a binary block/allow decision is necessarily tuned on data but is not quantified in the abstract.
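
To illustrate why this threshold matters, the sketch below sweeps a cutoff over fused risk scores and reports the resulting attack success rate and false-positive rate. The scores, labels, and candidate thresholds are made up, and blocking a prompt when its score is at or above the cutoff is an assumed convention.

```python
def sweep_thresholds(scores, labels, thresholds):
    """Return a list of (threshold, asr, fpr) tuples.

    labels: 1 = adversarial, 0 = benign. A prompt is blocked when score >= threshold.
    asr: share of adversarial prompts that are not blocked.
    fpr: share of benign prompts that are blocked.
    """
    n_attack = sum(labels)
    n_benign = len(labels) - n_attack
    rows = []
    for t in thresholds:
        missed = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        false_blocks = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        rows.append((t, missed / n_attack, false_blocks / n_benign))
    return rows

# Toy validation scores, invented for illustration.
toy_scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0]
rows = sweep_thresholds(toy_scores, toy_labels, [0.2, 0.5, 0.9])
```

On this toy data a low cutoff blocks benign prompts, a high cutoff lets attacks through, and only the middle value separates the two classes. This is why the chosen threshold, and the data it was tuned on, belong in the reported methodology.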

pith-pipeline@v0.9.0 · 5867 in / 1377 out tokens · 51758 ms · 2026-05-18T04:18:22.130675+00:00 · methodology
