arxiv: 2510.22628 · v2 · submitted 2025-10-26 · 💻 cs.CR · cs.AI

Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts

Md. Mehedi Hasan, Sk Tanzir Mehedi, Ziaur Rahman, Rafid Mostafiz, Md. Abir Hossain

Pith reviewed 2026-05-18 04:18 UTC · model grok-4.3

keywords adversarial defense, jailbreak detection, prompt injection, LLM security, multilingual moderation, real-time detection, hybrid classifier

The pith

Sentra-Guard reports a 99.96 percent detection rate on adversarial LLM prompts and covers more than 100 languages by translating inputs into English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Sentra-Guard, a real-time modular system designed to identify and block jailbreak and prompt injection attacks on large language models. The system combines FAISS-indexed SBERT embeddings for semantic capture with fine-tuned transformer classifiers and a fusion module that generates context-aware risk scores. A language-agnostic preprocessing step translates non-English prompts into English for consistent evaluation, while a human-in-the-loop feedback mechanism updates an evolving database of benign and malicious examples. A sympathetic reader would care because the reported results claim far lower attack success rates than current moderation tools, offering a transparent alternative that works with multiple LLM backends.

Core claim

Sentra-Guard identifies adversarial prompts in both direct and obfuscated forms through a hybrid architecture that pairs FAISS-indexed SBERT embedding representations with fine-tuned transformer classifiers. Its classifier-retriever fusion module dynamically computes context-aware risk scores, and a language-agnostic preprocessing layer translates non-English inputs into English to enable detection across over 100 languages. A HITL feedback loop maintains an evolving dual-labeled knowledge base of benign and malicious prompts. The reported result is a 99.96 percent detection rate with AUC and F1 scores of 1.00 and an attack success rate of only 0.004 percent, below that of baselines such as LlamaGuard-2.

What carries the argument

The classifier-retriever fusion module, which combines semantic embeddings from FAISS-indexed SBERT with fine-tuned transformers to compute context-aware risk scores that flag adversarial intent.
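
To make the idea of classifier-retriever fusion concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not give the fusion rule, so the weighted average, the `alpha` weight, the `k` neighbour count, and the function names `retrieval_risk` and `fused_risk` are all assumptions chosen for illustration. Toy vectors stand in for SBERT embeddings, and a linear scan stands in for a FAISS index.

```python
import math
from typing import List, Sequence, Tuple

# Each knowledge-base entry: (embedding, label) with label 1 = malicious, 0 = benign.
Entry = Tuple[Sequence[float], int]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieval_risk(query: Sequence[float], knowledge_base: List[Entry], k: int = 3) -> float:
    """Similarity-weighted share of malicious labels among the k nearest entries."""
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = [max(sim, 0.0) for sim, _ in scored]  # ignore negative similarities
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * label for w, (_, label) in zip(weights, scored)) / total


def fused_risk(classifier_prob: float, retriever_score: float, alpha: float = 0.5) -> float:
    """Hypothetical fusion rule: convex combination of the two signals."""
    return alpha * classifier_prob + (1.0 - alpha) * retriever_score


# Toy usage: 2-D vectors in place of real sentence embeddings.
kb: List[Entry] = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0)]
risk = fused_risk(classifier_prob=0.8, retriever_score=retrieval_risk([1.0, 0.0], kb, k=2))
```

A convex combination is only one possible choice; a learned combiner or a max over the two signals would fit the same description in the summary.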

If this is right

The system supports fine-tuning and integration with diverse commercial and open-source LLM backends.
Continuous human review reduces false positives while adapting to new attack patterns.
Multilingual coverage through translation enables consistent protection without separate models per language.
Modular design allows scalable deployment for real-time use in production environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption could make such hybrid defense layers a default component in public LLM APIs.
The approach might generalize to new attack types if the knowledge base grows through ongoing feedback.
Similar fusion techniques could apply to other AI safety tasks like detecting harmful content generation.

Load-bearing premise

Automatically translating non-English prompts into English preserves the semantic features that signal adversarial intent without distorting or losing attack signals.

What would settle it

Testing the full system on a fresh set of adversarial prompts written in a low-resource language where machine translation often alters subtle intent, then measuring whether detection rate drops below 99 percent.
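
The test described above reduces to computing a detection rate per language and checking it against the 99 percent floor. The sketch below shows that bookkeeping under stated assumptions: every record is an adversarial prompt, `flagged` is whatever the system under test returned, and the language codes and outcomes are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per adversarial test prompt: (language code, was it flagged?).
Record = Tuple[str, bool]


def detection_rate_by_language(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of adversarial prompts flagged, grouped by language."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for language, flagged in records:
        totals[language] += 1
        hits[language] += int(flagged)
    return {lang: hits[lang] / totals[lang] for lang in totals}


def languages_below(rates: Dict[str, float], floor: float = 0.99) -> List[str]:
    """Languages whose detection rate falls under the chosen floor."""
    return sorted(lang for lang, rate in rates.items() if rate < floor)


# Placeholder outcomes for illustration only, not measured data.
toy_records: List[Record] = [("en", True)] * 4 + [("yo", True), ("yo", False)]
toy_rates = detection_rate_by_language(toy_records)
failing = languages_below(toy_rates)
```

Reporting the rate per language, rather than one pooled number, is what lets a drop in a low-resource language show up instead of being averaged away by high-resource ones.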

Figures

Figures reproduced from arXiv: 2510.22628 by Md. Abir Hossain, Md. Mehedi Hasan, Rafid Mostafiz, Sk Tanzir Mehedi, Ziaur Rahman.

**Figure 2.** Sentra-Guard Architecture Overview: The framework translates non…

**Figure 3.** ROC Curve for Sentra-Guard (AUC = 1.00): The ROC curve…

**Figure 4.** Precision-Recall Curve of Sentra-Guard (F1 = 1.00): The curve…

Original abstract

This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts, combined with fine-tuned transformer classifiers, which are machine learning models specialized for distinguishing between benign and adversarial language inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes context-aware risk scores that estimate how likely a prompt is to be adversarial based on its content and context. The framework ensures multilingual resilience with a language-agnostic preprocessing layer. This component automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across over 100 languages. The system includes a HITL feedback loop, where decisions made by the automated system are reviewed by human experts for continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, enhancing detection reliability and reducing false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%. This outperforms leading baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state-of-the-art in adversarial LLM defense.
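
A short arithmetic check helps put the abstract's percentages in scale. The sketch below assumes ASR means the fraction of attack prompts that get through; the abstract does not state the size of the evaluation set, so nothing here says how many prompts were actually tested. The helper name `percent` is illustrative.

```python
import math
from fractions import Fraction


def percent(text: str) -> Fraction:
    """Parse a percentage such as '0.004' into an exact fraction."""
    return Fraction(text) / 100


sentra_asr = percent("0.004")      # reported for Sentra-Guard
llamaguard2_asr = percent("1.3")   # reported for LlamaGuard-2
openai_mod_asr = percent("3.7")    # reported for OpenAI Moderation

# Smallest attack set on which a rate this low can correspond to one whole success.
min_attacks_for_one_success = math.ceil(1 / sentra_asr)

# How many times higher each baseline's reported ASR is.
ratio_llamaguard2 = llamaguard2_asr / sentra_asr
ratio_openai_mod = openai_mod_asr / sentra_asr

print(min_attacks_for_one_success, ratio_llamaguard2, ratio_openai_mod)
```

Under that reading, 0.004 percent is one success in 25,000 attack prompts, and the baseline figures are 325 and 925 times higher. A rate this small can only be measured directly on an attack set of at least that size, which is one reason the dataset details matter.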

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Sentra-Guard's near-perfect metrics look promising but hinge on unshown separation between test attacks and the indexed knowledge base.

Colleague, the core takeaway is that this system combines retrieval, classification, translation, and human feedback into a deployable defense, yet the headline results cannot be assessed without seeing how the test prompts relate to the evolving knowledge base.

The paper describes a modular setup that indexes SBERT embeddings with FAISS, runs fine-tuned transformers for risk scoring, translates non-English inputs, and updates a dual-labeled store through HITL loops. That specific assembly for real-time multilingual coverage is the concrete engineering step it contributes. The architecture is laid out plainly enough that someone could replicate the flow and adapt it to their own LLM stack. Transparency and backend compatibility are called out as advantages over purely black-box moderation tools. The multilingual preprocessing and continual adaptation via human review address real deployment needs that many current defenses ignore.

The soft spot sits in the evaluation. The abstract gives 99.96 percent detection, perfect AUC and F1, and 0.004 percent attack success rate while beating LlamaGuard-2 and OpenAI Moderation, but supplies no dataset description, attack generation method, train-test split, or distance cutoff between indexed examples and test cases. If the adversarial prompts used for testing were drawn from or near the same distribution as the knowledge base, the numbers would largely reflect retrieval success rather than generalization to fresh jailbreaks. The translation layer's assumption that semantic attack signals survive conversion also goes untested in the provided summary.

This work targets practitioners who need concrete, tunable components for production LLM safety rather than theorists seeking new principles. A reader building or hardening commercial systems could extract useful design patterns even if the numbers require verification.

I would send it to peer review so the experimental section can be examined for proper controls and ablations; the practical framing is solid enough to justify referee time once the data details are supplied.

Referee Report

3 major / 2 minor

Summary. The paper presents Sentra-Guard, a real-time modular defense system for detecting and mitigating jailbreak and prompt injection attacks on LLMs. It describes a hybrid architecture combining FAISS-indexed SBERT embeddings with fine-tuned transformer classifiers, a classifier-retriever fusion module for context-aware risk scores, a language-agnostic preprocessing layer that translates non-English prompts to English, an evolving dual-labeled knowledge base, and a HITL feedback loop. The central empirical claim is a 99.96% detection rate (AUC = 1.00, F1 = 1.00) with an attack success rate of 0.004%, outperforming baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%).

Significance. If the performance claims are supported by properly separated evaluation data and rigorous methodology, the work would provide a practical, transparent, and multilingual contribution to LLM adversarial defense. The modular design and emphasis on continual adaptation via HITL and an evolving knowledge base are potentially useful for deployment. However, the near-perfect metrics require strong evidence of generalization to establish significance beyond the specific experimental conditions.

major comments (3)

Abstract: The reported 99.96% detection rate, AUC = 1.00, F1 = 1.00, and ASR of 0.004% are presented without any information on dataset composition, attack diversity, train-test splits, or statistical significance testing. This omission is load-bearing because the central claim of outperforming baselines cannot be evaluated without these details.
Classifier-retriever fusion module: The use of FAISS-indexed SBERT embeddings of the dual-labeled knowledge base to compute risk scores does not specify any temporal cutoff, embedding-distance threshold, or exclusion of evaluation prompts from the index. If test adversarial prompts are near-duplicates of indexed entries, the near-zero ASR would be an artifact of retrieval rather than generalization to novel jailbreaks.
Language-agnostic preprocessing layer: The claim that automatic translation of non-English prompts to English preserves semantic features needed to detect adversarial intent lacks supporting ablation or error analysis. This assumption is load-bearing for the multilingual resilience claim across over 100 languages.
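
The second major comment, on overlap between test prompts and the indexed knowledge base, can be checked mechanically. The sketch below is one way to do it under assumptions: embeddings are given as plain vectors, cosine similarity is the distance measure, and the 0.95 cutoff is an arbitrary illustrative value, not one taken from the paper. The function name `near_duplicates` is hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def near_duplicates(
    test_vectors: List[Sequence[float]],
    index_vectors: List[Sequence[float]],
    threshold: float = 0.95,
) -> List[int]:
    """Positions of test vectors whose closest indexed entry is at or above the threshold."""
    flagged = []
    for position, test_vec in enumerate(test_vectors):
        best = max((cosine(test_vec, vec) for vec in index_vectors), default=0.0)
        if best >= threshold:
            flagged.append(position)
    return flagged


# Toy vectors in place of real embeddings.
index_vectors = [[1.0, 0.0]]
test_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
leaked = near_duplicates(test_vectors, index_vectors)
```

Reporting metrics separately on the flagged and unflagged subsets would show how much of the headline result depends on prompts that are close to indexed entries.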

minor comments (2)

Abstract: The specific ASR values for the baselines (LlamaGuard-2 at 1.3% and OpenAI Moderation at 3.7%) should be explicitly tied to the same evaluation protocol for fair comparison.
Ensure all acronyms (SBERT, FAISS, HITL, ASR, AUC) are defined on first use and that the manuscript includes a dedicated evaluation section with full experimental details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed to improve clarity or provide additional evidence, we indicate that changes will be made in the revised version.

Referee: Abstract: The reported 99.96% detection rate, AUC = 1.00, F1 = 1.00, and ASR of 0.004% are presented without any information on dataset composition, attack diversity, train-test splits, or statistical significance testing. This omission is load-bearing because the central claim of outperforming baselines cannot be evaluated without these details.

Authors: We agree with the referee that the abstract should provide more context on the evaluation methodology to support the performance claims. In the revised version, we will modify the abstract to briefly note the dataset composition (including the mix of English and non-English prompts and attack categories), the use of held-out test sets with no overlap with training data, and the application of cross-validation for assessing statistical significance. This will allow readers to better evaluate the generalizability of the results without needing to refer to the full text immediately.
revision: yes
Referee: Classifier-retriever fusion module: The use of FAISS-indexed SBERT embeddings of the dual-labeled knowledge base to compute risk scores does not specify any temporal cutoff, embedding-distance threshold, or exclusion of evaluation prompts from the index. If test adversarial prompts are near-duplicates of indexed entries, the near-zero ASR would be an artifact of retrieval rather than generalization to novel jailbreaks.

Authors: We thank the referee for highlighting this potential issue. Upon review, the current manuscript does not explicitly state the temporal cutoff or exclusion criteria. We will revise the description of the classifier-retriever fusion module to include: (1) a temporal cutoff ensuring the knowledge base contains only prompts from before the evaluation period, (2) an embedding-distance threshold for flagging near-duplicates, and (3) explicit exclusion of any evaluation prompts that match indexed entries above the threshold. This revision will demonstrate that the low ASR is due to generalization.
revision: yes
Referee: Language-agnostic preprocessing layer: The claim that automatic translation of non-English prompts to English preserves semantic features needed to detect adversarial intent lacks supporting ablation or error analysis. This assumption is load-bearing for the multilingual resilience claim across over 100 languages.

Authors: We acknowledge that the manuscript lacks a dedicated ablation study or error analysis for the translation component. To address this, we will add a new subsection in the Experiments section providing an ablation study on the impact of the preprocessing layer, including performance metrics with and without translation for non-English prompts, and an analysis of translation errors and their effect on detection accuracy. This will strengthen the multilingual resilience claim.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system description with independent evaluation metrics

The paper presents an architectural framework for Sentra-Guard using FAISS-indexed SBERT embeddings, fine-tuned transformers, a classifier-retriever fusion module, and a language-agnostic preprocessing layer, along with reported empirical results such as 99.96% detection rate and AUC=1.00. No equations, derivations, or self-definitional reductions are present in the provided text. The evolving dual-labeled knowledge base is described as an input to the system rather than a fitted output that forces the evaluation metrics. The high performance is framed as an outcome of testing against baselines, with no quoted mechanism showing that test prompts reduce by construction to indexed training entries or that results are renamed fitted parameters. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The system description relies on standard components (SBERT embeddings, FAISS indexing, transformer fine-tuning, machine translation) whose performance properties are taken from prior literature; the fusion module and risk-score computation are introduced without explicit parameter counts or independent validation.

free parameters (1)

risk score threshold
The decision boundary that converts the fused context-aware risk score into a binary block/allow decision is necessarily tuned on data but is not quantified in the abstract.
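
The effect of this one threshold can be illustrated with a small sketch. It assumes a prompt is blocked when its fused risk score is at or above the threshold, that ASR is the share of adversarial prompts allowed through, and that the false-positive rate is the share of benign prompts blocked. The scores are invented toy values, and `decide` and `rates_at` are hypothetical helper names.

```python
from typing import Dict, Sequence


def decide(risk: float, threshold: float) -> str:
    """Turn a risk score into a binary decision."""
    return "block" if risk >= threshold else "allow"


def rates_at(
    threshold: float,
    attack_risks: Sequence[float],
    benign_risks: Sequence[float],
) -> Dict[str, float]:
    """Attack success rate and false-positive rate at one threshold."""
    missed = sum(1 for r in attack_risks if decide(r, threshold) == "allow")
    blocked_benign = sum(1 for r in benign_risks if decide(r, threshold) == "block")
    return {
        "asr": missed / len(attack_risks),
        "fpr": blocked_benign / len(benign_risks),
    }


# Invented scores for illustration only.
toy_attack_risks = [0.95, 0.80, 0.40, 0.99]
toy_benign_risks = [0.05, 0.30, 0.60, 0.10]

# Sweeping the threshold shows the trade-off between the two rates.
sweep = {t: rates_at(t, toy_attack_risks, toy_benign_risks) for t in (0.35, 0.5, 0.7)}
```

On these toy scores, lowering the threshold to 0.35 removes the missed attack but blocks a benign prompt, while raising it to 0.7 does the reverse. Any single reported ASR is therefore tied to a specific threshold.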

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: hybrid architecture with FAISS-indexed SBERT embedding representations... classifier-retriever fusion module... decision fusion aggregator

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: multilingual normalization... evolving dual-labeled knowledge base

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
