A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

Abrar Alotaibi; Moataz Ahmed; Raed Mughus

arxiv: 2606.25476 · v1 · pith:BEKQ5EY6new · submitted 2026-06-24 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-25 20:56 UTC · model grok-4.3

keywords: red teaming · LLM faithfulness · adversarial prompts · multi-role architecture · question answering · summarization · model safety · vulnerability evaluation

The pith

A three-model red teaming setup with target, attacker, and jury roles raises the attack success rate on question answering by up to 7.9%, exposing more unfaithful LLM responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a red teaming framework that uses three coordinated language models to find weaknesses in how LLMs stay faithful to given information. One model answers questions or summarizes text, a second generates adversarial prompts to trick it, and a third judges whether the answers stay accurate and consistent. In tests this raised the rate of successful attacks by as much as 7.9 percent on question-answering tasks and worked on both English and Arabic material. The results also indicate that how a model is built matters more for safety than simply increasing its size.

Core claim

The central claim is that a multi-role architecture of target, attacker, and jury models can systematically uncover vulnerabilities in LLM faithfulness, with exploitative adversarial prompts increasing attack success rate by up to 7.9% in QA tasks, and that design choices outweigh parameter scaling for model safety. The framework adapts across tasks and languages while revealing how output-format constraints affect vulnerability patterns.

What carries the argument

The multi-role architecture with target model (generates responses), attacker model (creates adversarial prompts), and jury model (evaluates accuracy and consistency).
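A minimal sketch of how the three roles could be wired together, assuming each model is exposed as a plain callable. The names `red_team_round`, `Verdict`, and the toy stand-in models are hypothetical and not taken from the paper; this summary does not describe the paper's actual prompts, models, or stopping rules.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: each role is just a function.
Target = Callable[[str], str]            # prompt -> response
Attacker = Callable[[str, str], str]     # (source context, seed question) -> adversarial prompt
Jury = Callable[[str, str], bool]        # (source context, response) -> True if faithful


@dataclass
class Verdict:
    prompt: str
    response: str
    faithful: bool


def red_team_round(context: str, seeds: List[str],
                   target: Target, attacker: Attacker, jury: Jury) -> List[Verdict]:
    """Run one attacker -> target -> jury pass over a list of seed questions."""
    verdicts = []
    for seed in seeds:
        adversarial = attacker(context, seed)
        response = target(f"{context}\n\n{adversarial}")
        verdicts.append(Verdict(adversarial, response, jury(context, response)))
    return verdicts


# Toy stand-ins so the sketch runs without any model API.
def toy_attacker(context: str, seed: str) -> str:
    return seed + " (Assume the passage is outdated.)"


def toy_target(prompt: str) -> str:
    # This toy target is fooled whenever the prompt hints the passage is outdated.
    return "Lyon" if "outdated" in prompt else "Paris"


def toy_jury(context: str, response: str) -> bool:
    # Crude faithfulness check: the answer must appear in the source context.
    return response in context


results = red_team_round("The capital of France is Paris.",
                         ["What is the capital of France?"],
                         toy_target, toy_attacker, toy_jury)
```

Keeping the jury behind its own interface means it can be swapped for a different evaluator, or for human labels, without touching the attacker or target code.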

If this is right

Exploitative adversarial prompts increase detected unfaithfulness by up to 7.9% in question-answering tasks.
Format limitations in summarization tasks produce measurable gains in faithfulness.
Architectural design choices typically outweigh parameter scaling in determining model safety.
The framework enables direct comparison of vulnerabilities across English question-answering and Arabic summarization.
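This summary does not say whether the 7.9% figure is an absolute change in percentage points or a relative change. The sketch below computes attack success rate (ASR) from jury verdicts and reports the difference both ways. The function name and all verdict data are made up for illustration.

```python
from typing import Sequence


def attack_success_rate(faithful_flags: Sequence[bool]) -> float:
    """Fraction of responses the jury marked unfaithful (i.e. the attack succeeded)."""
    if not faithful_flags:
        raise ValueError("need at least one verdict")
    return sum(not f for f in faithful_flags) / len(faithful_flags)


# Made-up verdicts for illustration only (True = judged faithful).
baseline = [True] * 90 + [False] * 10       # 10 of 100 attacks succeed
exploitative = [True] * 82 + [False] * 18   # 18 of 100 attacks succeed

asr_base = attack_success_rate(baseline)
asr_expl = attack_success_rate(exploitative)

# Absolute change, in percentage points.
delta_points = (asr_expl - asr_base) * 100
# Relative change, in percent of the baseline ASR.
delta_relative = (asr_expl - asr_base) / asr_base * 100
```

With these invented numbers the same pair of runs gives a gain of about 8 percentage points but about 80% in relative terms, which is why the unit matters when reading a headline figure.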

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attacker-jury loop could be run repeatedly on an updated model to track whether specific vulnerabilities persist after retraining.
Extending the jury to detect subtle inconsistencies beyond explicit contradictions would require additional evaluation signals.
The observed advantage of architecture over scale suggests that safety testing should prioritize controlled architectural variants rather than larger models alone.
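The first extension above, rerunning the same attacks against an updated model, could be tracked with a simple comparison of verdicts keyed by prompt. This is a sketch under assumptions: the function name, the prompt IDs, and the data are hypothetical, and it takes the jury verdicts at face value.

```python
from typing import Dict, Set


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, Set[str]]:
    """Classify prompt IDs by how their verdict changed between two model versions.

    Each dict maps a prompt ID to True if the attack succeeded
    (the response was judged unfaithful). Only prompts in both runs are compared.
    """
    shared = before.keys() & after.keys()
    return {
        "persistent": {p for p in shared if before[p] and after[p]},
        "fixed": {p for p in shared if before[p] and not after[p]},
        "regressed": {p for p in shared if not before[p] and after[p]},
    }


# Made-up verdicts for four hypothetical prompt IDs.
v1 = {"qa-001": True, "qa-002": True, "qa-003": False, "qa-004": False}
v2 = {"qa-001": True, "qa-002": False, "qa-003": True, "qa-004": False}

report = compare_runs(v1, v2)
```

Reporting regressions separately from fixes avoids the case where an unchanged overall rate hides a shift in which prompts succeed.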

Load-bearing premise

The jury model can rigorously and unbiasedly evaluate response accuracy and consistency across tasks and languages.

What would settle it

Human raters scoring the same set of target-model responses for faithfulness produce attack-success rates that differ substantially from the jury model's rates.
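One way to run that check is to compare human and jury labels item by item as well as in aggregate. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic. It is not a method described in the paper, and the labels are made up.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's label frequencies.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels (True = faithful) for the same eight responses.
human = [True, True, False, False, True, False, True, True]
jury = [True, True, False, True, True, False, True, False]

kappa = cohens_kappa(human, jury)
human_asr = sum(not h for h in human) / len(human)
jury_asr = sum(not j for j in jury) / len(jury)
```

In this invented example both raters produce the same attack success rate, yet they disagree on two of eight items and kappa is only moderate. Matching aggregate rates alone would therefore not validate the jury; item-level agreement is the stronger test.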

Original abstract

Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising target, attacker, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing weaknesses in reliability. The approach identifies how structural constraints in summarization can shape vulnerability patterns, with format limitations yielding measurable gains in faithfulness, and shows that architectural design choices typically outweigh parameter scaling in determining model safety. The framework's key strength is its adaptability across evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While it excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating adversarial prompt generation across languages. Our experiments also reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly across linguistic contexts. Overall, this architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models evolve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The 7.9% attack success rate gain rests on jury model verdicts that the paper itself says struggle with subtle and cross-lingual cases, with no external validation shown.

The main thing to know is that the central quantitative claim, a rise of up to 7.9% in attack success rate from exploitative prompts in QA, comes straight from the jury model's faithfulness judgments. The abstract flags that this jury has trouble spotting subtle unfaithfulness and handling cross-lingual cases, yet nothing in the text shows human agreement checks, inter-annotator metrics, or an ablation with another evaluator.

The new piece is the three-role setup with target, attacker, and jury models. The authors run it on English question-answering and Arabic summarization to compare vulnerabilities across languages and tasks. That cross-lingual reach is a practical step beyond single-language red teaming.

The work does a reasonable job showing how format constraints in summarization can produce measurable faithfulness gains and that design choices seem to matter more than raw parameter count for safety outcomes. The framework's flexibility for different tasks is a clear plus.

The soft spot is the one flagged at the top of this letter: the jury. The attack success rate and the architecture-over-scale conclusion both depend on it as the sole measurement tool. Without validation, those numbers are hard to trust. The abstract already calls out the jury's limits, but the experiments do not address them with additional evidence. No baselines, statistical tests, or controls are described in the available text either.

This is for researchers working on LLM safety evaluation who need reusable methods that handle multiple languages. Someone looking for architecture ideas could get value from the multi-role design, but anyone wanting reliable numbers would need the jury reliability fixed.

I would send it to peer review. The idea has enough substance to justify referee time, though the evaluation side needs strengthening.
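The letter notes that no statistical tests are described. As an illustration of the kind of check that could accompany an ASR difference, here is a minimal two-proportion z-test using only the standard library. The counts are made up, the function name is hypothetical, and the test only addresses sampling noise: it says nothing about whether the jury's labels are correct.

```python
import math
from typing import Tuple


def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both attack success rates are equal."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("test undefined when the pooled rate is 0 or 1")
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 100 of 1000 attacks succeed at baseline,
# 179 of 1000 succeed with exploitative prompts.
z, p = two_proportion_z(100, 1000, 179, 1000)
```

If the same seed questions are used in both conditions, a paired test such as McNemar's would be the more appropriate choice than this unpaired sketch.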

Referee Report

2 major / 1 minor

Summary. The paper introduces a red teaming framework for LLMs using a multi-role architecture (target, attacker, and jury models) to generate adversarial prompts and evaluate faithfulness. In a case study on question-answering and summarization tasks (including cross-lingual Arabic), it reports that exploitative adversarial prompts raise attack success rate by up to 7.9% in QA, that format constraints in summarization improve faithfulness, and that architectural design choices outweigh parameter scaling for model safety. The framework is presented as adaptable across tasks and languages but notes challenges in full automation and detecting subtle unfaithfulness.

Significance. If the jury-based measurements prove reliable, the work would offer a practical, extensible methodology for systematic vulnerability discovery in LLMs, with concrete evidence that prompt exploitation and structural constraints affect faithfulness more than scale alone. This could inform safety evaluation practices, especially for cross-lingual settings.

major comments (2)

[Abstract] Abstract and case-study results: the central quantitative finding (exploitative prompts increase ASR by up to 7.9% in QA) is computed exclusively from jury-model verdicts on target outputs. No human agreement study, inter-annotator metrics, or ablation replacing the jury with an alternative evaluator is described, despite the abstract explicitly flagging the jury's difficulties with subtle unfaithfulness and cross-lingual cases. This makes the reported delta and the architecture-vs-scale conclusion dependent on an unvalidated measurement instrument.
[Abstract] Abstract: the claim that 'architectural design choices typically outweigh parameter scaling in determining model safety' is presented without reference to specific model pairs, parameter counts, or controlled comparisons that isolate architecture from scale. The supporting data are not shown to be independent of the same jury judgments.

minor comments (1)

[Abstract] The abstract would be clearer if it named the concrete LLMs assigned to the target, attacker, and jury roles and stated the number of prompts or examples per task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, focusing on the evaluation methodology and abstract claims.

Referee: [Abstract] Abstract and case-study results: the central quantitative finding (exploitative prompts increase ASR by up to 7.9% in QA) is computed exclusively from jury-model verdicts on target outputs. No human agreement study, inter-annotator metrics, or ablation replacing the jury with an alternative evaluator is described, despite the abstract explicitly flagging the jury's difficulties with subtle unfaithfulness and cross-lingual cases. This makes the reported delta and the architecture-vs-scale conclusion dependent on an unvalidated measurement instrument.

Authors: We agree the reported 7.9% ASR increase and related conclusions rest on jury-model verdicts. The abstract already flags limitations in detecting subtle unfaithfulness and cross-lingual issues, and the framework is explicitly designed around automated jury evaluation rather than human annotation. No human agreement study or ablation is present because the work focuses on the multi-role automated pipeline. We will revise the manuscript to add an explicit statement in the abstract and a dedicated paragraph in the limitations section clarifying that all quantitative results derive from jury verdicts and discussing this as a methodological choice.

revision: partial

Referee: [Abstract] Abstract: the claim that 'architectural design choices typically outweigh parameter scaling in determining model safety' is presented without reference to specific model pairs, parameter counts, or controlled comparisons that isolate architecture from scale. The supporting data are not shown to be independent of the same jury judgments.

Authors: The claim is grounded in the experimental comparisons across model families presented in the results section. We will revise the abstract to include brief references to the specific model pairs and scale ranges used, while noting that the verdicts come from the jury component of the framework. Perfect isolation of architecture from scale is inherently difficult, but our controlled task setups hold other variables fixed; we will add a short clarifying sentence to this effect.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experimental outcomes

The paper describes an empirical red-teaming framework and case study reporting attack success rates (e.g., up to 7.9% increase) measured via a jury model. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. The central quantitative claims derive from reported experimental results rather than reducing to inputs by construction. The paper explicitly notes limitations in the jury (subtle unfaithfulness, cross-lingual cases), which is consistent with an externally falsifiable measurement approach rather than circularity. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the assumption that LLM-based attackers and juries can be used effectively for evaluation; no free parameters or invented physical entities are present.

axioms (2)

domain assumption: LLM-based attacker models can generate increasingly effective adversarial prompts that expose unfaithfulness.
This is required for the reported increase in attack success rate.
domain assumption: Jury models provide an objective measure of accuracy and consistency.
This underpins the measurement of the 7.9% gain and cross-task comparisons.

invented entities (1)

Multi-role architecture comprising target, attacker, and jury models · no independent evidence
purpose: To systematically uncover vulnerabilities in LLM outputs.
Presented as the core novel component of the framework.


arXiv 2023

Showing first 80 references.

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

[2] Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al.: Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

[3] Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems 36 (2023)

[4] Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G.: Red teaming language models with language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, 3419–3448 (2022) https://doi.org/10.18653/v1/2022.emnlp-main.225

[5] Derczynski, L., Galinkin, E., Martin, J., Majumdar, S., Inie, N.: garak: A framework for security probing large language models. arXiv preprint arXiv:2406.11036 (2024)

[6] Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al.: Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022)

[7] Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T.B., Song, D., Erlingsson, Ú., Oprea, A., Raffel, C.: Extracting training data from large language models. USENIX Security Symposium (2020)

[8] Wu, K., Wu, E., Zou, J.Y.: Clasheval: Quantifying the tug-of-war between an llm’s internal prior and external evidence. Advances in Neural Information Processing Systems 37, 33402–33422 (2024)

[9] Evans, R., Gao, L., Zhang, W., et al.: Hallucination in large language models: A survey of detection, attribution, and mitigation. arXiv preprint arXiv:2309.05922 (2023)

[10] Zhou, L., Zhang, N., Wang, Y.: Detecting hallucinated content in large language model outputs. arXiv preprint arXiv:2311.03274 (2023)

[11] Jabbar, M.S., Al-Azani, S., Alotaibi, A., Ahmed, M.: Red teaming large language models: A comprehensive review and critical analysis. Information Processing & Management 62(6), 104239 (2025) https://doi.org/10.1016/j.ipm.2025.104239

[12] Lin, Y., Hou, Y., Li, C., Gu, Y., Feng, C., Chen, W., Wang, W.: Trustgpt: A benchmark for trustworthy and responsible large language models. arXiv preprint arXiv:2306.11507 (2023)

[13] Ahmed, M., Ali, H.: Challenges in arabic-english cross-lingual language models. Journal of Artificial Intelligence Research 75, 45–78 (2024)

[14] Liu, Y., Deng, G., Li, Y., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y., Wang, H., Zheng, Y., Liu, Y.: Prompt injection attack against llm-integrated applications (2023)

[15] Kurita, K., Michel, P., Neubig, G.: Weight poisoning attacks on pre-trained models. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2793–2806 (2020) https://doi.org/10.18653/v1/2020.acl-main.249

[16] Krishna, K., Tomar, G.S., Parikh, A.P., Papernot, N., Iyyer, M.: Thieves on sesame street! model extraction of bert-based apis. 8th International Conference on Learning Representations, ICLR 2020 (2019)

[17] Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., Liu, Y.: Jailbreaking chatgpt via prompt engineering: An empirical study (2023)

[18] Wang, Z., Li, C., Zhang, T.: Multilingual security challenges in large language models. arXiv preprint arXiv:2310.08859 (2023)

[19] Zhu, S., Zhang, R., An, B., Wu, G., Barrow, J., Wang, Z., Huang, F., Nenkova, A., Sun, T.: Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140 (2023)

[20] Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A.: Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119 (2023)

[21] Ge, S., Zhou, C., Hou, R., Khabsa, M., Wang, Y.-C., Wang, Q., Han, J., Mao, Y.: Mart: Improving llm safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689 (2023)

[22] Pavlova, M., Brinkman, E., Iyer, K., Albiero, V., Bitton, J., Nguyen, H., Li, J., Ferrer, C.C., Evtimov, I., Grattafiori, A.: Automated red teaming with goat: the generative offensive agent tester. arXiv preprint arXiv:2410.01606 (2024)

[23] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al.: Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)

[24] Deng, Y., Zhang, W., Pan, S.J., Bing, L.: Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474 (2023)

[25] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240), 1–113 (2023)

[26] Sun, L., Huang, Y., Wang, H., Wu, S., Zhang, Q., Gao, C., Huang, Y., Lyu, W., Zhang, Y., Li, X., et al.: Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024)

[27] Ozkaya, I.: Application of large language models to software engineering tasks: Opportunities, risks, and implications. IEEE Software 40(3), 4–8 (2023)

[28] Siino, M., Tinnirello, I., Cascia, M.L.: From foundations to gpt in text classification: A comprehensive survey on current approaches and future trends. Foundations and Trends® in Information Retrieval 19(5), 557–711 (2025)

[29] Kurita, K., Michel, P., Neubig, G.: Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660 (2020)

[30] Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., Tramer, F.: Membership inference attacks from first principles. In: 2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914 (2022). IEEE

[31] Wallace, E., Zhao, T.Z., Feng, S., Singh, S.: Concealed data poisoning attacks on nlp models. arXiv preprint arXiv:2010.12563 (2020)

[32] Liu, F., Feng, Y., Xu, Z., Su, L., Ma, X., Yin, D., Liu, H.: Jailjudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv preprint arXiv:2410.12855 (2024)

[33] Yong, Z.-X., Menghini, C., Bach, S.H.: Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446 (2023)

[34] Zhao, Y., Wang, L., Chen, X.: Privacy concerns in multilingual language models. arXiv preprint arXiv:2311.12445 (2023)

[35] Shah, R., Pour, S., Tagade, A., Casper, S., Rando, J., et al.: Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348 (2023)

[36] Jia, X., Pang, T., Du, C., Huang, Y., Gu, J., Liu, Y., Cao, X., Lin, M.: Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018 (2024)

[37] Wang, Z., Shen, T., Huang, Z., Lu, H.: A survey on language model hallucination. arXiv preprint arXiv:2311.05232 (2023)

[38] Zhang, T., Wang, Y., Chen, H., et al.: Measuring and mitigating hallucination in summarization. In: Proceedings of EMNLP (2023)

[39] Kim, J.-W., Park, J., Cho, K.: Cross-lingual hallucination detection in large language models. arXiv preprint arXiv:2312.05209 (2023)

[40] Siino, M.: Brainllama at semeval-2024 task 6: Prompting llama to detect hallucinations and related observable overgeneration mistakes. In: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pp. 82–87 (2024)

[41] Lee, K., Kim, T., Park, C.: Translation-induced hallucination in multilingual models. arXiv preprint arXiv:2310.12534 (2023)

[42] Chang, M., Henderson, P., et al.: Cultural alignment and bias in large language models. In: Proceedings of ACL (2023)

[43] Wu, X., Zhang, C., Li, W.: Multilingual consistency in large language models. arXiv preprint arXiv:2311.07468 (2023)

[44] Zhao, L., Kumar, R.: Cross-cultural semantic preservation in multilingual language models. Computational Linguistics 50(1), 89–124 (2024)

[45] Wang, W., Tu, Z., Chen, C., Yuan, Y., Huang, J.-t., Jiao, W., Lyu, M.R.: All languages matter: On the multilingual safety of large language models. arXiv preprint arXiv:2310.00905 (2023)

[46] Wang, Y., Chen, H., et al.: Measuring factual consistency in large language model outputs. In: Findings of ACL (2023)

[47] Chen, Y., Liu, Y., Zhang, W.: Cross-lingual consistency checking for large language models. arXiv preprint arXiv:2312.09036 (2023)

[48] Thompson, S., Chen, D.: A taxonomy of hallucination patterns in large language models. arXiv preprint arXiv:2311.12024 (2023)

[49] Liu, X., Zhang, W., et al.: A systematic analysis of jailbreaking and response unfaithfulness in large language models. arXiv preprint arXiv:2311.09801 (2023)

[50] Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., Irving, G.: Red teaming language models with language models. arXiv preprint arXiv:2202.03286 (2022)

[51] Wichers, N., Denison, C., Beirami, A.: Gradient-based language model red teaming. arXiv preprint arXiv:2401.16656 (2024)

[52] Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M.: Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pp. 79–90 (2023)

[53] Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: Cold-attack: Jailbreaking llms with stealthiness and controllability. arXiv preprint arXiv:2402.08679 (2024)

[54] Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y., Chen, J., Huang, S.: A wolf in sheep's clothing: Generalized nested jailbreak prompts can fool large language models easily. arXiv preprint arXiv:2311.08268 (2023)

[55] Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., Lin, M.: Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288 (2024)

[56] Xu, H., Zhang, W., Wang, Z., Xiao, F., Zheng, R., Feng, Y., Ba, Z., Ren, K.: Redagent: Red teaming large language models with context-aware autonomous language agent. arXiv preprint arXiv:2407.16667 (2024)

[57] Jiang, B., Jing, Y., Shen, T., Wu, T., Yang, Q., Xiong, D.: Automated progressive red teaming. arXiv preprint arXiv:2407.03876 (2024)

[58] Aghakhani, H., Dai, W., Manoel, A., Fernandes, X., Kharkar, A., Kruegel, C., Vigna, G., Evans, D., Zorn, B., Sim, R.: Trojanpuzzle: Covertly poisoning code-suggestion models. arXiv preprint arXiv:2301.02344 (2023)

[59] Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Su, J., Duh, K., Carreras, X. (eds.) Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Association for Computational Linguistics, Austin, Texas (2016). https://doi.org/10.18653/v1/D16-1264

[60] Guellil, I., Saâdane, H., Azouaou, F., Gueni, B., Nouvel, D.: Arabic natural language processing: An overview. Journal of King Saud University-Computer and Information Sciences 33(5), 497–507 (2021)

[61] Bari, M.S., Alnumay, Y., Alzahrani, N.A., Alotaibi, N.M., Alyahya, H.A., AlRashed, S., Mirza, F.A., Alsubaie, S.Z., Alahmed, H.A., Alabduljabbar, G., et al.: Allam: Large language models for arabic and english. arXiv preprint arXiv:2407.15390 (2024)

[62] Maynez, J., Narayan, S., Bohnet, B., McDonald, R.: On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661 (2020)

[63] Kryściński, W., McCann, B., Xiong, C., Socher, R.: Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840 (2019)

[64] Nan, F., Nallapati, R., Wang, Z., Santos, C.N.d., Zhu, H., Zhang, D., McKeown, K., Xiang, B.: Entity-level factual consistency of abstractive text summarization. arXiv preprint arXiv:2102.09130 (2021)

[65] Laban, P., Schnabel, T., Bennett, P.N., Hearst, M.A.: Summac: Re-visiting nli-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics 10, 163–177 (2022)

[66] Zhong, M., Liu, Y., Yin, D., Mao, Y., Jiao, Y., Liu, P., Zhu, C., Ji, H., Han, J.: Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197 (2022)

[67] Clark, E., Rijhwani, S., Gehrmann, S., Maynez, J., Aharoni, R., Nikolaev, V., Sellam, T., Siddhant, A., Das, D., Parikh, A.P.: Seahorse: A multilingual, multifaceted dataset for summarization evaluation. arXiv preprint arXiv:2305.13194 (2023)

[68] Pagnoni, A., Balachandran, V., Tsvetkov, Y.: Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics. arXiv preprint arXiv:2104.13346 (2021)

[69] Peng, B., Bi, Z., Niu, Q., Liu, M., Feng, P., Wang, T., Yan, L.K., Wen, Y., Zhang, Y., Yin, C.H.: Jailbreaking and mitigation of vulnerabilities in large language models. arXiv preprint arXiv:2410.15236 (2024)

[70] Zou, A., Wang, Z., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

[71] Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., Liu, Y.: Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 (2023)

[72] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E.: Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419 (2023)

[73] Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., et al.: Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318 (2024)

[74] Sun, X., Zhang, D., Yang, D., Zou, Q., Li, H.: Multi-turn context jailbreak attack on large language models from first principles. arXiv preprint arXiv:2408.04686 (2024)

[75] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)

[76] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

[77] McKenzie, I.R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A., Prabhu, A., McLean, E., Kirtland, A., Ross, A., Liu, A., et al.: Inverse scaling: When bigger isn't better. arXiv preprint arXiv:2306.09479 (2023)

[78] Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y.: The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019)

[79] Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C., Socher, R.: Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858 (2019)

[80] Jiang, D., Ren, X., Lin, B.Y.: Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561 (2023)