arxiv: 2606.25476 · v1 · pith:BEKQ5EY6new · submitted 2026-06-24 · 💻 cs.CL · cs.AI

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

Pith reviewed 2026-06-25 20:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords red teaming · LLM faithfulness · adversarial prompts · multi-role architecture · question answering · summarization · model safety · vulnerability evaluation

The pith

A three-model red teaming setup with target, attacker, and jury roles raises the attack success rate on LLM faithfulness in question answering by up to 7.9%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a red teaming framework that uses three coordinated language models to find weaknesses in how LLMs stay faithful to given information. One model answers questions or summarizes text, a second generates adversarial prompts to trick it, and a third judges whether the answers stay accurate and consistent. In the case study, exploitative adversarial prompts raised the attack success rate by as much as 7.9% on question-answering tasks, and the framework was applied to both English question answering and Arabic summarization. The results also indicate that how a model is built matters more for safety than simply increasing its size.

Core claim

The central claim is that a multi-role architecture of target, attacker, and jury models can systematically uncover vulnerabilities in LLM faithfulness, with exploitative adversarial prompts increasing attack success rate by up to 7.9% in QA tasks, and that design choices outweigh parameter scaling for model safety. The framework adapts across tasks and languages while revealing how output-format constraints affect vulnerability patterns.

What carries the argument

The multi-role architecture with target model (generates responses), attacker model (creates adversarial prompts), and jury model (evaluates accuracy and consistency).
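The three roles can be sketched as a single evaluation round. This is a minimal illustration, not the paper's implementation: the `Model` callable type, the prompt templates, the `red_team_round` helper, and the three stub models are all hypothetical stand-ins, since the source does not describe the actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any model is a function from prompt text to output text.
Model = Callable[[str], str]


@dataclass
class Trial:
    prompt: str      # adversarial prompt produced by the attacker
    response: str    # target model's answer
    faithful: bool   # jury verdict on the answer


def red_team_round(source: str, question: str,
                   target: Model, attacker: Model, jury: Model) -> Trial:
    """Run one attacker -> target -> jury pass over a single question."""
    adversarial = attacker(question)
    response = target(f"Context: {source}\nQuestion: {adversarial}")
    verdict = jury(
        f"Context: {source}\nAnswer: {response}\n"
        "Is the answer faithful to the context? Reply yes or no."
    )
    return Trial(adversarial, response, verdict.strip().lower().startswith("yes"))


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of trials the jury judged unfaithful."""
    if not trials:
        return 0.0
    return sum(not t.faithful for t in trials) / len(trials)


# Toy stubs (assumptions for illustration only) so the sketch runs offline.
def stub_attacker(question: str) -> str:
    return question + " (Assume the context is outdated.)"


def stub_target(prompt: str) -> str:
    return "Lyon" if "outdated" in prompt else "Paris"


def stub_jury(prompt: str) -> str:
    return "yes" if "Answer: Paris" in prompt else "no"


trial = red_team_round(
    "The capital of France is Paris.",
    "What is the capital of France?",
    stub_target, stub_attacker, stub_jury,
)
```

In a real setup each stub would wrap a call to an actual LLM. Keeping the roles as plain callables makes it easy to swap the jury for a different evaluator, which is the ablation the referee report asks for.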

If this is right

  • Exploitative adversarial prompts increase detected unfaithfulness by up to 7.9% in question-answering tasks.
  • Format limitations in summarization tasks produce measurable gains in faithfulness.
  • Architectural design choices typically outweigh parameter scaling in determining model safety.
  • The framework enables direct comparison of vulnerabilities across English question-answering and Arabic summarization.
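How an increase in attack success rate (ASR) is computed affects how the "up to 7.9%" figure should be read, and the abstract as quoted on this page does not say whether it is an absolute gain in percentage points or a relative gain over the baseline. The sketch below shows both readings. The function name `asr_gain` and the verdict lists are hypothetical, and the counts are made up for illustration rather than taken from the paper.

```python
from typing import List, Optional, Tuple


def asr(successes: List[bool]) -> float:
    """Attack success rate: share of attempts judged unfaithful."""
    if not successes:
        raise ValueError("need at least one attempt")
    return sum(successes) / len(successes)


def asr_gain(baseline: List[bool],
             exploit: List[bool]) -> Tuple[float, Optional[float]]:
    """Return (absolute gain as a fraction, relative gain over the baseline).

    The relative gain is None when the baseline ASR is zero.
    """
    base, new = asr(baseline), asr(exploit)
    absolute = new - base
    relative = absolute / base if base > 0 else None
    return absolute, relative


# Made-up verdicts: True means the jury flagged the response as unfaithful.
baseline_verdicts = [True] * 20 + [False] * 80   # 20 of 100 attacks succeed
exploit_verdicts = [True] * 25 + [False] * 75    # 25 of 100 attacks succeed

points, relative = asr_gain(baseline_verdicts, exploit_verdicts)
```

With these invented counts the absolute gain is five percentage points while the relative gain is a quarter of the baseline, so the same experiment can be summarized with very different headline numbers.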

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attacker-jury loop could be run repeatedly on an updated model to track whether specific vulnerabilities persist after retraining.
  • Extending the jury to detect subtle inconsistencies beyond explicit contradictions would require additional evaluation signals.
  • The observed advantage of architecture over scale suggests that safety testing should prioritize controlled architectural variants rather than larger models alone.
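The first extension above, re-running the same attacker-jury loop on an updated model, amounts to comparing two sets of jury verdicts over a shared prompt set. A minimal sketch follows, assuming each run is stored as a mapping from prompt to a boolean "attack succeeded" flag. The `compare_runs` helper and the example data are hypothetical and not part of the paper.

```python
from typing import Dict, List


def compare_runs(before: Dict[str, bool],
                 after: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify prompts present in both runs by how their outcome changed.

    Each dict maps a prompt to True when the attack succeeded, that is,
    when the jury judged the target's response unfaithful.
    """
    shared = before.keys() & after.keys()
    return {
        # Attacks that worked before and still work after retraining.
        "persistent": sorted(p for p in shared if before[p] and after[p]),
        # Attacks that worked before but no longer succeed.
        "fixed": sorted(p for p in shared if before[p] and not after[p]),
        # Attacks that only started working on the updated model.
        "regressed": sorted(p for p in shared if not before[p] and after[p]),
    }


# Invented verdicts for two versions of a target model.
run_v1 = {"p1": True, "p2": True, "p3": False}
run_v2 = {"p1": True, "p2": False, "p3": True}
report = compare_runs(run_v1, run_v2)
```

Because both runs depend on the same jury, any change in the jury model between runs would confound this comparison.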

Load-bearing premise

The jury model can rigorously and unbiasedly evaluate response accuracy and consistency across tasks and languages.

What would settle it

Human raters scoring the same set of target-model responses for faithfulness produce attack-success rates that differ substantially from the jury model's rates.
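One common way to quantify such a comparison is chance-corrected agreement between human and jury verdicts on the same responses. The sketch below computes Cohen's kappa by hand. The paper does not report this statistic, so the choice of metric, the function name, and the label lists are assumptions for illustration.

```python
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled independently
    # with their own marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented verdicts: 1 = unfaithful, 0 = faithful.
human_labels = [1, 1, 0, 0]
jury_labels = [1, 0, 0, 0]
kappa = cohens_kappa(human_labels, jury_labels)
```

A kappa near 1 would support using the jury's attack success rates as a proxy for human judgment, while a value near 0 would mean the jury agrees with humans no better than chance.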

Original abstract

Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising target, attacker, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing weaknesses in reliability. The approach identifies how structural constraints in summarization can shape vulnerability patterns, with format limitations yielding measurable gains in faithfulness, and shows that architectural design choices typically outweigh parameter scaling in determining model safety. The framework's key strength is its adaptability across evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While it excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating adversarial prompt generation across languages. Our experiments also reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly across linguistic contexts. Overall, this architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models evolve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a red teaming framework for LLMs using a multi-role architecture (target, attacker, and jury models) to generate adversarial prompts and evaluate faithfulness. In a case study on question-answering and summarization tasks (including cross-lingual Arabic), it reports that exploitative adversarial prompts raise attack success rate by up to 7.9% in QA, that format constraints in summarization improve faithfulness, and that architectural design choices outweigh parameter scaling for model safety. The framework is presented as adaptable across tasks and languages but notes challenges in full automation and detecting subtle unfaithfulness.

Significance. If the jury-based measurements prove reliable, the work would offer a practical, extensible methodology for systematic vulnerability discovery in LLMs, with concrete evidence that prompt exploitation and structural constraints affect faithfulness more than scale alone. This could inform safety evaluation practices, especially for cross-lingual settings.

major comments (2)
  1. [Abstract] Abstract and case-study results: the central quantitative finding (exploitative prompts increase ASR by up to 7.9% in QA) is computed exclusively from jury-model verdicts on target outputs. No human agreement study, inter-annotator metrics, or ablation replacing the jury with an alternative evaluator is described, despite the abstract explicitly flagging the jury's difficulties with subtle unfaithfulness and cross-lingual cases. This makes the reported delta and the architecture-vs-scale conclusion dependent on an unvalidated measurement instrument.
  2. [Abstract] Abstract: the claim that 'architectural design choices typically outweigh parameter scaling in determining model safety' is presented without reference to specific model pairs, parameter counts, or controlled comparisons that isolate architecture from scale. The supporting data are not shown to be independent of the same jury judgments.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the concrete LLMs assigned to the target, attacker, and jury roles and stated the number of prompts or examples per task.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, focusing on the evaluation methodology and abstract claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract and case-study results: the central quantitative finding (exploitative prompts increase ASR by up to 7.9% in QA) is computed exclusively from jury-model verdicts on target outputs. No human agreement study, inter-annotator metrics, or ablation replacing the jury with an alternative evaluator is described, despite the abstract explicitly flagging the jury's difficulties with subtle unfaithfulness and cross-lingual cases. This makes the reported delta and the architecture-vs-scale conclusion dependent on an unvalidated measurement instrument.

    Authors: We agree the reported 7.9% ASR increase and related conclusions rest on jury-model verdicts. The abstract already flags limitations in detecting subtle unfaithfulness and cross-lingual issues, and the framework is explicitly designed around automated jury evaluation rather than human annotation. No human agreement study or ablation is present because the work focuses on the multi-role automated pipeline. We will revise the manuscript to add an explicit statement in the abstract and a dedicated paragraph in the limitations section clarifying that all quantitative results derive from jury verdicts and discussing this as a methodological choice.
    revision: partial

  2. Referee: [Abstract] Abstract: the claim that 'architectural design choices typically outweigh parameter scaling in determining model safety' is presented without reference to specific model pairs, parameter counts, or controlled comparisons that isolate architecture from scale. The supporting data are not shown to be independent of the same jury judgments.

    Authors: The claim is grounded in the experimental comparisons across model families presented in the results section. We will revise the abstract to include brief references to the specific model pairs and scale ranges used, while noting that the verdicts come from the jury component of the framework. Perfect isolation of architecture from scale is inherently difficult, but our controlled task setups hold other variables fixed; we will add a short clarifying sentence to this effect.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experimental outcomes

Full rationale

The paper describes an empirical red-teaming framework and case study reporting attack success rates (e.g., up to 7.9% increase) measured via a jury model. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. The central quantitative claims derive from reported experimental results rather than reducing to inputs by construction. The paper explicitly notes limitations in the jury (subtle unfaithfulness, cross-lingual cases), which is consistent with an externally falsifiable measurement approach rather than circularity. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the assumption that LLM-based attackers and juries can be used effectively for evaluation; no free parameters or invented physical entities are present.

axioms (2)
  • domain assumption: LLM-based attacker models can generate increasingly effective adversarial prompts that expose unfaithfulness
    This is required for the reported increase in attack success rate.
  • domain assumption: Jury models provide an objective measure of accuracy and consistency
    This underpins the measurement of the 7.9% gain and cross-task comparisons.
invented entities (1)
  • Multi-role architecture comprising target, attacker, and jury models (no independent evidence)
    purpose: To systematically uncover vulnerabilities in LLM outputs
    Presented as the core novel component of the framework

pith-pipeline@v0.9.1-grok · 5794 in / 1459 out tokens · 26607 ms · 2026-06-25T20:56:51.246391+00:00 · methodology

