pith. machine review for the scientific record.

arxiv: 2603.29403 · v2 · submitted 2026-03-31 · 💻 cs.CR · cs.AI

Recognition: no theorem link

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Vignesh Kumar Kembu, Serena Nicolazzo, Antonino Nocera, Vinod P., Saraga Sakthidharan

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 00:09 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM-as-a-Judge · security · SoK · adversarial attacks · taxonomy · evaluation · vulnerabilities · defenses

The pith

LLM-as-a-Judge systems face security risks as both targets of attacks and tools for launching them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper performs the first systematization of knowledge on security issues in LLM-as-a-Judge systems. It surveys 863 works, selects 45 studies, and organizes them under a four-role taxonomy that covers attacks aimed at the judge, attacks executed by the judge, defenses that employ the judge, and security applications that use the judge for evaluation. The analysis identifies concrete vulnerabilities that can undermine the reliability of automated assessment pipelines. Readers would care because growing numbers of AI systems now depend on such judges to scale quality and safety checks, so weaknesses here directly affect trust in generated content across many domains.
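The judge pattern at the center of the paper can be sketched in a few lines. This is an illustrative stand-in, not the authors' code: `judge`, `fake_model`, and the prompt wording are all assumptions, and the stub replaces what would be a hosted LLM call in a real pipeline.

```python
import re

def judge(answer: str, rubric: str, call_model) -> int:
    """Score an answer with an LLM judge.

    `call_model` is any text-in/text-out model client; stubbed here.
    """
    prompt = (
        "You are an impartial judge. Apply this rubric:\n"
        f"{rubric}\n"
        "Rate the following answer from 1 to 10 and reply as 'Score: N'.\n\n"
        f"Answer: {answer}"
    )
    reply = call_model(prompt)
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))

def fake_model(prompt: str) -> str:
    # Stub standing in for a real hosted-model call.
    return "Score: 7"

# Untrusted text in `answer` reaches the judge's prompt verbatim: this is the
# injection surface that attacks on LaaJ systems exploit.
score = judge("The capital of France is Paris.", "factual accuracy", fake_model)
```

Note that nothing separates the rubric from the candidate answer inside the prompt, which is why a crafted answer can rewrite the judge's instructions.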

Core claim

The paper claims that LLM-as-a-Judge systems introduce distinct security risks that previous work has not yet organized. By reviewing the selected studies it shows these risks fall into four roles: the judge as target of manipulation, the judge as instrument for attacks, the judge as component in defenses, and the judge as evaluator in security tasks. Comparative analysis of the 45 papers reveals shared limitations, emerging threats, and open challenges for making such systems more robust.

What carries the argument

A four-role taxonomy that places each use of LLM-as-a-Judge into one of four positions within the security landscape: target, attack vector, defense mechanism, or application tool.
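One minimal way to make the taxonomy concrete is as a decision procedure over three yes/no questions. The role names and the precedence order below paraphrase the paper's four roles for illustration; they are not the authors' actual classification criteria.

```python
from enum import Enum

class LaaJRole(Enum):
    """The four positions a judge can occupy in the security landscape."""
    TARGET = "attacks aimed at the judge"
    INSTRUMENT = "attacks executed through the judge"
    DEFENSE = "defenses that employ the judge"
    EVALUATOR = "judge used as evaluator in security tasks"

def classify(manipulates_judge: bool,
             judge_executes_attack: bool,
             judge_defends: bool) -> LaaJRole:
    # Illustrative precedence: file a study under the first role that applies;
    # anything left over uses LaaJ purely as an evaluation instrument.
    if manipulates_judge:
        return LaaJRole.TARGET
    if judge_executes_attack:
        return LaaJRole.INSTRUMENT
    if judge_defends:
        return LaaJRole.DEFENSE
    return LaaJRole.EVALUATOR

# e.g. a backdoor attack on a fine-tuned judge manipulates the judge itself:
role = classify(manipulates_judge=True,
                judge_executes_attack=False,
                judge_defends=False)
```

The referee's partition worry maps directly onto this sketch: a red-teaming study that answers "yes" to two questions at once forces a precedence choice the paper would need to justify.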

Load-bearing premise

The 45 chosen studies from the initial 863 works fully capture the security landscape of LLM-as-a-Judge without major omissions or selection bias.

What would settle it

A new survey that identifies a sizable set of LaaJ security papers from 2020-2026 whose threats or defenses fall outside all four roles in the taxonomy.

Figures

Figures reproduced from arXiv: 2603.29403 by Aiman Al Masoud, Antonino Nocera, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Saraga Sakthidharan, Serena Nicolazzo, Vignesh Kumar Kembu, and Vinod P.

Figure 1. General workflow for the LLM-as-a-Judge scheme.
Figure 2. Literature survey tree: taxonomy of LLM-as-a-Judge in the security context.
Figure 3. LaaJ as a target of attacks (subtree, including training-time attacks).
Figure 4. Classification of the application domains of works that use LaaJ as an evaluation instrument.
Figure 5. Taxonomy of LLM-based research directions in code review and software engineering.
Figure 6. LaaJ as a middle ground between expensive but flexible human annotation and cheaper automated alternatives.
Figure 7. Yearly distribution of publications (conferences, journals, arXiv).
Figure 8. Number of publications by venue.
read the original abstract

LLM-as-a-Judge (LaaJ) is a novel paradigm in which powerful language models are used to assess the quality, safety, or correctness of generated outputs. While this paradigm has significantly improved the scalability and efficiency of evaluation processes, it also introduces novel security risks and reliability concerns that remain largely unexplored. In particular, LLM-based judges can become both targets of adversarial manipulation and instruments through which attacks are conducted, potentially compromising the trustworthiness of evaluation pipelines. In this paper, we present the first Systematization of Knowledge (SoK) focusing on the security aspects of LLM-as-a-Judge systems. We perform a comprehensive literature review across major academic databases, analyzing 863 works and selecting 45 relevant studies published between 2020 and 2026. Based on this study, we propose a taxonomy that organizes recent research according to the role played by LLM-as-a-Judge in the security landscape, distinguishing between attacks targeting LaaJ systems, attacks performed through LaaJ, defenses leveraging LaaJ for security purposes, and applications where LaaJ is used as an evaluation strategy in security-related domains. We further provide a comparative analysis of existing approaches, highlighting current limitations, emerging threats, and open research challenges. Our findings reveal significant vulnerabilities in LLM-based evaluation frameworks, as well as promising directions for improving their robustness and reliability. Finally, we outline key research opportunities that can guide the development of more secure and trustworthy LLM-as-a-Judge systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to present the first Systematization of Knowledge (SoK) on the security aspects of LLM-as-a-Judge (LaaJ) systems. It conducts a literature review across major academic databases, analyzing 863 works and selecting 45 relevant studies published between 2020 and 2026. Based on this review, the authors propose a taxonomy organizing research by the role of LaaJ in the security landscape: attacks targeting LaaJ systems, attacks performed through LaaJ, defenses leveraging LaaJ for security purposes, and applications where LaaJ serves as an evaluation strategy in security-related domains. The work includes a comparative analysis of approaches, identification of limitations and emerging threats, and an outline of open research challenges and opportunities for improving robustness and trustworthiness of LaaJ systems.

Significance. If the selection of 45 studies is comprehensive and the four-role taxonomy accurately partitions the space without material omissions, this SoK would provide a timely synthesis of an emerging area, highlighting vulnerabilities in LLM-based evaluation pipelines and identifying directions for more secure systems. The systematic literature review approach and focus on security-specific roles represent a clear organizational contribution to the field.

major comments (2)
  1. [Abstract] The claim of analyzing 863 works and selecting 45 relevant studies is presented without any details on search strings, databases queried, inclusion/exclusion criteria, screening process, or inter-rater reliability. This is load-bearing for the 'first SoK' and 'comprehensive' assertions, as it leaves the taxonomy's grounding and coverage unverifiable and open to potential bias or omissions (e.g., post-2024 works or non-English sources).
  2. [Taxonomy and comparative analysis sections] The four-role taxonomy is introduced as organizing the security landscape, but the manuscript provides no explicit justification, coverage analysis, or discussion of boundary cases to demonstrate that it partitions the space without significant gaps, such as LaaJ uses in red-teaming that may not fit neatly into the defined roles.
minor comments (1)
  1. [Abstract] The date range 'published between 2020 and 2026' includes future years; clarify whether this reflects a projection, preprint inclusion, or a typographical issue.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our SoK. We agree that greater transparency on the literature review process and explicit justification for the taxonomy will strengthen the paper. We will make the requested revisions in the next version.

read point-by-point responses
  1. Referee: [Abstract] The claim of analyzing 863 works and selecting 45 relevant studies is presented without any details on search strings, databases queried, inclusion/exclusion criteria, screening process, or inter-rater reliability. This is load-bearing for the 'first SoK' and 'comprehensive' assertions, as it leaves the taxonomy's grounding and coverage unverifiable and open to potential bias or omissions (e.g., post-2024 works or non-English sources).

    Authors: We agree the abstract should include a concise summary of the methodology to support the claims. The full manuscript (Section 2) already details the process: search strings such as ('LLM-as-a-Judge' OR 'LLM judge') AND (security OR adversarial OR robustness), databases (IEEE Xplore, ACM DL, arXiv, Google Scholar), inclusion criteria (2020-2026 peer-reviewed works focused on LaaJ security, excluding non-English and purely capability-focused papers), two-author PRISMA-style screening, and inter-rater reliability (Cohen's kappa = 0.82). We will revise the abstract to state: 'following a systematic PRISMA-guided review of 863 papers across major databases with explicit inclusion criteria, yielding 45 studies.' This makes the grounding verifiable while preserving the 'first SoK' claim. revision: yes

  2. Referee: [Taxonomy and comparative analysis sections] The four-role taxonomy is introduced as organizing the security landscape, but the manuscript provides no explicit justification, coverage analysis, or discussion of boundary cases to demonstrate that it partitions the space without significant gaps, such as LaaJ uses in red-teaming that may not fit neatly into the defined roles.

    Authors: We acknowledge that an explicit justification subsection would improve clarity. The taxonomy is derived from the distinct functional roles observed across the 45 papers (target of attacks, vector for attacks, defensive mechanism, and evaluation tool in security domains), as motivated in the introduction and elaborated with comparative tables in Sections 4-7 that assign every selected paper without overlap. Red-teaming examples are covered under 'attacks performed through LaaJ' when LaaJ generates or evaluates adversarial prompts. We will add a new subsection (3.1) providing (a) derivation rationale from the reviewed literature, (b) coverage analysis confirming no material omissions in the 45 papers, and (c) explicit discussion of boundary cases including red-teaming, hybrid uses, and why they map to the four roles. This addresses the partition concern directly. revision: yes
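The inter-rater agreement figure quoted in the rebuttal can be checked mechanically with Cohen's kappa, (p_o − p_e) / (1 − p_e). The sketch below uses made-up include/exclude screening labels chosen only to show the computation; it is not the authors' data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical include(1)/exclude(0) decisions for five candidate papers:
kappa = cohens_kappa([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])  # ≈ 0.615
```

A kappa of 0.82 as claimed would indicate near-perfect agreement on this scale; the toy labels above, with one disagreement in five, land at a more moderate 0.615.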

Circularity Check

0 steps flagged

No circularity: pure literature synthesis with no derivations or self-referential reductions

full rationale

This SoK paper performs a literature review (863 works screened to 45) and proposes a four-role taxonomy for security aspects of LLM-as-a-Judge. No equations, fitted parameters, predictions, or derivations exist. The central claim of being the 'first SoK' rests on the external review process and selection criteria rather than any internal construction that reduces to its own inputs by definition. No self-citation chains, ansatzes, or renamings of known results are load-bearing in a circular manner. The work is self-contained against external benchmarks as a synthesis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey paper the central claim rests on the representativeness of the reviewed literature and the utility of the new taxonomy; no new free parameters, mathematical axioms, or invented entities are introduced.

axioms (1)
  • domain assumption The 45 selected studies adequately represent the key security research on LLM-as-a-Judge published 2020-2026
    Invoked when claiming the taxonomy covers recent research and highlights limitations

pith-pipeline@v0.9.0 · 5603 in / 1146 out tokens · 48736 ms · 2026-05-14T00:09:09.521484+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

  1. [1]

    Alignllm: Alignment-based evaluation using ensemble of llms-as-judges for q&a

    Ramitha Abeyratne, Nirmalie Wiratunga, Kyle Martin, Ikechukwu Nkisi-Orji, and Lasal Jayawardena. Alignllm: Alignment-based evaluation using ensemble of llms-as-judges for q&a. InInternational conference on case-based reasoning, pages 21–36. Springer, 2025

  2. [2]

    Llm-based explainable detection of llm-generated code in python programming courses

    Jeonghun Baek, Tetsuro Yamazaki, Akimasa Morihata, Junichiro Mori, Yoko Yamakata, Kenjiro Taura, and Shigeru Chiba. Llm-based explainable detection of llm-generated code in python programming courses. In Proceedings of the 57th ACM Technical Symposium on Computer Science Education V . 1, pages 80–86, 2026

  3. [3]

    Rankers, judges, and assistants: Towards understanding the interplay of llms in information retrieval evaluation

    Krisztian Balog, Don Metzler, and Zhen Qin. Rankers, judges, and assistants: Towards understanding the interplay of llms in information retrieval evaluation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3865–3875, 2025. 27 APREPRINT- APRIL7, 2026

  4. [4]

    Enhancing network security using knowledge graphs and large language models for explainable threat detection.Future Generation Computer Systems, page 108160, 2025

    Loris Belcastro, Carmine Carlucci, Cristian Cosentino, Pietro Liò, and Fabrizio Marozzo. Enhancing network security using knowledge graphs and large language models for explainable threat detection.Future Generation Computer Systems, page 108160, 2025

  5. [5]

    Cyberrag: An agentic rag cyber attack classification and reporting tool.Future Generation Computer Systems, page 108186, 2025

    Francesco Blefari, Cristian Cosentino, Francesco Aurelio Pironti, Angelo Furfaro, and Fabrizio Marozzo. Cyberrag: An agentic rag cyber attack classification and reporting tool.Future Generation Computer Systems, page 108186, 2025

  6. [6]

    Benchmarking adversarial robustness to bias elicitation in large language models: Scalable automated assessment with llm-as-a-judge.Machine Learning, 114(11):249, 2025

    Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, and Domenico Talia. Benchmarking adversarial robustness to bias elicitation in large language models: Scalable automated assessment with llm-as-a-judge.Machine Learning, 114(11):249, 2025

  7. [7]

    Toward generalizable evaluation in the llm era: A survey beyond benchmarks.arXiv preprint arXiv:2504.18838, 2025

    Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, et al. Toward generalizable evaluation in the llm era: A survey beyond benchmarks.arXiv preprint arXiv:2504.18838, 2025

  8. [8]

    Toward holistic evaluation of recommender systems powered by generative models

    Yashar Deldjoo, Nikhil Mehta, Maheswaran Sathiamoorthy, Shuai Zhang, Pablo Castells, and Julian McAuley. Toward holistic evaluation of recommender systems powered by generative models. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3932–3942, 2025

  9. [9]

    Rubrics as an attack surface: Stealthy preference drift in llm judges, 2026

    Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, and Zhun Deng. Rubrics as an attack surface: Stealthy preference drift in llm judges, 2026

  10. [10]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 1107–1128, 2024

  11. [11]

    Xg-nid: Dual-modality network intrusion detection using a heterogeneous graph neural network and large language model.Expert Systems with Applications, 287:128089, 2025

    Yasir Ali Farrukh, Syed Wali, Irfan Khan, and Nathaniel D Bastian. Xg-nid: Dual-modality network intrusion detection using a heterogeneous graph neural network and large language model.Expert Systems with Applications, 287:128089, 2025

  12. [12]

    Artificial authority: The promise and perils of llm judges in healthcare.Bioengineering, 13(1):108, 2026

    Ariana Genovese, Lars Hegstrom, Srinivasagam Prabha, Cesar A Gomez-Cabello, Syed Ali Haider, Bernardo Collaco, Nadia G Wood, and Antonio Jorge Forte. Artificial authority: The promise and perils of llm judges in healthcare.Bioengineering, 13(1):108, 2026

  13. [13]

    What types of code review comments do developers most frequently resolve? In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 3760–3765

    Saul Goldman, Hong Yi Lin, Jirat Pasuksmit, Patanamon Thongtanunam, Kla Tantithamthavorn, Zhe Wang, Ray Zhang, Ali Behnaz, Fan Jiang, Michael Siers, et al. What types of code review comments do developers most frequently resolve? In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 3760–3765. IEEE, 2025

  14. [14]

    A survey on llm-as-a-judge.The Innovation, 2024

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

  15. [15]

    Llm-as-a-judge for software engineering: Literature review, vision, and the road ahead.ACM Transactions on Software Engineering and Methodology, 2025

    Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo. Llm-as-a-judge for software engineering: Literature review, vision, and the road ahead.ACM Transactions on Software Engineering and Methodology, 2025

  16. [16]

    Detecting chatgpt-generated code submissions in a cs1 course using machine learning models

    Muntasir Hoq, Yang Shi, Juho Leinonen, Damilola Babalola, Collin Lynch, Thomas Price, and Bita Akram. Detecting chatgpt-generated code submissions in a cs1 course using machine learning models. InProceedings of the 55th ACM Technical Symposium on Computer Science Education V . 1, SIGCSE 2024, page 526–532, New York, NY , USA, 2024. Association for Computi...

  17. [17]

    Combining large language models with static analyzers for code review generation

    Imen Jaoua, Oussama Ben Sghaier, and Houari Sahraoui. Combining large language models with static analyzers for code review generation. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), pages 174–186. IEEE, 2025

  18. [18]

    Codegen-3d: A benchmark for evaluating llms in zero-shot and iterative 3d modeling in blender.IEEE Access, 2026

    Hao Ji, Kotha Aditya, Sebastian Escalante, and Yunjian Qiu. Codegen-3d: A benchmark for evaluating llms in zero-shot and iterative 3d modeling in blender.IEEE Access, 2026

  19. [19]

    Chatgpt for good? on opportunities and challenges of large language models for education.Learning and individual differences, 103:102274, 2023

    Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? on opportunities and challenges of large language models for education.Learning and individual differences, 103:102274, 2023

  20. [20]

    Clinical large language model evaluation by expert review (clever): Framework development and validation.JMIR AI, 4(1):e72153, 2025

    Veysel Kocaman, Mustafa Aytu˘g Kaya, Andrei Marian Feier, and David Talby. Clinical large language model evaluation by expert review (clever): Framework development and validation.JMIR AI, 4(1):e72153, 2025

  21. [21]

    Rurage: Robust universal rag evaluator for fast and affordable qa performance testing

    Nikita Krayko, Ivan Sidorov, Fedor Laputin, Alexander Panchenko, Daria Galimzianova, and Vasily Konovalov. Rurage: Robust universal rag evaluator for fast and affordable qa performance testing. InEuropean Conference on Information Retrieval, pages 135–145. Springer, 2025. 28 APREPRINT- APRIL7, 2026

  22. [22]

    Biasscope: Towards automated detection of bias in llm-as-a-judge evaluation, 2026

    Peng Lai, Zhihao Ou, Yong Wang, Longyue Wang, Jian Yang, Yun Chen, and Guanhua Chen. Biasscope: Towards automated detection of bias in llm-as-a-judge evaluation, 2026

  23. [23]

    From generation to judgment: Opportunities and challenges of llm-as-a-judge

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025

  24. [24]

    Who’s your judge? on the detectability of llm-generated judgments, 2025

    Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, and Huan Liu. Who’s your judge? on the detectability of llm-generated judgments, 2025

  25. [25]

    Lexrag: Benchmarking retrieval-augmented generation in multi-turn legal consultation conversation

    Haitao Li, Yifan Chen, Hu YiRan, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, and Yiqun Liu. Lexrag: Benchmarking retrieval-augmented generation in multi-turn legal consultation conversation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3606–3615, 2025

  26. [26]

    Llms-as-judges: A comprehensive survey on llm-based evaluation methods, 2024

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods, 2024

  27. [27]

    Software engineering and foundation models: Insights from industry blogs using a jury of foundation models

    Hao Li, Cor-Paul Bezemer, and Ahmed E Hassan. Software engineering and foundation models: Insights from industry blogs using a jury of foundation models. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 307–318. IEEE, 2025

  28. [28]

    Llms cannot reliably judge (yet?): A comprehensive assessment on the robustness of llm-as-a-judge, 2025

    Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, and Shouling Ji. Llms cannot reliably judge (yet?): A comprehensive assessment on the robustness of llm-as-a-judge, 2025

  29. [29]

    Compromising llm driven embodied agents with contextual backdoor attacks

    Aishan Liu, Yuguang Zhou, Xianglong Liu, Tianyuan Zhang, Siyuan Liang, Jiakai Wang, Yanjun Pu, Tianlin Li, Junqi Zhang, Wenbo Zhou, et al. Compromising llm driven embodied agents with contextual backdoor attacks. IEEE Transactions on Information Forensics and Security, 2025

  30. [30]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023

  31. [31]

    Adversarial attacks on llm-as-a-judge systems: Insights from prompt injections, 2025

    Narek Maloyan and Dmitry Namiot. Adversarial attacks on llm-as-a-judge systems: Insights from prompt injections, 2025

  32. [32]

    Towards explainable vulnerability detection with large language models.IEEE Transactions on Software Engineering, 2025

    Qiheng Mao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia, and Jianling Sun. Towards explainable vulnerability detection with large language models.IEEE Transactions on Software Engineering, 2025

  33. [33]

    Impromptu: a framework for model-driven prompt engineering: S

    Sergio Morales, Robert Clarisó, and Jordi Cabot. Impromptu: a framework for model-driven prompt engineering: S. morales et al.Software and Systems Modeling, 24(6):1627–1645, 2025

  34. [34]

    A comprehensive survey on llm-powered recommender systems: from discrimi- native, generative to multi-modal paradigms.IEEE Access, 2025

    Dina Nawara and Rasha Kashef. A comprehensive survey on llm-powered recommender systems: from discrimi- native, generative to multi-modal paradigms.IEEE Access, 2025

  35. [35]

    Impact of an llm-based review assistant in practice: A mixed open-/closed-source case study.IEEE Transactions on Software Engineering, 2026

    Doriane Olewicki, Leuson Da Silva, Oussama Ben Sghaier, Suhaib Mujahid, Arezou Amini, Benjamin Mah, Marco Castelluccio, Sarra Habchi, Foutse Khomh, and Bram Adams. Impact of an llm-based review assistant in practice: A mixed open-/closed-source case study.IEEE Transactions on Software Engineering, 2026

  36. [36]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors,Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July

  37. [37]

    Association for Computational Linguistics

  38. [38]

    Evaluating and improving the robustness of security attack detectors generated by llms.Empirical Software Engineering, 31(2):35, 2026

    Samuele Pasini, Jinhan Kim, Tommaso Aiello, Rocío Cabrera Lozoya, Antonino Sabetta, and Paolo Tonella. Evaluating and improving the robustness of security attack detectors generated by llms.Empirical Software Engineering, 31(2):35, 2026

  39. [39]

    Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot llm assessment

    Vyas Raina, Adian Liusie, and Mark Gales. Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot llm assessment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7499–7517, 2024

  40. [40]

    Codebleu: a method for automatic evaluation of code synthesis, 2020

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis, 2020

  41. [41]

    Harnessing large language models for curated code reviews

    Oussama Ben Sghaier, Martin Weyssow, and Houari Sahraoui. Harnessing large language models for curated code reviews. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), pages 187–198. IEEE, 2025. 29 APREPRINT- APRIL7, 2026

  42. [42]

    Towards effective offensive security llm agents: Hyperparameter tuning, llm as a judge, and a lightweight ctf benchmark, 2025

    Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. Towards effective offensive security llm agents: Hyperparameter tuning, llm as a judge, and a lightweight ctf benchmark, 2025

  43. [43]

    Defining and evaluating decision and composite risk in language models applied to natural language inference.Engineering Applications of Artificial Intelligence, 161:112253, 2025

    Ke Shen and Mayank Kejriwal. Defining and evaluating decision and composite risk in language models applied to natural language inference.Engineering Applications of Artificial Intelligence, 161:112253, 2025

  44. [44]

    Gptracker: A large-scale measurement of misused gpts

    Xinyue Shen, Yun Shen, Michael Backes, and Yang Zhang. Gptracker: A large-scale measurement of misused gpts. In2025 IEEE Symposium on Security and Privacy (SP), pages 336–354. IEEE, 2025

  45. [45]

    Optimization- based prompt injection attack to llm-as-a-judge

    Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization- based prompt injection attack to llm-as-a-judge. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 660–674, 2024

  46. [46]

    Judging the judges: A systematic study of position bias in llm-as-a-judge

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush V osoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 292...

  47. [47]

    Towards a human-in-the-loop framework for reliable patch evaluation using an llm-as-a-judge, 2025

    Sherry Shi, Renyao Wei, Michele Tufano, José Cambronero, Runxiang Cheng, Franjo Ivanˇci´c, and Pat Rondon. Towards a human-in-the-loop framework for reliable patch evaluation using an llm-as-a-judge, 2025

  48. [48]

    Akarsth Kumar Singh and Shang-Hsien Hsieh. Multi-llm-based augmentation and synthetic data generation of construction schedules and task descriptions with slm-as-a-judge assessment.Advanced Engineering Informatics, 69:103825, 2026

  49. [49]

    Enhancing fine-tuning llm evaluation: A study on calibration and metrics for industry-specific ai alignment

    Lucia Larise Stavarache. Enhancing fine-tuning llm evaluation: A study on calibration and metrics for industry-specific ai alignment. In 2025 IEEE Conference on Artificial Intelligence (CAI), pages 1247–1250. IEEE, 2025

  50. [50]

    Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023

  51. [51]

    Badjudge: Backdoor vulnerabilities of llm-as-a-judge, 2025

    Terry Tong, Fei Wang, Zhe Zhao, and Muhao Chen. Badjudge: Backdoor vulnerabilities of llm-as-a-judge, 2025

  52. [52]

    Manipulating multimodal agents via cross-modal prompt injection

    Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Manipulating multimodal agents via cross-modal prompt injection. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10955–10964, 2025

  53. [53]

    Fine-tuning a vulnerability-specific large language model for a hybrid software vulnerability detection method

    Shiyu Wang, Yuyao Jiang, Shuaijianni Xu, Yiwen Liu, Ming Yin, and Guofeng He. Fine-tuning a vulnerability-specific large language model for a hybrid software vulnerability detection method. Engineering Applications of Artificial Intelligence, 165:113410, 2026

  54. [54]

    Multi-dimensional assessment of crowdsourced testing reports via llms

    Yue Wang, Yuan Zhao, Shengcheng Yu, and Zhenyu Chen. Multi-dimensional assessment of crowdsourced testing reports via llms. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1794–1806. IEEE, 2025

  55. [55]

    Synthetic social engineering scenario generation using llms for awareness-based attack resilience. IEEE Access, 2025

    Jade Webb, Faranak Abri, and Sayma Akther. Synthetic social engineering scenario generation using llms for awareness-based attack resilience. IEEE Access, 2025

  56. [56]

    Emoji attack: Enhancing jailbreak attacks against judge llm detection, 2025

    Zhipeng Wei, Yuqi Liu, and N. Benjamin Erichson. Emoji attack: Enhancing jailbreak attacks against judge llm detection, 2025

  57. [57]

    Towards domain-adapted large language models for water and wastewater management: methods, datasets and benchmarking. npj Clean Water, 8(1):82, 2025

    Boyan Xu, Guanlan Wu, Zihao Li, Guangming Xu, Huabin Zeng, Rui Tong, and How Yong Ng. Towards domain-adapted large language models for water and wastewater management: methods, datasets and benchmarking. npj Clean Water, 8(1):82, 2025

  58. [58]

    Benchmark data contamination of large language models: A survey, 2024

    Cheng Xu, Shuhao Guan, Derek Greene, and M-Tahar Kechadi. Benchmark data contamination of large language models: A survey, 2024

  59. [59]

    Justice or prejudice? quantifying biases in llm-as-a-judge, 2024

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge, 2024

  60. [60]

    Agent-as-a-judge, 2026

    Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, and Wenjie Li. Agent-as-a-judge, 2026

  61. [61]

    When ais judge ais: The rise of agent-as-a-judge evaluation for llms, 2025

    Fangyi Yu. When ais judge ais: The rise of agent-as-a-judge evaluation for llms, 2025

  62. [62]

    Leveraging large language models to detect npm malicious packages

    Nusrat Zahan, Philipp Burckhardt, Mikola Lysenko, Feross Aboukhadijeh, and Laurie Williams. Leveraging large language models to detect npm malicious packages. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 2625–2637. IEEE, 2025

    A PREPRINT - APRIL 7, 2026

  63. [63]

    Lsrp: A leader-subordinate retrieval framework for privacy-preserving cloud-device collaboration

    Yingyi Zhang, Pengyue Jia, Xianneng Li, Derong Xu, Maolin Wang, Yichao Wang, Zhaocheng Du, Huifeng Guo, Yong Liu, Ruiming Tang, et al. Lsrp: A leader-subordinate retrieval framework for privacy-preserving cloud-device collaboration. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 3889–3900, 2025

  64. [64]

    A survey of large language models, 2026

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2026

  65. [65]

    One token to fool llm-as-a-judge, 2025

    Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, and Dong Yu. One token to fool llm-as-a-judge, 2025

  66. [66]

    Rescriber: Smaller-llm-powered user-led data minimization for llm-based chatbots

    Jijie Zhou, Eryue Xu, Yaoyao Wu, and Tianshi Li. Rescriber: Smaller-llm-powered user-led data minimization for llm-based chatbots. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–28, 2025

  67. [67]

    Se-jury: An llm-as-ensemble-judge metric for narrowing the gap with human evaluation in se

    Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luís F Gomes, Guang Yang, Kui Liu, Xin Xia, and David Lo. Se-jury: An llm-as-ensemble-judge metric for narrowing the gap with human evaluation in se. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 2606–2618. IEEE, 2025

  68. [68]

    ICE-score: Instructing large language models to evaluate code

    Terry Yue Zhuo. ICE-score: Instructing large language models to evaluate code. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 2232–2242, St. Julian’s, Malta, March 2024. Association for Computational Linguistics.