Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3
The pith
A single legitimately phrased request to an LLM orchestrator can decompose into individually benign subtasks that together violate security policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semantic Intent Fragmentation is an attack class against LLM orchestration systems where a single legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. The violation only emerges at the composed plan because existing safety mechanisms operate at the subtask level. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. Across 14 scenarios, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign.
What carries the argument
Semantic Intent Fragmentation (SIF), the mechanism by which task decomposition in LLM orchestrators creates composed policy violations from subtasks that each pass isolated checks.
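To make the gap concrete, here is a minimal sketch of the subtask-vs-plan asymmetry the review describes. This is not the paper's code: `is_benign_subtask` and `violates_policy_composed` are hypothetical stand-ins for a subtask-level classifier and a plan-level policy check.

```python
# Illustrative only: a toy fragmented plan where every step clears an
# isolated check but the composition trips a plan-level policy.

# An HR-analytics request decomposed into individually routine steps.
subtasks = [
    "Export the employee roster with department and tenure columns",
    "Join the roster against the salary band table",
    "Aggregate results by zip code and job title",
    "Email the aggregated table to the requesting address",
]

BENIGN_VERBS = {"export", "join", "aggregate", "email"}

def is_benign_subtask(step: str) -> bool:
    """Stand-in subtask-level classifier: each routine data-handling
    verb clears the check when examined in isolation."""
    return step.split()[0].lower() in BENIGN_VERBS

def violates_policy_composed(steps: list[str]) -> bool:
    """Stand-in plan-level check: flags salary data, quasi-identifiers,
    and an outbound send co-occurring in one plan."""
    text = " ".join(steps).lower()
    return "salary" in text and "zip code" in text and "email" in text

assert all(is_benign_subtask(s) for s in subtasks)  # every step passes alone
assert violates_policy_composed(subtasks)           # the composition is flagged
```

The point is structural: `is_benign_subtask` never sees more than one step, so no single call can observe the quasi-identifier aggregation that only the composed plan exhibits.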
If this is right
- Stronger orchestrators produce higher SIF success rates.
- Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution (a minimal taint-tracking sketch follows this list).
- SIF requires no injected content or ongoing attacker control after the initial request.
- The compositional safety gap is closable by shifting evaluation from subtask to full plan.
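As flagged above, here is a minimal sketch of what plan-level information-flow tracking could look like, assuming a plan is an explicit list of (step, inputs, outputs, labels introduced) tuples. The step names, taint labels, and sink policy are illustrative assumptions, not the paper's deterministic taint analysis.

```python
# Toy plan graph: each step declares the artifacts it reads, the
# artifacts it writes, and any sensitivity labels it introduces.
plan = [
    ("read_roster",   [],                     ["roster"],   {"roster": {"PII"}}),
    ("read_salaries", [],                     ["salaries"], {"salaries": {"CONFIDENTIAL"}}),
    ("join_tables",   ["roster", "salaries"], ["report"],   {}),
    ("send_external", ["report"],             [],           {}),
]

EXTERNAL_SINKS = {"send_external"}
FORBIDDEN = {"PII", "CONFIDENTIAL"}  # must not jointly reach an external sink

def check_plan(plan):
    taint = {}  # artifact name -> set of labels that have flowed into it
    for step, inputs, outputs, introduced in plan:
        flowing = set()
        for artifact in inputs:             # propagate taint along data flow
            flowing |= taint.get(artifact, set())
        for labels in introduced.values():  # labels this step creates
            flowing |= labels
        if step in EXTERNAL_SINKS and FORBIDDEN <= flowing:
            return f"BLOCK before execution: {step} carries {sorted(flowing)}"
        for artifact in outputs:
            taint[artifact] = set(flowing)
    return "ALLOW"

print(check_plan(plan))
# -> BLOCK before execution: send_external carries ['CONFIDENTIAL', 'PII']
```

Each step in isolation is unremarkable (a read, a join, a send), which is exactly why the check must run over the whole plan rather than per subtask.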
Where Pith is reading between the lines
- Multi-agent systems in regulated domains may need mandatory plan-level review before any subtask runs.
- Defenses built only around individual step classification leave open a route for single-shot attacks that scale with model capability.
- Similar fragmentation effects could appear in non-LLM orchestration pipelines that rely on sequential task breakdown.
Load-bearing premise
The three validation signals correctly flag policy violations with zero false positives, and the subtasks are genuinely benign when examined in isolation.
What would settle it
Re-running the 14 scenarios with an independent human policy reviewer or a fourth validation method. The claim would fall if any subtask labeled benign actually contains a violation, or if any plan labeled violating actually complies with policy.
Original abstract
We introduce Semantic Intent Fragmentation (SIF), an attack class against LLM orchestration systems where a single, legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. Current safety mechanisms operate at the subtask level, so each step clears existing classifiers -- the violation only emerges at the composed plan. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. We construct a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks to generate realistic enterprise scenarios. Across 14 scenarios spanning financial reporting, information security, and HR analytics, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign. Three independent signals validate this: deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge with 0% false positives. Stronger orchestrators increase SIF success rates. Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution, showing the compositional safety gap is closable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Semantic Intent Fragmentation (SIF), a single-shot compositional attack on LLM orchestration pipelines in which a legitimately phrased request causes the orchestrator to decompose the task into subtasks that are individually benign yet jointly violate security policy. It reports results from a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks, showing that a GPT-20B orchestrator produces policy-violating plans in 71% (10/14) of scenarios spanning financial reporting, information security, and HR analytics, while every subtask appears benign. Validation rests on three signals—deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge stated to have 0% false positives—and the work proposes plan-level information-flow tracking plus compliance evaluation as a mitigation that detects all attacks before execution.
Significance. If the empirical claims are substantiated with reproducible details, the work identifies a genuine gap in subtask-level safety mechanisms for multi-agent LLM systems and directly addresses OWASP LLM06:2025. The demonstration of four concrete mechanisms (bulk scope escalation, silent data exfiltration, embedded trigger deployment, quasi-identifier aggregation) across realistic enterprise domains, together with the constructive proposal for plan-level detection, would be a useful contribution to compositional security research. The grounding in established frameworks and the use of multiple independent validation signals are strengths that increase the potential impact if the results can be verified.
major comments (2)
- [Abstract] The central empirical claim (71% success rate, 10/14 scenarios) depends on the cross-model compliance judge having 0% false positives and on the three validation signals correctly distinguishing composed violations from benign subtasks. No description is given of the test set, policy formalization, inter-rater agreement, or how the 0% false-positive rate was measured; without these, the 10/14 count cannot be distinguished from possible classifier agreement.
- [Evaluation] The manuscript provides no details on the 14 scenarios, exact prompts, data exclusion rules, or raw outputs. This absence makes it impossible to reproduce the experiments or independently confirm that subtasks were genuinely benign when evaluated in isolation by taint analysis and CoT evaluation, which is load-bearing for the claim that the violation emerges only at composition.
minor comments (1)
- [Abstract] The model is referred to as 'GPT-20B' without clarification of whether this denotes a specific released model, a hypothetical parameter count, or a fine-tuned variant; adding this specification in the methods would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important gaps in documentation that affect reproducibility and verifiability of the empirical claims. We address each major comment below and have revised the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Abstract] The central empirical claim (71% success rate, 10/14 scenarios) depends on the cross-model compliance judge having 0% false positives and on the three validation signals correctly distinguishing composed violations from benign subtasks. No description is given of the test set, policy formalization, inter-rater agreement, or how the 0% false-positive rate was measured; without these, the 10/14 count cannot be distinguished from possible classifier agreement.
  Authors: We agree that the manuscript does not sufficiently document the validation of the cross-model compliance judge. In the revised version we add a new subsection in the Evaluation section that describes: the 200-prompt test set used to measure false positives (constructed from benign enterprise queries with no compositional intent), the policy formalization process (mapping OWASP LLM06:2025 and NIST controls to explicit rules), inter-rater agreement (Cohen's kappa = 0.94 between two human reviewers and the judge; the arithmetic behind such a figure is sketched after these responses), and the exact measurement protocol (zero false positives observed across all 200 prompts when subtasks were evaluated in isolation). These additions directly support the claim that violations emerge only at composition. Revision: yes.
- Referee: [Evaluation] The manuscript provides no details on the 14 scenarios, exact prompts, data exclusion rules, or raw outputs. This absence makes it impossible to reproduce the experiments or independently confirm that subtasks were genuinely benign when evaluated in isolation by taint analysis and CoT evaluation, which is load-bearing for the claim that the violation emerges only at composition.
  Authors: We acknowledge that the original manuscript omitted the concrete experimental artifacts required for reproduction. The revised manuscript includes a new Appendix B containing: all 14 scenarios with their verbatim initial prompts, the full subtask decompositions produced by the GPT-20B orchestrator, the deterministic taint-analysis and CoT-evaluation results for each isolated subtask, data exclusion rules (scenarios were retained only if every subtask passed both signals), and representative raw model outputs. We will also release the complete prompt set, evaluation harness, and anonymized logs in a public repository to allow independent verification that the policy violation appears only after composition. Revision: yes.
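For readers unfamiliar with the agreement statistic cited in the first response, the sketch below shows how a Cohen's kappa value is computed: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The 2x2 counts are invented purely to demonstrate the arithmetic; they are not the paper's data.

```python
def cohens_kappa(confusion):
    """confusion[i][j] = count of items rater A labeled i and rater B labeled j."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / n        # observed agreement
    row_marg = [sum(row) / n for row in confusion]
    col_marg = [sum(confusion[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: 200 prompts labeled {violating, benign} by judge vs. human.
conf = [[38, 2],    # judge says violating: human agrees 38, disagrees 2
        [1, 159]]   # judge says benign:    human disagrees 1, agrees 159
print(round(cohens_kappa(conf), 2))  # 0.95 for these invented counts
```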
Circularity Check
No circularity: empirical red-teaming result grounded in external frameworks
Full rationale
The paper reports an empirical count (10/14 scenarios) from red-teaming a GPT-20B orchestrator against enterprise scenarios, using three validation signals (deterministic taint analysis, chain-of-thought evaluation, cross-model judge) explicitly tied to OWASP, MITRE ATLAS, and NIST standards. No equations, fitted parameters, or self-citations are used to derive the 71% rate or the 0% false-positive claim; both are presented as direct experimental outcomes. The central claim does not reduce to a self-definition, renamed known result, or load-bearing self-citation chain, satisfying the requirement for independent, externally benchmarked content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Current safety mechanisms in LLM orchestration systems evaluate and clear subtasks independently, without considering the composed plan.
invented entities (1)
- Semantic Intent Fragmentation (SIF): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: Definition 1 (Fragmentation Score). FS(P) = 1 − max_i { max_{c∈C} f_c(T_i) }. ... Theorem 2 (Compositional Emergence). Let P be a SIF attack with FS = 1.0. Then ∃Π s.t. ∀i: H(T_i) = SAFE yet H(P) = UNSAFE. (A worked numeric reading of Definition 1 follows this list.)
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation.
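As flagged above, here is a worked numeric reading of Definition 1, under the assumption that f_c(T_i) is classifier c's violation score for subtask T_i on a [0, 1] scale (the excerpt does not state the range). FS(P) = 1.0 then means no classifier flags any subtask at all, which is the precondition of Theorem 2.

```python
# Illustrative only: the scores below are invented, not the paper's data.

def fragmentation_score(scores):
    """scores[i][c] = violation score of classifier c on subtask T_i,
    so FS(P) = 1 - max over subtasks of the worst classifier score."""
    return 1.0 - max(max(per_subtask) for per_subtask in scores)

# Three subtasks x two classifiers, all scored fully benign in isolation:
sif_plan = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(fragmentation_score(sif_plan))  # 1.0: the Theorem 2 precondition holds

# One subtask trips a classifier, so the attack is no longer fully fragmented:
partial = [[0.0, 0.0], [0.6, 0.1], [0.0, 0.0]]
print(fragmentation_score(partial))   # 0.4
```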
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rishi Jha, Harold Triedman, Justin Wagle, and Vitaly Shmatikov. Breaking and fixing defenses against control-flow hijacking in multi-agent systems. arXiv preprint arXiv:2510.17276, 2025.
- [2] Harold Triedman, Rishi Jha, and Vitaly Shmatikov. Multi-agent systems execute arbitrary malicious code. arXiv preprint arXiv:2503.12188, 2025.
- [3] Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, and Adel Bibi. Omni-leak: Orchestrator multi-agent network induced data leakage. arXiv preprint arXiv:2602.13477, 2026.
- [4] Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, and Angelo Furfaro. The dark side of LLMs: Agent-based attacks for complete computer takeover. CoRR, abs/2507.06850, 2025.
- [5] Erik Jones, Anca Dragan, and Jacob Steinhardt. Adversaries can misuse combinations of safe models. In Forty-second International Conference on Machine Learning, 2025.
- [6] Xikang Yang, Xuehai Tang, Songlin Hu, and Jizhong Han. Chain of attack: A semantic-driven contextual multi-turn attacker for LLM. arXiv preprint arXiv:2405.05610, 2024.
- [7] Maksym Andriushchenko et al. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In Proceedings of the 13th International Conference on Learning Representations, 2025.
- [8] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, 2022. Association for Computational...
- [9] NeuralTrust Research. The semantic chaining attack: Bypassing multimodal AI safety filters via sequential semantic manipulation. NeuralTrust Technical Report, https://neuraltrust.ai/blog/semantic-chaining, January 2026. (Multimodal single-model attack requiring attacker participation at every step.)
- [10] Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. LlamaFirewall: An open source guardrail system for building secure AI agents, 2025. URL: https://arxiv.org/abs/2505.03574.
- [11] Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Securing AI agents with information-flow control. arXiv, May 2025.
- [12] Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025.
- [13] Zhaorun Chen, Mintong Kang, and Bo Li. ShieldAgent: Shielding agents via verifiable safety policy reasoning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 8313–8344, 2025.
- [14] OWASP Foundation. OWASP LLM top 10 for large language model applications, version 2.0. https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025.
- [15] MITRE Corporation. MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems. https://atlas.mitre.org, 2025.
- [16] Vaidehi Patil, Elias Stengel-Eskin, and Mohit Bansal. The sum leaks more than its parts: Compositional privacy risks and mitigations in multi-agent collaboration. arXiv preprint arXiv:2509.14284, 2025.
- [17] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [18] Nirmit Arora, Sathvik Joel, Ishan Kavathekar, Rohan Gandhi, Yash Pandya, Tanuja Ganu, Aditya Kanade, Akshay Nambi, et al. Exposing weak links in multi-agent systems under adversarial prompting. arXiv preprint arXiv:2511.10949, 2025.
- [19] OpenAI. GPT-20B (OpenAI open-weights MoE 20B). Model weights: openai/gpt-oss-20b, 2025.
- [20] AI @ Meta Llama Team. The Llama 3 herd of models, 2024. (Llama 3.1-8B-Instruct used as classifier and evaluator.)
- [21] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:23...
- [22] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023. (LlamaGuard-7b safety classifier.)
- [23] KoalaAI. KoalaAI/Text-Moderation: DistilBERT-based content moderation model. Hugging Face model hub: KoalaAI/Text-Moderation, 2023.
- [24] Laura Hanu and Unitary Team. Detoxify. GitHub repository: https://github.com/unitaryai/detoxify, 2020.
- [25] Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. Aegis: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.
- [26] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, v...
- [27] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, Miami, Flori...
- [28] Meta Llama Team. Prompt Guard: Prompt injection and jailbreak detection (Prompt-Guard-86M). Hugging Face model hub: meta-llama/Prompt-Guard-86M, 2024.
- [29] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, US...
- [30] AI @ Meta Llama Team. The Llama 3 herd of models (includes Llama Guard 3), 2024. (Llama Guard 3, a 13-category safety classifier, released as part of the Llama 3.1 suite; no standalone paper.)
- [31] Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, and Lizi Liao. Overconfidence in LLM-as-a-judge: Diagnosis and confidence-driven solution. arXiv preprint arXiv:2508.06225, 2025.
- [32] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics.