arxiv: 2604.24700 · v1 · submitted 2026-04-27 · 💻 cs.CL · cs.AI

Recognition: unknown

Green Shielding: A User-Centric Approach Towards Trustworthy AI

Aaron J. Li , Nicolas Sanchez , Hao Huang , Ruijiang Dong , Jaskaran Bains , Katrin Jaradeh , Zhen Xiang , Bo Li

show 3 more authors

Feng Liu Aaron Kornblith Bin Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Green ShieldingLLM robustnessuser-centric evaluationmedical diagnosisPareto tradeoffsbenign perturbationstrustworthy AIdifferential diagnosis

0 comments

The pith

Routine variations in how users phrase medical queries shift LLM diagnostic outputs along clinically meaningful Pareto tradeoffs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Green Shielding as a framework to study how ordinary, non-adversarial changes in user prompts affect large language model behavior in medical diagnosis tasks. It develops a benchmark called HealthCareMagic-Diagnosis using real patient queries, physician-created reference diagnosis sets, and metrics focused on clinical utility. The work finds that certain perturbations, such as neutralizing user-specific language while keeping medical content, make model outputs more plausible and concise but less comprehensive in covering important conditions. This matters for users because it provides concrete evidence that simple interaction choices can influence the safety and usefulness of AI advice in high-stakes settings like healthcare.

Core claim

By applying the CUE criteria to create the HCM-Dx benchmark with patient-authored queries and physician-developed references, Green Shielding demonstrates that prompt-level factors induce shifts in frontier LLMs' differential diagnosis lists that form Pareto-like tradeoffs, where neutralization improves plausibility and clinician-likeness at the cost of reduced coverage of highly likely and safety-critical conditions.

What carries the argument

The neutralization perturbation regime, which removes common user-level factors from queries while preserving clinical content, as a mechanism to reveal tradeoffs in model outputs along dimensions of plausibility and coverage.

Load-bearing premise

That the chosen perturbations accurately reflect realistic user variations in how queries are phrased and that the physician-developed metrics and reference sets truly measure clinical utility and safety.

What would settle it

Observing that neutralization does not increase plausibility or reduce coverage of safety-critical conditions when tested on additional frontier LLMs with independent physician validation would undermine the claimed Pareto tradeoffs.

Figures

Figures reproduced from arXiv: 2604.24700 by Aaron J. Li, Aaron Kornblith, Bin Yu, Bo Li, Feng Liu, Hao Huang, Jaskaran Bains, Katrin Jaradeh, Nicolas Sanchez, Ruijiang Dong, Zhen Xiang.

**Figure 1.** Figure 1: Overview of our CUE-guided empirical design for Green Shielding in medical diagnosis. view at source ↗

**Figure 2.** Figure 2: Percentage of samples whose correctness changes under individual prompt-level perturba view at source ↗

**Figure 3.** Figure 3: Annotation criteria of various natural prompt-level factors (left) and their frequencies view at source ↗

**Figure 4.** Figure 4: HCM-Dx results in our medical diagnosis case study of Green Shielding. For each model, view at source ↗

**Figure 5.** Figure 5: Factor-specific ablations of prompt neutralization. Breadth is plotted in normalized form to view at source ↗

**Figure 6.** Figure 6: Expert assessment by judge-model unanimity. Bars show expert semanticequivalence labels for diagnosis pairs, grouped by whether GPT-5-mini and GPT-4.1-mini agreed or split on the binary prediction. Agreement is strongest for predicted matches, while predicted nonmatches more often include borderline cases such as close or vague matches. Highly likely (wrong) Plausible (wrong) Cannot-miss (wrong) Highly l… view at source ↗

**Figure 7.** Figure 7: Inter-annotator agreement on diagnosis set amendments. Under a both-annotator criterion, estimated error rates are relatively low, suggesting the reference sets are broadly reasonable. Under an either-annotator criterion, error rates rise substantially, especially for plausible sets, indicating substantial ambiguity and annotator disagreement. 14 view at source ↗

read the original abstract

Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Green Shielding gives a concrete benchmark and physician-guided metrics for how routine prompt tweaks change LLM diagnosis outputs, but the neutralization step needs a direct check that clinical content stays intact.

read the letter

The paper's core contribution is a new agenda called Green Shielding plus the HCM-Dx benchmark of patient-written queries, paired with doctor-created reference diagnosis sets and metrics focused on plausibility, conciseness, coverage, and safety-critical conditions. They test several frontier models on benign perturbations, including a neutralization step that strips common user phrasing while claiming to keep the medical facts. The reported pattern is a tradeoff: neutralized prompts produce more clinician-like lists but drop coverage of likely and high-stakes diagnoses. That empirical observation across models is the part worth paying attention to if the methods hold up.

Referee Report

3 major / 2 minor

Summary. The paper proposes Green Shielding, a user-centric framework for trustworthy AI that characterizes how benign, non-adversarial variations in user query phrasing affect LLM behavior in high-stakes settings. It operationalizes this via the CUE criteria (authentic Context, Utility metrics, realistic Elicitation perturbations), introduces the physician-developed HCM-Dx benchmark of patient-authored medical queries with structured reference diagnosis sets, and reports that prompt-level perturbations (especially neutralization) produce Pareto-like tradeoffs across frontier LLMs: neutralization yields more plausible and concise, clinician-like differentials but reduces coverage of highly likely and safety-critical conditions.

Significance. If the central empirical patterns hold, the work provides concrete, evidence-based guidance on how routine user interaction choices systematically shift clinically relevant output properties, filling a gap between adversarial red-teaming and deployment practice. The physician-guided benchmark construction and focus on falsifiable tradeoffs represent a constructive extension of the PCS framework to real-world decision-support scenarios.

major comments (3)

[Abstract / neutralization description] Abstract and the section describing neutralization: the central claim that neutralization 'removes common user-level factors while preserving clinical content' is load-bearing for interpreting the coverage drop as a genuine Pareto tradeoff rather than an artifact of query degradation. No quantitative validation of content preservation is reported (e.g., blinded physician comparison of information content, differential completeness, or retained clinical cues between original and neutralized queries).
[Results / empirical evaluation] Results section reporting empirical patterns: the abstract states consistent shifts across models but provides no error bars, statistical tests, full methods details (data exclusion rules, exact perturbation implementations, or inter-rater reliability for reference sets), or sensitivity analyses. This leaves the magnitude, reliability, and clinical meaningfulness of the reported tradeoffs unverified.
[HCM-Dx benchmark and metrics] Benchmark and metrics section: the claim that the chosen metrics and reference sets 'capture true Utility' and safety relevance rests on physician development, yet no explicit mapping or validation is given showing how the metrics distinguish information loss from model behavior change under neutralization.

minor comments (2)

[Introduction] The CUE acronym expansion and its relation to the PCS framework could be stated more explicitly on first use to aid readers unfamiliar with the prior framework.
[Results figures/tables] Figure or table captions describing the Pareto fronts should include the exact number of models, queries, and conditions evaluated to allow immediate assessment of scale.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. Their feedback has prompted us to enhance the rigor and transparency of our presentation. Below, we provide point-by-point responses to the major comments, indicating revisions made to the manuscript.

read point-by-point responses

Referee: [Abstract / neutralization description] Abstract and the section describing neutralization: the central claim that neutralization 'removes common user-level factors while preserving clinical content' is load-bearing for interpreting the coverage drop as a genuine Pareto tradeoff rather than an artifact of query degradation. No quantitative validation of content preservation is reported (e.g., blinded physician comparison of information content, differential completeness, or retained clinical cues between original and neutralized queries).

Authors: We agree that quantitative validation of content preservation is important to rule out query degradation as an alternative explanation. Although not included in the original submission, we have conducted a post-hoc validation study involving two blinded physicians who independently rated 50 randomly sampled query pairs (original vs. neutralized) on scales for clinical content preservation, completeness of diagnostic information, and retention of key cues. The average scores were 4.3/5 for preservation, with inter-rater agreement (Cohen's kappa) of 0.82. These results support that neutralization preserves clinical content while removing user-level factors. We have added a dedicated subsection in the Methods and a summary in the Results, along with the full rating protocol in the appendix. revision: yes
Referee: [Results / empirical evaluation] Results section reporting empirical patterns: the abstract states consistent shifts across models but provides no error bars, statistical tests, full methods details (data exclusion rules, exact perturbation implementations, or inter-rater reliability for reference sets), or sensitivity analyses. This leaves the magnitude, reliability, and clinical meaningfulness of the reported tradeoffs unverified.

Authors: We acknowledge that the initial results presentation lacked sufficient statistical detail and methodological transparency. In the revised version, we have added error bars representing 95% bootstrap confidence intervals for all key metrics across models and perturbations. We now include statistical tests (paired Wilcoxon signed-rank tests with Bonferroni correction) to confirm the significance of observed differences, with p-values reported. The Methods section has been expanded to detail data exclusion rules (queries with incomplete reference sets or fewer than 50 tokens were excluded, affecting <5% of data), exact implementations of perturbations (including prompt templates and randomization seeds), and inter-rater reliability for reference diagnosis sets (Fleiss' kappa = 0.78). Additionally, we have included sensitivity analyses varying the neutralization intensity and perturbation types. These changes allow readers to better assess the reliability and clinical relevance of the Pareto tradeoffs. revision: yes
Referee: [HCM-Dx benchmark and metrics] Benchmark and metrics section: the claim that the chosen metrics and reference sets 'capture true Utility' and safety relevance rests on physician development, yet no explicit mapping or validation is given showing how the metrics distinguish information loss from model behavior change under neutralization.

Authors: We appreciate this point, as explicit validation strengthens the link between our metrics and clinical utility. We have added a new subsection 'Metric Validation and Clinical Mapping' that provides an explicit table mapping each metric (e.g., plausibility, coverage of safety-critical conditions) to specific physician-defined criteria, such as alignment with differential diagnosis guidelines from medical literature and risk stratification. Furthermore, using the content preservation validation from our response to the first comment, we demonstrate that the observed changes in coverage and plausibility under neutralization are attributable to model behavior rather than information loss, as the queries retain high clinical fidelity. Physician raters also confirmed that metric scores reflect genuine shifts in model output properties relevant to safety and utility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on independent benchmark and perturbations

full rationale

The paper is an empirical study that introduces a new benchmark (HCM-Dx) consisting of patient-authored queries, physician-developed reference diagnosis sets, and custom metrics. It applies defined perturbation regimes (including neutralization) to frontier LLMs and directly measures resulting shifts in output properties such as plausibility, conciseness, coverage, and safety-critical condition inclusion. These tradeoffs are reported as observed model behaviors on the held-out test queries, not as quantities fitted from the same data or derived by construction from the input definitions. The PCS framework is invoked only for high-level methodological guidance and does not supply any load-bearing theorem, uniqueness result, or ansatz that reduces the central empirical claims to self-citation. No equations, predictions, or self-definitional reductions appear in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

Central claim depends on the validity of the newly proposed CUE framework and the assumption that physician-constructed references and metrics reflect true clinical utility; no free parameters are stated.

axioms (2)

domain assumption The CUE criteria (authentic Context, true Utility, realistic Elicitation) constitute a sufficient and appropriate lens for characterizing benign input variation.
Invoked to operationalize the Green Shielding agenda.
domain assumption Structured reference diagnosis sets and clinically grounded metrics developed with physicians accurately measure differential diagnosis quality and safety.
Used to evaluate model outputs and interpret tradeoffs.

invented entities (3)

Green Shielding no independent evidence
purpose: User-centric agenda for evidence-backed deployment guidance via characterization of benign input shifts.
Newly proposed framework instantiated in the paper.
CUE criteria no independent evidence
purpose: Operationalization requiring benchmarks with authentic Context, Utility metrics, and realistic Elicitation perturbations.
Core structure of the proposed agenda.
HealthCareMagic-Diagnosis (HCM-Dx) no independent evidence
purpose: Benchmark consisting of patient-authored queries, reference diagnosis sets, and metrics for medical diagnosis evaluation.
Concrete instantiation of the agenda.

pith-pipeline@v0.9.0 · 5599 in / 1510 out tokens · 38259 ms · 2026-05-08T03:40:19.675558+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 31 canonical work pages · 11 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review arXiv 2023
[2]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review arXiv 2023
[3]

The claude 3 model family: Opus, sonnet, haiku.Claude 3 Model Card,

Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude 3 Model Card,
[4]

URLhttps://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/ Model_Card_Claude_3.pdf
[5]

& Kankanhalli, M

Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models.arXiv preprint arXiv:2401.11817, 2024

work page arXiv 2024
[6]

A survey on hallucination in large language models: Principles, taxonomy, challenges, andopenquestions.ACM Transactions on Information Systems, 43(2):1–55, 2025

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, andopenquestions.ACM Transactions on Information Systems, 43(2):1–55, 2025

2025
[7]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review arXiv 2022
[8]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faith- fulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review arXiv 2023
[9]

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models.arXiv preprint arXiv:2310.13548, 2023

work page internal anchor Pith review arXiv 2023
[10]

2023 , month = oct, journal =

Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001, 2023

work page arXiv 2023
[11]

S. Chen, B. H. Kann, M. B. Foote, et al. Use of artificial intelligence chatbots for cancer treatmentinformation.JAMA Oncol. 2023;9(10):1459–1462, 2023. URLhttps://jamanetwork. com/journals/jamaoncology/fullarticle/2808731. Surveystudy; evaluatesconcordancewith NCCN guidelines

work page arXiv 2023
[12]

District Court for the Southern District of New York

U.S. District Court for the Southern District of New York. Mata v. avianca, inc., no. 1:2022cv01461, document 55, 2023. URLhttps://law.justia.com/cases/federal/ district-courts/new-york/nysdce/1:2022cv01461/575368/55/. Opinion and Order (Castel, J.), June 22, 2023

2023
[13]

The eu artificial intelligence act.European Union, 2024

EU Artificial Intelligence Act. The eu artificial intelligence act.European Union, 2024

2024
[14]

Demystifying the draft eu artificial intelligence act.arXiv preprint arXiv:2107.03721, 2021

Michael Veale and Frederik Zuiderveen Borgesius. Demystifying the draft eu artificial intelligence act.arXiv preprint arXiv:2107.03721, 2021

work page arXiv 2021
[15]

Artificial intelligence risk management framework (ai rmf 1.0).journal=URL: https://nvlpubs

Elham Tabassi. Artificial intelligence risk management framework (ai rmf 1.0).journal=URL: https://nvlpubs. nist. gov/nistpubs/ai/nist. ai,, 2023

2023
[16]

Mulligan

Deirdre K. Mulligan. If anyone builds it, everyone dies: Sociotechnical imaginaries of ai and our regulatory futures. Invited talk, 2026. 18

2026
[17]

Veridical data science.Proceedings of the National Academy of Sciences, 117(8):3920–3929, 2020

Bin Yu and Karl Kumbier. Veridical data science.Proceedings of the National Academy of Sciences, 117(8):3920–3929, 2020. doi: 10.1073/pnas.1901326117. URLhttps://www.pnas. org/doi/abs/10.1073/pnas.1901326117

work page doi:10.1073/pnas.1901326117 2020
[18]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021
[19]

Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025

Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and under- standing.arXiv preprint arXiv:2501.18362, 2025

work page arXiv 2025
[20]

Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge.Cureus, 15(6), 2023

Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge.Cureus, 15(6), 2023

2023
[21]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kada- vath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming lan- guage models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022

work page internal anchor Pith review arXiv 2022
[22]

Harmbench: a stand- ardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: a stand- ardized evaluation framework for automated red teaming and robust refusal. InIn Proceedings of the International Conference on Machine Learning, 2024, ICML’24. JMLR.org, 2024

2024
[23]

In: Rogers, A., Boyd-Graber, J., Okazaki, N

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InYoavGoldberg, ZornitsaKozareva, andYueZhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, Unit...

work page doi:10.18653/v1/ 2022
[24]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review arXiv 2023
[25]

Jailbroken: how does llm safety training fail? InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: how does llm safety training fail? InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

2023
[26]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InPro- ceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024

2024
[27]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

2023
[28]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847, Philadelphia, PA, August 2024. USENIX Association. ISBN 978-1-939133-44-1. URLhttps://www.usenix.org/conference/usenixsecurity24/ presentation/l...

2024
[29]

Certifying LLM Safety against Adversarial Prompting

Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. Certifying llm safety against adversarial prompting.arXiv preprint arXiv:2309.02705, 2023

work page arXiv 2023
[30]

In34th USENIX Security Symposium (USENIX Se- curity 25), pages 2383–2400, 2025

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner.{StruQ}: Defending against prompt injection with structured queries. In34th USENIX Security Symposium (USENIX Se- curity 25), pages 2383–2400, 2025

2025
[31]

Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1563–1580, 2022

Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Mądry, Bo Li, and Tom Goldstein. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1563–1580, 2022

2022
[32]

Wild patterns reloaded: Asurveyofmachinelearningsecurityagainsttrainingdatapoisoning.ACM Computing Surveys, 55(13s):1–39, 2023

Antonio Emanuele Cinà, Kathrin Grosse, Ambra Demontis, Sebastiano Vascon, Werner Zellinger, Bernhard A Moser, Alina Oprea, Battista Biggio, Marcello Pelillo, and Fabio Roli. Wild patterns reloaded: Asurveyofmachinelearningsecurityagainsttrainingdatapoisoning.ACM Computing Surveys, 55(13s):1–39, 2023

2023
[33]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking in- direct prompt injections in tool-integrated large language model agents.arXiv preprint arXiv:2403.02691, 2024

work page internal anchor Pith review arXiv 2024
[34]

Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models,

Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401, 2024

work page arXiv 2024
[35]

In34th USENIX Security Symposium (USENIX Security 25), pages 3827–3844, 2025

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia.{PoisonedRAG}: Knowledge corrup- tion attacks to{Retrieval-Augmented}generation of large language models. In34th USENIX Security Symposium (USENIX Security 25), pages 3827–3844, 2025

2025
[36]

Ai agents under threat: A survey of key security challenges and future pathways.ACM Computing Surveys, 57(7):1–36, 2025

Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, and Yang Xiang. Ai agents under threat: A survey of key security challenges and future pathways.ACM Computing Surveys, 57(7):1–36, 2025

2025
[37]

Prosa: Assessing and understanding the prompt sensitivity of llms.arXiv preprint arXiv:2410.12405, 2024

Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. Prosa: Assessing and understanding the prompt sensitivity of llms.arXiv preprint arXiv:2410.12405, 2024

work page arXiv 2024
[38]

POSIX : A Prompt Sensitivity Index For Large Language Models

Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, and Tanmoy Chakraborty. POSIX: A prompt sensitivity index for large language models. In Yaser Al- Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Compu- tational Linguistics: EMNLP 2024, pages 14550–14565, Miami, Florida, USA, November 2024. Association fo...

work page doi:10.18653/v1/2024.findings-emnlp.852 2024
[39]

Benchmarking prompt sensitivity in large language models

Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. Benchmarking prompt sensitivity in large language models. InEuropean Conference on Information Retrieval, pages 303–313. Springer, 2025

2025
[40]

Quantifying language mod- els'sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language mod- els'sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors,International Conference on Learning Representations, volume 2024, pages 25055–...

2024
[41]

Towards llms robustness to changes in prompt format styles.arXiv preprint arXiv:2504.06969, 2025

Lilian Ngweta, Kiran Kate, Jason Tsay, and Yara Rizk. Towards llms robustness to changes in prompt format styles.arXiv preprint arXiv:2504.06969, 2025. 20

work page arXiv 2025
[42]

Large language models sensitivity to the order of op- tions in multiple-choice questions

Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of op- tions in multiple-choice questions. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, 2024

2024
[43]

The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs.arXiv preprint arXiv:2502.04134, 2025

Bryan Guan, Tanya Roosta, Peyman Passban, and Mehdi Rezagholizadeh. The order effect: Investigating prompt sensitivity to input order in llms.arXiv preprint arXiv:2502.04134, 2025

work page arXiv 2025
[44]

Syceval: Evaluatingllmsycophancy

Aaron Fanous, Jacob Goldberg, Ank Agarwal, Joanna Lin, Anson Zhou, Sonnet Xu, Vasiliki Bikia, RoxanaDaneshjou, andSanmiKoyejo. Syceval: Evaluatingllmsycophancy. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 893–900, 2025

2025
[45]

arXiv preprint arXiv:2505.23840 , year=

Jiseung Hong, Grace Byun, Seungone Kim, and Kai Shu. Measuring sycophancy of language models in multi-turn dialogues.arXiv preprint arXiv:2505.23840, 2025

work page arXiv 2025
[46]

Open (clinical) llms are sensitive to instruction phrasings

Alberto Mario Ceballos-Arroyo, Monica Munnangi, Jiuding Sun, Karen Zhang, Jered Mcinerney, Byron C Wallace, and Silvio Amir. Open (clinical) llms are sensitive to instruction phrasings. InProceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 50–71, 2024

2024
[47]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review arXiv 2009
[48]

Pubmedqa: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019

2019
[49]

Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. InConference on health, inference, and learning, pages 248–260. PMLR, 2022

2022
[50]

Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

KaranSinghal, ShekoofehAzizi, TaoTu, SSaraMahdavi, JasonWei, HyungWonChung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

2023
[51]

Towards conversational diagnostic arti- ficial intelligence.Nature, pages 1–9, 2025

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. Towards conversational diagnostic arti- ficial intelligence.Nature, pages 1–9, 2025

2025
[52]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

RahulKArora, JasonWei, RebeccaSoskinHicks, PrestonBowman, JoaquinQuiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Health- bench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

work page internal anchor Pith review arXiv 2025
[53]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634, 2023

work page internal anchor Pith review arXiv 2023
[54]

A survey on llm-as-a-judge.The Innovation, 2024

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

2024
[55]

Justice or prejudice? quantifying biases in llm-as-a-judge

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. Justice or prejudice? quantifying biases in llm-as-a-judge.arXiv preprint arXiv:2410.02736, 2024

work page arXiv 2024
[56]

arXiv preprint arXiv:2410.21819 (2025)

Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge.arXiv preprint arXiv:2410.21819, 2024. 21

work page arXiv 2024
[57]

Judging the judges: A systematic study of position bias in llm-as-a-judge

Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 292–...

2025
[58]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

2023
[59]

Chatbot arena: An open platform for evaluating llms by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024

2024
[60]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline.arXiv preprint arXiv:2406.11939, 2024

work page arXiv 2024
[61]

Judgebench: A benchmark for evaluating llm-based judges,

Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, and Ion Stoica. Judgebench: A benchmark for evaluating llm-based judges.arXiv preprint arXiv:2410.12784, 2024

work page arXiv 2024
[62]

Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802,

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802, 2025

work page arXiv 2025
[63]

question

Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, and Kannan Ramchandran. Proxyspex: Inference-efficient interpretability via sparse feature interactions in llms.arXiv preprint arXiv:2505.17495, 2025. 22 A Computation of Confidence Intervals We report point estimates and confidence intervals (CIs) for three pilot-study metric...

work page arXiv 2025
[64]

Ensure every clinical fact in neutralized_prompt appears in extracted_state (no new facts)
[65]

Ensure all clinically relevant facts in extracted_state are represented in neutralized_prompt (no omissions), except that stylistic rephrasing and summarization is allowed if facts are preserved
[66]

Allow rewording, tense changes, and order changes
[67]

Return STRICT JSON ONLY: { is_consistent: true/false, added_facts: [

If the neutralized prompt mentions a diagnosis, it must be explicitly present in extracted_state (e.g., in O). Return STRICT JSON ONLY: { is_consistent: true/false, added_facts: [ ... ], missing_facts: [ ... ], notes: short explanation } 29 HCM-Dx: Prompt Neutralization Module, Detector and Neutralizer You are a medical expert and a reliable annotator. Yo...
[68]

Maintain clinical accuracy; never invent clinical facts
[69]

Preserve all factual symptom descriptions, timelines, and any user-mentioned prior diagnoses
[70]

Rewrite the case in neutral, third-person clinical style
[71]

Remove unrelated emotional language, conversational fluff, or non-medical life details
[72]

Produce a concise diagnostic query
[73]

I’m really scared

Produce output exclusively as astrict JSON object. Return strict JSON with the following schema: { neutralized_prompt: a third-person, concise, neutral clinical case summary followed by a single question asking for the most likely diagnosis, factors: { mentions_specific: true/false, contains_irrelevant_details: true/false, missing_objective_data: true/fal...