Recognition: 2 theorem links
· Lean TheoremCooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models
Pith reviewed 2026-05-13 21:58 UTC · model grok-4.3
The pith
LLMs lack food safety alignment and produce harmful advice under jailbreaks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current LLMs exhibit sparse safety alignment in the food-related domain, easily succumbing to a few canonical jailbreak strategies. When compromised, LLMs frequently generate actionable yet harmful instructions. Existing LLM-based guardrails systematically overlook these domain-specific threats. To address this, we introduce FoodGuardBench and FoodGuard-4B.
What carries the argument
FoodGuardBench, a benchmark of 3,339 FDA-grounded queries combined with jailbreak attacks to measure LLM safety failures in food preparation tasks.
If this is right
- LLMs used for recipe or meal-planning advice could recommend unsafe food storage or preparation steps that cause illness.
- Malicious actors can more readily obtain concrete harmful food instructions than in other domains.
- Guardrail developers must add domain-specific training data such as FDA rules to catch food safety violations.
- Specialized fine-tuned models like FoodGuard-4B can be inserted as filters to reduce the rate of harmful outputs.
Where Pith is reading between the lines
- Similar sparse alignment likely appears in other regulated health domains such as medication advice or nutrition claims.
- The method of grounding benchmarks in official guidelines could be repeated for legal or financial query risks.
- Public-facing cooking assistants would benefit from continuous real-world query monitoring beyond any static benchmark.
Load-bearing premise
The 3,339 queries built from FDA guidelines and representative jailbreak examples cover the main food safety risks users would actually pose to LLMs.
What would settle it
Test the same LLMs and guardrails on a new collection of food safety queries drawn from real user forums or public health incident reports and check whether failure rates match the benchmark results.
Figures
read the original abstract
Large language models (LLMs) are increasingly deployed for everyday tasks, including food preparation and health-related guidance. However, food safety remains a high-stakes domain where inaccurate or misleading information can cause severe real-world harm. Despite these risks, current LLMs and safety guardrails lack rigorous alignment tailored to domain-specific food hazards. To address this gap, we introduce FoodGuardBench, the first comprehensive benchmark comprising 3,339 queries grounded in FDA guidelines, designed to evaluate the safety and robustness of LLMs. By constructing a taxonomy of food safety principles and employing representative jailbreak attacks (e.g., AutoDAN and PAP), we systematically evaluate existing LLMs and guardrails. Our evaluation results reveal three critical vulnerabilities: First, current LLMs exhibit sparse safety alignment in the food-related domain, easily succumbing to a few canonical jailbreak strategies. Second, when compromised, LLMs frequently generate actionable yet harmful instructions, inadvertently empowering malicious actors and posing tangible risks. Third, existing LLM-based guardrails systematically overlook these domain-specific threats, failing to detect a substantial volume of malicious inputs. To mitigate these vulnerabilities, we introduce FoodGuard-4B, a specialized guardrail model fine-tuned on our datasets to safeguard LLMs within food-related domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FoodGuardBench, a benchmark of 3,339 queries grounded in FDA guidelines and combined with jailbreak templates (AutoDAN, PAP), to evaluate LLMs and existing guardrails on food-safety alignment. It identifies three vulnerabilities—sparse domain-specific safety alignment, generation of actionable harmful instructions, and systematic guardrail blind spots—and proposes FoodGuard-4B, a fine-tuned 4B-parameter guardrail, as mitigation.
Significance. If the benchmark queries prove representative of real user intent, the work is significant for exposing an under-aligned high-stakes domain and supplying both diagnostic evidence and a concrete guardrail. The empirical focus on concrete failure modes and the release of a specialized model could usefully inform deployment practices in food-related applications.
major comments (3)
- [Benchmark construction] Benchmark construction (abstract and §3): queries are formed by grafting FDA guideline fragments onto canonical jailbreak templates, yet no validation against real query logs, expert plausibility review, or coverage of naturalistic/culturally specific phrasing is reported. This directly affects whether measured failure rates support the claim of tangible real-world risks.
- [Evaluation results] Evaluation results (abstract): the manuscript states that systematic evaluation reveals three critical vulnerabilities, but provides no quantitative metrics, success rates, error bars, per-model breakdowns, or statistical tests. Without these, the magnitude and robustness of the reported vulnerabilities cannot be assessed.
- [FoodGuard-4B] FoodGuard-4B (proposed mitigation section): details on the fine-tuning dataset composition, training hyperparameters, and head-to-head performance against existing guardrails on the benchmark are missing, leaving the effectiveness of the proposed solution unverified.
minor comments (2)
- Clarify the precise taxonomy of food-safety principles and the exact procedure for combining FDA text with jailbreak templates.
- Add explicit statements on query deduplication, length distribution, and any filtering steps applied to the 3,339 queries.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction (abstract and §3): queries are formed by grafting FDA guideline fragments onto canonical jailbreak templates, yet no validation against real query logs, expert plausibility review, or coverage of naturalistic/culturally specific phrasing is reported. This directly affects whether measured failure rates support the claim of tangible real-world risks.
Authors: We agree that additional validation would strengthen claims of real-world relevance. Our construction deliberately anchors queries in FDA guidelines for factual accuracy and employs established jailbreak templates (AutoDAN, PAP) to isolate domain-specific vulnerabilities. In the revised version we will expand §3 with an explicit limitations subsection discussing the absence of proprietary query logs, the rationale for the grafting approach, and plans for future expert review and cultural coverage. We will also report the internal consistency checks performed during dataset curation. revision: partial
-
Referee: [Evaluation results] Evaluation results (abstract): the manuscript states that systematic evaluation reveals three critical vulnerabilities, but provides no quantitative metrics, success rates, error bars, per-model breakdowns, or statistical tests. Without these, the magnitude and robustness of the reported vulnerabilities cannot be assessed.
Authors: Section 4 already contains per-model success rates, confusion matrices, and breakdowns across the 3,339 queries. To improve clarity we will (1) revise the abstract to include headline quantitative results (e.g., average attack success rates and guardrail detection gaps) and (2) add error bars plus any applicable statistical comparisons to the existing tables and figures. revision: yes
-
Referee: [FoodGuard-4B] FoodGuard-4B (proposed mitigation section): details on the fine-tuning dataset composition, training hyperparameters, and head-to-head performance against existing guardrails on the benchmark are missing, leaving the effectiveness of the proposed solution unverified.
Authors: We will expand the FoodGuard-4B section with the requested details: dataset composition (size, balance of safe/unsafe examples, source distribution), full training hyperparameters (learning rate, batch size, epochs, optimizer), and direct benchmark comparisons against Llama-Guard, OpenAI moderation, and other baselines, including numerical performance deltas. revision: yes
Circularity Check
Empirical benchmark evaluation with no load-bearing circular derivations
full rationale
The paper constructs FoodGuardBench from external FDA guidelines and standard jailbreak templates (AutoDAN, PAP), then reports empirical failure rates of existing LLMs and guardrails on those queries before fine-tuning FoodGuard-4B. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. Central claims rest on direct measurement rather than self-definition or self-citation chains. A score of 2 accounts for routine self-citations in related-work sections that do not carry the main results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption FDA guidelines are treated as the authoritative source for food safety principles.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclearWe introduce FoodGuardBench, the first comprehensive benchmark comprising 3,339 queries grounded in FDA guidelines... taxonomy of food safety principles... AutoDAN and PAP
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclearCurrent LLMs exhibit sparse safety alignment in the food-related domain... FoodGuard-4B... fine-tuned on our datasets
Reference graph
Works this paper leans on
-
[1]
Axel Backlund and Lukas Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840,
-
[2]
Juli Bakagianni, Korbinian Randl, Guido Rocchietti, Cosimo Rulli, Franco Maria Nardini, Aron Henriksson, Salvatore Trani, Anna Romanova, and John Pavlopoulos. Foodsafesum: Enabling natural language processing applications for food safety document summa- rization and analysis. InConference on Empirical Methods in Natural Language Processing (EMNLP), Novemb...
-
[3]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/ 3442188.3445922. Zana Buc ¸inca, Maja Barbara Malaya, and Krzysztof Z. Gajos. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in ai-assisted decision-making.Proc. ACM Hum. Comput. Interact., 5(CSCW1):188:1–188:21,
-
[4]
doi:10.1145/3449287 Gilad Chen, Stanley M
doi: 10.1145/3449287. URL https://doi.org/10.1145/3449287. Antonella Cavazza, Monica Mattarozzi, Arianna Franzoni, and Maria Careri. A spotlight on analytical prospects in food allergens: From emerging allergens and novel foods to bioplastics and plant-based sustainable food contact materials.Food Chemistry, 388:132951,
-
[5]
Jailbreaking Black Box Large Language Models in Twenty Queries
URL https://arxiv.org/abs/2310.08419. Kai Chen, Taihang Zhen, Hewei Wang, Kailai Liu, Xinfeng Li, Jing Huo, Tianpei Yang, Jinfeng Xu, Wei Dong, and Yang Gao. Medsentry: Understanding and mitigating safety risks in medical llm multi-agent systems.arXiv preprint arXiv:2505.20824,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Artificial Intelligence for Food Innovation
Bianca Datta, Markus J Buehler, Yvonne Chow, Kristina Gligoric, Dan Jurafsky, David L Kaplan, Rodrigo Ledesma-Amaro, Giorgia Del Missier, Lisa Neidhardt, Karim Pichara, et al. Ai for sustainable future foods.arXiv preprint arXiv:2509.21556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Za ¨ıd Harchaoui, and Yejin Choi
Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Za ¨ıd Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. In Alice Oh, Tristan Nau- mann, Amir Globerson, Kate S...
work page 2023
-
[8]
Fred Fung, Huei-Shyong Wang, and Suresh Menon
URL http://papers.nips.cc/paper files/paper/2023/hash/ deb3c28192f979302c157cb653c15e90-Abstract-Conference.html. Fred Fung, Huei-Shyong Wang, and Suresh Menon. Food safety in the 21st century.Biomedi- cal journal, 41(2):88–95,
work page 2023
-
[9]
URLhttps://arxiv.org/abs/2406.12793. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aureli...
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
URLhttps://arxiv.org/abs/2407.21783. Shabnam Hassani, Mehrdad Sabetzadeh, and Daniel Amyot. An empirical study on llm- based classification of requirements-related provisions in food-safety regulations.Empiri- cal Software Engineering, 30(3):72,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Association for Computing Machinery. ISBN 9798400714542. doi: 10.1145/3711896.3737384. URLhttps://doi.org/10.1145/3711896.3737384. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai co...
-
[12]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
URL https: //arxiv.org/abs/2312.06674. Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks T rack, 2023a....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3571730
-
[13]
Suyog Joshi, Soumyajit Basu, Lipika Dey, and Partha Pratim Das
URL https://arxiv.org/ abs/2603.03415. Suyog Joshi, Soumyajit Basu, Lipika Dey, and Partha Pratim Das. LLM driven legal text analytics: A case study for food safety violation cases. In Ashutosh Modi, Saptarshi Ghosh, Asif Ekbal, Pawan Goyal, Sarika Jain, Abhinav Joshi, Shivani Mishra, Debtanu Datta, Shounak Paul, Kshetrimayum Boynao Singh, and Sandeep Kum...
-
[14]
Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, and Qiang Zhang. Scisafeeval: A comprehensive benchmark for safety alignment of large language models in scientific tasks, 2024a. URLhttps://arxiv.org/abs/2410.03769. Xuan Li, Zhanke Zhou,...
-
[15]
When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life
Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, et al. When helpers become hazards: A benchmark for analyzing multimodal llm-powered safety in daily life.arXiv preprint arXiv:2601.04043,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
URLhttps://arxiv.org/abs/2507.19672. Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. InFirst Conference on Language Modeling,
-
[17]
Dy- namic guided and domain applicable safeguards for enhanced security in large language models
Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bin Feng, Yuan Yao, and Yu Li. Dy- namic guided and domain applicable safeguards for enhanced security in large language models. InFindings of the Association for Computational Linguistics: NAACL 2025,
work page 2025
-
[18]
URLhttps://arxiv.org/abs/2402.16717. Meta AI. Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/ blog/meta-llama-3-1/, 2024a. Meta AI. Llama 3.3 model card and prompt formats. https://www.llama.com/docs/model- cards-and-prompt-formats/llama3, 2024b. Accessed: 2024-12-06. Meta AI. Llama acceptable use policy.https://ai.meta.com/ll...
-
[19]
ISSN 2072-6643. doi: 10.3390/nu17223515. URL https://www.mdpi.com/ 2072-6643/17/22/3515. OpenAI. Gpt-4.1 system card. Technical report, OpenAI, 2024a. URL https://openai.com/ index/gpt-4-1/. OpenAI. Gpt-4o system card. Technical report, OpenAI, October 2024b. URL https: //openai.com/index/gpt-4o-system-card. OpenAI. Usage policies. https://openai.com/zh-H...
-
[20]
Safety alignment should be made more than just a few tokens deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,
work page 2025
-
[21]
URLhttps://arxiv.org/abs/2412.15115. Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, and Jonathan Cohen. Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
URLhttps://arxiv.org/abs/2310.10501. Malik Sallam. Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns.Healthcare, 11(6),
-
[23]
doi: 10.3390/healthcare11060887
ISSN 2227-9032. doi: 10.3390/healthcare11060887. URL https://www.mdpi.com/2227-9032/11/ 6/887. Callum Sharrock, Lukas Petersson, Hanna Petersson, Axel Backlund, Axel Wennstr ¨om, Kristoffer Nordstr¨om, and Elias Aronsson. Butter-bench: Evaluating llm controlled robots for practical intelligence.arXiv preprint arXiv:2510.21860,
-
[24]
ISSN 1931-0145. doi: 10.1145/3748239.3748242. URL https://doi.org/10.1145/3748239.3748242. Alain D. Starke, Jutta Dierkes, G ¨ulen Arslan Lied, Gloria A.B. Kasangu, and Christoph Trattner. Supporting healthier food choices through ai-tailored advice: A research agenda.PEC Innovation, 6:100372,
-
[25]
doi: https://doi.org/ 15 Preprint 10.1016/j.pecinn.2025.100372
ISSN 2772-6282. doi: https://doi.org/ 15 Preprint 10.1016/j.pecinn.2025.100372. URL https://www.sciencedirect.com/science/article/ pii/S2772628225000019. Vahidullah Tac, Christopher Gardner, and Ellen Kuhl. Generative artificial intelligence creates delicious, sustainable, and nutritious burgers.arXiv preprint arXiv:2602.03092,
-
[26]
What can large language models do for sustainable food?arXiv preprint arXiv:2503.04734,
Anna T Thomas, Adam Yee, Andrew Mayne, Maya B Mathur, Dan Jurafsky, and Kristina Gligori´c. What can large language models do for sustainable food?arXiv preprint arXiv:2503.04734,
-
[27]
U.S. Food and Drug Administration. Food Code 2022: Recommendations of the United States Public Health Service, Food and Drug Administration,
work page 2022
-
[28]
fda.gov/food/fda-food-code/food-code-2022
URL https://www. fda.gov/food/fda-food-code/food-code-2022 . Most recent version dated January 18,
work page 2022
-
[29]
PATeam at SemEval- 2025 task 9: LLM-augmented fusion for AI-driven food safety hazard detection
Xue Wan, Fengping Su, Ling Sun, Yuyang Lin, and Pengfei Chen. PATeam at SemEval- 2025 task 9: LLM-augmented fusion for AI-driven food safety hazard detection. In Sara Rosenthal, Aiala Ros´a, Debanjan Ghosh, and Marcos Zampieri (eds.),Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pp. 1912–1918, Vienna, Austria, July
work page 2025
-
[30]
Association for Computational Linguistics. ISBN 979-8-89176-273-2. URLhttps://aclanthology.org/2025.semeval-1.249/. Ke Wang, Miranda Mirosa, Yakun Hou, and Phil Bremer. Advancing food safety behavior with ai: Innovations and opportunities in the food manufacturing sector.T rends in food science & technology, 161:105050,
work page 2025
-
[31]
Do-not- answer: A dataset for evaluating safeguards in llms.CoRR, abs/2308.13387,
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not- answer: A dataset for evaluating safeguards in llms.CoRR, abs/2308.13387,
-
[32]
Language Model Cascades: Token-Level Uncertainty and Beyond
doi: 10.48550/ARXIV .2308.13387. URLhttps://doi.org/10.48550/arXiv.2308.13387. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.),Advances in Neural Information Processing Systems 36: Annual Conference on Neura...
work page internal anchor Pith review doi:10.48550/arxiv 2023
-
[33]
Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Aky¨urek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. In Kevin Duh, He- lena G´omez-Adorno, and Steven Bethard (eds.),Proceedings of the 2024 Conference of the North American C...
work page 2024
-
[34]
In: Al-Onaizan, Y., Bansal, M., Chen, Y.N
doi: 10.18653/V1/2024. NAACL-LONG.102. URLhttps://doi.org/10.18653/v1/2024.naacl-long.102. Zhen Xiang, Aliyah R. Hsu, Austin V . Zane, Aaron E. Kornblith, Margaret J. Lin-Martore, Jasmanpreet C. Kaur, Vasuda M. Dokiparthi, Bo Li, and Bin Yu. Cdr-agent: Intelligent selection and execution of clinical decision rules using large language model agents,
-
[35]
16 Preprint Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, and Muhao Chen
URLhttps://arxiv.org/abs/2505.23055. 16 Preprint Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, and Muhao Chen. Cognitive overload: Jailbreaking large language models with overloaded logical thinking. In Findings of the Association for Computational Linguistics: NAACL 2024,
-
[36]
URLhttps://arxiv.org/abs/2505.09388. Zheng Xin Yong, Cristina Menghini, and Stephen Bach. Low-resource languages jailbreak GPT-4. InSocially Responsible Language Modelling Research,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
URL https://openreview. net/forum?id=pn83r8V2sv. Huizi Yu, Lizhou Fan, Lingyao Li, Jiayan Zhou, Zihui Ma, Lu Xian, Wenyue Hua, Sijia He, Mingyu Jin, Yongfeng Zhang, Ashvin Gandhi, and Xin Ma. Large language models in biomedical and health informatics: A review with bibliometric analysis.Journal of Healthcare Informatics Research, 8(4):658–711, 12 2024a. I...
-
[38]
doi: 10.18653/V1/2024.FINDINGS-EMNLP .79. URLhttps://doi.org/10.18653/v1/2024.findings-emnlp.79. Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, and Yanfang Ye. Food4all: A multi-agent framework for real-time free food discovery with integrated nutritional metadata.arXiv preprint arXiv:2510.18289,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.findings-emnlp 2024
-
[39]
Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, Pengjun Xie, Qiaoyu Tang, Qin Zhu, Rong Zhang, Shibin Wu, Shuo Zhang, Tao He, Tianyi Tang, Tingyu Xia, Wei Liao, Wei...
work page internal anchor Pith review arXiv
-
[40]
Weak-to-strong jailbreaking on large language models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. InForty- second International Conference on Machine Learning, 2025b. 17 Preprint Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Ch...
-
[41]
Universal and Transferable Adversarial Attacks on Aligned Language Models
URL https://arxiv.org/abs/2307.15043. A Appendix This appendix contains additional details for the“Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models”. The appendix is shown as follows: • §BLLM Usage Statement • §CData Construction –C.1 Data Generation –C.2 Data Distribution • §DExperiment Setting –D.1 Attack Experiment...
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
Based on this categorization, the final Fleiss’ Kappa among the three human experts is 0.4684
The three human experts were selected based on a pilot test using 10 random samples from the dataset. Based on this categorization, the final Fleiss’ Kappa among the three human experts is 0.4684. This demonstrates a moderate level of agreement among the three human experts. 19 Preprint Score Classification Description & Criteria 0Completely Invalid The m...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.