arxiv: 2604.01444 · v2 · submitted 2026-04-01 · 💻 cs.CR

Recognition: 2 theorem links

· Lean Theorem

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Chaowei Xiao, Hongyi Wang, Kristina Gligori\'c, Muhao Chen, Tenghao Huang, Weidi Luo, Xiaofei Wen, Zhen Xiang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:58 UTC · model grok-4.3

classification 💻 cs.CR

keywords food safetyLLM safetyjailbreak attacksbenchmarkguardrail modelFDA guidelinessafety evaluation

0 comments

The pith

LLMs lack food safety alignment and produce harmful advice under jailbreaks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds FoodGuardBench, a set of 3,339 queries drawn from FDA food safety guidelines, to test how well current LLMs handle food-related risks. It shows that standard models fail against common jailbreak tactics and then output concrete harmful instructions for food handling. Existing guardrail systems miss most of these domain-specific threats. The authors also release FoodGuard-4B, a fine-tuned model meant to block such inputs before they reach the main LLM.

Core claim

Current LLMs exhibit sparse safety alignment in the food-related domain, easily succumbing to a few canonical jailbreak strategies. When compromised, LLMs frequently generate actionable yet harmful instructions. Existing LLM-based guardrails systematically overlook these domain-specific threats. To address this, we introduce FoodGuardBench and FoodGuard-4B.

What carries the argument

FoodGuardBench, a benchmark of 3,339 FDA-grounded queries combined with jailbreak attacks to measure LLM safety failures in food preparation tasks.

If this is right

LLMs used for recipe or meal-planning advice could recommend unsafe food storage or preparation steps that cause illness.
Malicious actors can more readily obtain concrete harmful food instructions than in other domains.
Guardrail developers must add domain-specific training data such as FDA rules to catch food safety violations.
Specialized fine-tuned models like FoodGuard-4B can be inserted as filters to reduce the rate of harmful outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar sparse alignment likely appears in other regulated health domains such as medication advice or nutrition claims.
The method of grounding benchmarks in official guidelines could be repeated for legal or financial query risks.
Public-facing cooking assistants would benefit from continuous real-world query monitoring beyond any static benchmark.

Load-bearing premise

The 3,339 queries built from FDA guidelines and representative jailbreak examples cover the main food safety risks users would actually pose to LLMs.

What would settle it

Test the same LLMs and guardrails on a new collection of food safety queries drawn from real user forums or public health incident reports and check whether failure rates match the benchmark results.

Figures

Figures reproduced from arXiv: 2604.01444 by Chaowei Xiao, Hongyi Wang, Kristina Gligori\'c, Muhao Chen, Tenghao Huang, Weidi Luo, Xiaofei Wen, Zhen Xiang.

**Figure 1.** Figure 1: Consumer survey results indicate substantial openness to GenAI for food-related assistance, with meal planning and menu suggestions (47%), personalized nutrition and diet plans (45%), and grocery budgeting support (41%) among the most accepted use cases; only 15% of respondents selected none of the above. Source: PwC Voice of the Consumer Survey 2025. consequential all depend on the characteristics of the … view at source ↗

**Figure 2.** Figure 2: Data generation pipeline. To construct FoodGuardBench, we first derive seed safety principles from the FDA food safety taxonomy and regulations, such as contamination and temperature control. Next, we generate a broad spectrum of benign and harmful queries by injecting benign or malicious user intents into these seed principles. Finally, we apply similarity constraints coupled with manual review to guaran… view at source ↗

**Figure 4.** Figure 4: T-SNE visualization of the dataset distribution. 4 Experiment Setup In this section, we will introduce our experiment settings: 6 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Results of Human Evaluation on the Response of LLMs. The majority of models demonstrate the ability to provide effective information for malicious queries. 5.3 Guardrail on Food-related Domain In this section, we analyze the limitations of existing guardrails and evaluate the performance of FoodGuard-4B, which is fine-tuned on our curated dataset. By combining all vanilla harmful and jailbreak queries, w… view at source ↗

**Figure 6.** Figure 6: A prompt for generating malicious questions related to food science regulation. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: ASR of different attack methods among all models 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly deployed for everyday tasks, including food preparation and health-related guidance. However, food safety remains a high-stakes domain where inaccurate or misleading information can cause severe real-world harm. Despite these risks, current LLMs and safety guardrails lack rigorous alignment tailored to domain-specific food hazards. To address this gap, we introduce FoodGuardBench, the first comprehensive benchmark comprising 3,339 queries grounded in FDA guidelines, designed to evaluate the safety and robustness of LLMs. By constructing a taxonomy of food safety principles and employing representative jailbreak attacks (e.g., AutoDAN and PAP), we systematically evaluate existing LLMs and guardrails. Our evaluation results reveal three critical vulnerabilities: First, current LLMs exhibit sparse safety alignment in the food-related domain, easily succumbing to a few canonical jailbreak strategies. Second, when compromised, LLMs frequently generate actionable yet harmful instructions, inadvertently empowering malicious actors and posing tangible risks. Third, existing LLM-based guardrails systematically overlook these domain-specific threats, failing to detect a substantial volume of malicious inputs. To mitigate these vulnerabilities, we introduce FoodGuard-4B, a specialized guardrail model fine-tuned on our datasets to safeguard LLMs within food-related domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A narrow but practical benchmark for food safety risks in LLMs that flags real gaps but needs tighter validation on its queries and actual numbers.

read the letter

The paper's core contribution is FoodGuardBench, a 3,339-query set built from FDA food safety guidelines and paired with standard jailbreak templates, plus the FoodGuard-4B guardrail they fine-tuned on it. It shows current models and guardrails fail on food-related prompts in three ways: weak domain alignment, generation of concrete harmful instructions, and blind spots in existing safety layers. That is the useful part. They picked a high-stakes everyday domain and turned FDA rules into testable queries instead of staying at the level of generic harm categories. Releasing both the benchmark and the small guardrail model gives others something concrete to test against or extend. That counts as real work even if the underlying fine-tuning method is ordinary. The evaluation identifies clear failure modes without overclaiming generality. The soft spots are straightforward. The abstract gives no quantitative results, no error bars, and no breakdown of how the queries were checked for realism. The construction grafts FDA fragments onto AutoDAN and PAP templates, which is efficient but leaves open whether the distribution matches what normal users actually type. Without logs, expert plausibility checks, or coverage of ambiguous or culturally specific cases, the measured failure rates could be inflated by the adversarial phrasing. That is the main limitation, not a fatal one but something that needs addressing for the claims to land solidly. This paper is for people working on applied safety in regulated domains like health or consumer advice. A reader who needs a starting benchmark for food-specific risks will get value from the artifacts and the taxonomy. It is not a broad alignment advance, but it is focused and honest about its scope. I would send it to peer review. The new resources and the concrete problem statement are worth referee time, provided the authors add the missing numbers and some validation step for the query set.

Referee Report

3 major / 2 minor

Summary. The paper introduces FoodGuardBench, a benchmark of 3,339 queries grounded in FDA guidelines and combined with jailbreak templates (AutoDAN, PAP), to evaluate LLMs and existing guardrails on food-safety alignment. It identifies three vulnerabilities—sparse domain-specific safety alignment, generation of actionable harmful instructions, and systematic guardrail blind spots—and proposes FoodGuard-4B, a fine-tuned 4B-parameter guardrail, as mitigation.

Significance. If the benchmark queries prove representative of real user intent, the work is significant for exposing an under-aligned high-stakes domain and supplying both diagnostic evidence and a concrete guardrail. The empirical focus on concrete failure modes and the release of a specialized model could usefully inform deployment practices in food-related applications.

major comments (3)

[Benchmark construction] Benchmark construction (abstract and §3): queries are formed by grafting FDA guideline fragments onto canonical jailbreak templates, yet no validation against real query logs, expert plausibility review, or coverage of naturalistic/culturally specific phrasing is reported. This directly affects whether measured failure rates support the claim of tangible real-world risks.
[Evaluation results] Evaluation results (abstract): the manuscript states that systematic evaluation reveals three critical vulnerabilities, but provides no quantitative metrics, success rates, error bars, per-model breakdowns, or statistical tests. Without these, the magnitude and robustness of the reported vulnerabilities cannot be assessed.
[FoodGuard-4B] FoodGuard-4B (proposed mitigation section): details on the fine-tuning dataset composition, training hyperparameters, and head-to-head performance against existing guardrails on the benchmark are missing, leaving the effectiveness of the proposed solution unverified.

minor comments (2)

Clarify the precise taxonomy of food-safety principles and the exact procedure for combining FDA text with jailbreak templates.
Add explicit statements on query deduplication, length distribution, and any filtering steps applied to the 3,339 queries.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Benchmark construction] Benchmark construction (abstract and §3): queries are formed by grafting FDA guideline fragments onto canonical jailbreak templates, yet no validation against real query logs, expert plausibility review, or coverage of naturalistic/culturally specific phrasing is reported. This directly affects whether measured failure rates support the claim of tangible real-world risks.

Authors: We agree that additional validation would strengthen claims of real-world relevance. Our construction deliberately anchors queries in FDA guidelines for factual accuracy and employs established jailbreak templates (AutoDAN, PAP) to isolate domain-specific vulnerabilities. In the revised version we will expand §3 with an explicit limitations subsection discussing the absence of proprietary query logs, the rationale for the grafting approach, and plans for future expert review and cultural coverage. We will also report the internal consistency checks performed during dataset curation. revision: partial
Referee: [Evaluation results] Evaluation results (abstract): the manuscript states that systematic evaluation reveals three critical vulnerabilities, but provides no quantitative metrics, success rates, error bars, per-model breakdowns, or statistical tests. Without these, the magnitude and robustness of the reported vulnerabilities cannot be assessed.

Authors: Section 4 already contains per-model success rates, confusion matrices, and breakdowns across the 3,339 queries. To improve clarity we will (1) revise the abstract to include headline quantitative results (e.g., average attack success rates and guardrail detection gaps) and (2) add error bars plus any applicable statistical comparisons to the existing tables and figures. revision: yes
Referee: [FoodGuard-4B] FoodGuard-4B (proposed mitigation section): details on the fine-tuning dataset composition, training hyperparameters, and head-to-head performance against existing guardrails on the benchmark are missing, leaving the effectiveness of the proposed solution unverified.

Authors: We will expand the FoodGuard-4B section with the requested details: dataset composition (size, balance of safe/unsafe examples, source distribution), full training hyperparameters (learning rate, batch size, epochs, optimizer), and direct benchmark comparisons against Llama-Guard, OpenAI moderation, and other baselines, including numerical performance deltas. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no load-bearing circular derivations

full rationale

The paper constructs FoodGuardBench from external FDA guidelines and standard jailbreak templates (AutoDAN, PAP), then reports empirical failure rates of existing LLMs and guardrails on those queries before fine-tuning FoodGuard-4B. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. Central claims rest on direct measurement rather than self-definition or self-citation chains. A score of 2 accounts for routine self-citations in related-work sections that do not carry the main results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the assumption that FDA guidelines constitute authoritative and complete coverage of food-safety hazards and that the chosen jailbreak methods (AutoDAN, PAP) are representative of realistic attacks.

axioms (1)

domain assumption FDA guidelines are treated as the authoritative source for food safety principles.
The benchmark queries are grounded in FDA guidelines without further validation against other regulatory sources or real-world incident data.

pith-pipeline@v0.9.0 · 5547 in / 1266 out tokens · 34921 ms · 2026-05-13T21:58:57.870862+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We introduce FoodGuardBench, the first comprehensive benchmark comprising 3,339 queries grounded in FDA guidelines... taxonomy of food safety principles... AutoDAN and PAP
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Current LLMs exhibit sparse safety alignment in the food-related domain... FoodGuard-4B... fine-tuned on our datasets

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 12 internal anchors

[1]

Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840,

Axel Backlund and Lukas Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840,

work page arXiv
[2]

Foodsafesum: Enabling natural language processing applications for food safety document summa- rization and analysis

Juli Bakagianni, Korbinian Randl, Guido Rocchietti, Cosimo Rulli, Franco Maria Nardini, Aron Henriksson, Salvatore Trani, Anna Romanova, and John Pavlopoulos. Foodsafesum: Enabling natural language processing applications for food safety document summa- rization and analysis. InConference on Empirical Methods in Natural Language Processing (EMNLP), Novemb...

work page doi:10.18653/v1/2025.findings-emnlp.911 2025
[3]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/ 3442188.3445922. Zana Buc ¸inca, Maja Barbara Malaya, and Krzysztof Z. Gajos. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in ai-assisted decision-making.Proc. ACM Hum. Comput. Interact., 5(CSCW1):188:1–188:21,

work page doi:10.1145/3442188.3445922
[4]

doi:10.1145/3449287 Gilad Chen, Stanley M

doi: 10.1145/3449287. URL https://doi.org/10.1145/3449287. Antonella Cavazza, Monica Mattarozzi, Arianna Franzoni, and Maria Careri. A spotlight on analytical prospects in food allergens: From emerging allergens and novel foods to bioplastics and plant-based sustainable food contact materials.Food Chemistry, 388:132951,

work page doi:10.1145/3449287
[5]

Jailbreaking Black Box Large Language Models in Twenty Queries

URL https://arxiv.org/abs/2310.08419. Kai Chen, Taihang Zhen, Hewei Wang, Kailai Liu, Xinfeng Li, Jing Huo, Tianpei Yang, Jinfeng Xu, Wei Dong, and Yang Gao. Medsentry: Understanding and mitigating safety risks in medical llm multi-agent systems.arXiv preprint arXiv:2505.20824,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Artificial Intelligence for Food Innovation

Bianca Datta, Markus J Buehler, Yvonne Chow, Kristina Gligoric, Dan Jurafsky, David L Kaplan, Rodrigo Ledesma-Amaro, Giorgia Del Missier, Lisa Neidhardt, Karim Pichara, et al. Ai for sustainable future foods.arXiv preprint arXiv:2509.21556,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Za ¨ıd Harchaoui, and Yejin Choi

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Za ¨ıd Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. In Alice Oh, Tristan Nau- mann, Amir Globerson, Kate S...

work page 2023
[8]

Fred Fung, Huei-Shyong Wang, and Suresh Menon

URL http://papers.nips.cc/paper files/paper/2023/hash/ deb3c28192f979302c157cb653c15e90-Abstract-Conference.html. Fred Fung, Huei-Shyong Wang, and Suresh Menon. Food safety in the 21st century.Biomedi- cal journal, 41(2):88–95,

work page 2023
[9]

URLhttps://arxiv.org/abs/2406.12793. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aureli...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

The Llama 3 Herd of Models

URLhttps://arxiv.org/abs/2407.21783. Shabnam Hassani, Mehrdad Sabetzadeh, and Daniel Amyot. An empirical study on llm- based classification of requirements-related provisions in food-safety regulations.Empiri- cal Software Engineering, 30(3):72,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

ISBN 9798400714542

Association for Computing Machinery. ISBN 9798400714542. doi: 10.1145/3711896.3737384. URLhttps://doi.org/10.1145/3711896.3737384. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai co...

work page doi:10.1145/3711896.3737384
[12]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

URL https: //arxiv.org/abs/2312.06674. Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks T rack, 2023a....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3571730
[13]

Suyog Joshi, Soumyajit Basu, Lipika Dey, and Partha Pratim Das

URL https://arxiv.org/ abs/2603.03415. Suyog Joshi, Soumyajit Basu, Lipika Dey, and Partha Pratim Das. LLM driven legal text analytics: A case study for food safety violation cases. In Ashutosh Modi, Saptarshi Ghosh, Asif Ekbal, Pawan Goyal, Sarika Jain, Abhinav Joshi, Shivani Mishra, Debtanu Datta, Shounak Paul, Kshetrimayum Boynao Singh, and Sandeep Kum...

work page doi:10.18653/v1/2025.justnlp-main.6 2025
[14]

Scisafeeval: A comprehensive benchmark for safety alignment of large language models in scientific tasks, 2024a

Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, and Qiang Zhang. Scisafeeval: A comprehensive benchmark for safety alignment of large language models in scientific tasks, 2024a. URLhttps://arxiv.org/abs/2410.03769. Xuan Li, Zhanke Zhou,...

work page arXiv
[15]

When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, et al. When helpers become hazards: A benchmark for analyzing multimodal llm-powered safety in daily life.arXiv preprint arXiv:2601.04043,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Alignment and safety in large language models: Safety mechanisms, training paradigms, and emerging challenges.arXiv preprint arXiv:2507.19672, 2025

URLhttps://arxiv.org/abs/2507.19672. Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. InFirst Conference on Language Modeling,

work page arXiv
[17]

Dy- namic guided and domain applicable safeguards for enhanced security in large language models

Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bin Feng, Yuan Yao, and Yu Li. Dy- namic guided and domain applicable safeguards for enhanced security in large language models. InFindings of the Association for Computational Linguistics: NAACL 2025,

work page 2025
[18]

URLhttps://arxiv.org/abs/2402.16717. Meta AI. Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/ blog/meta-llama-3-1/, 2024a. Meta AI. Llama 3.3 model card and prompt formats. https://www.llama.com/docs/model- cards-and-prompt-formats/llama3, 2024b. Accessed: 2024-12-06. Meta AI. Llama acceptable use policy.https://ai.meta.com/ll...

work page arXiv 2024
[19]

doi: 10.3390/nu17223515

ISSN 2072-6643. doi: 10.3390/nu17223515. URL https://www.mdpi.com/ 2072-6643/17/22/3515. OpenAI. Gpt-4.1 system card. Technical report, OpenAI, 2024a. URL https://openai.com/ index/gpt-4-1/. OpenAI. Gpt-4o system card. Technical report, OpenAI, October 2024b. URL https: //openai.com/index/gpt-4o-system-card. OpenAI. Usage policies. https://openai.com/zh-H...

work page doi:10.3390/nu17223515 2072
[20]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

work page 2025
[21]

Qwen2.5 Technical Report

URLhttps://arxiv.org/abs/2412.15115. Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, and Jonathan Cohen. Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Malik Sallam

URLhttps://arxiv.org/abs/2310.10501. Malik Sallam. Chatgpt utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns.Healthcare, 11(6),

work page arXiv
[23]

doi: 10.3390/healthcare11060887

ISSN 2227-9032. doi: 10.3390/healthcare11060887. URL https://www.mdpi.com/2227-9032/11/ 6/887. Callum Sharrock, Lukas Petersson, Hanna Petersson, Axel Backlund, Axel Wennstr ¨om, Kristoffer Nordstr¨om, and Elias Aronsson. Butter-bench: Evaluating llm controlled robots for practical intelligence.arXiv preprint arXiv:2510.21860,

work page doi:10.3390/healthcare11060887
[24]

doi: 10.1145/3748239.3748242

ISSN 1931-0145. doi: 10.1145/3748239.3748242. URL https://doi.org/10.1145/3748239.3748242. Alain D. Starke, Jutta Dierkes, G ¨ulen Arslan Lied, Gloria A.B. Kasangu, and Christoph Trattner. Supporting healthier food choices through ai-tailored advice: A research agenda.PEC Innovation, 6:100372,

work page doi:10.1145/3748239.3748242 1931
[25]

doi: https://doi.org/ 15 Preprint 10.1016/j.pecinn.2025.100372

ISSN 2772-6282. doi: https://doi.org/ 15 Preprint 10.1016/j.pecinn.2025.100372. URL https://www.sciencedirect.com/science/article/ pii/S2772628225000019. Vahidullah Tac, Christopher Gardner, and Ellen Kuhl. Generative artificial intelligence creates delicious, sustainable, and nutritious burgers.arXiv preprint arXiv:2602.03092,

work page doi:10.1016/j.pecinn.2025.100372 2025
[26]

What can large language models do for sustainable food?arXiv preprint arXiv:2503.04734,

Anna T Thomas, Adam Yee, Andrew Mayne, Maya B Mathur, Dan Jurafsky, and Kristina Gligori´c. What can large language models do for sustainable food?arXiv preprint arXiv:2503.04734,

work page arXiv
[27]

Food and Drug Administration

U.S. Food and Drug Administration. Food Code 2022: Recommendations of the United States Public Health Service, Food and Drug Administration,

work page 2022
[28]

fda.gov/food/fda-food-code/food-code-2022

URL https://www. fda.gov/food/fda-food-code/food-code-2022 . Most recent version dated January 18,

work page 2022
[29]

PATeam at SemEval- 2025 task 9: LLM-augmented fusion for AI-driven food safety hazard detection

Xue Wan, Fengping Su, Ling Sun, Yuyang Lin, and Pengfei Chen. PATeam at SemEval- 2025 task 9: LLM-augmented fusion for AI-driven food safety hazard detection. In Sara Rosenthal, Aiala Ros´a, Debanjan Ghosh, and Marcos Zampieri (eds.),Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pp. 1912–1918, Vienna, Austria, July

work page 2025
[30]

ISBN 979-8-89176-273-2

Association for Computational Linguistics. ISBN 979-8-89176-273-2. URLhttps://aclanthology.org/2025.semeval-1.249/. Ke Wang, Miranda Mirosa, Yakun Hou, and Phil Bremer. Advancing food safety behavior with ai: Innovations and opportunities in the food manufacturing sector.T rends in food science & technology, 161:105050,

work page 2025
[31]

Do-not- answer: A dataset for evaluating safeguards in llms.CoRR, abs/2308.13387,

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not- answer: A dataset for evaluating safeguards in llms.CoRR, abs/2308.13387,

work page arXiv
[32]

Language Model Cascades: Token-Level Uncertainty and Beyond

doi: 10.48550/ARXIV .2308.13387. URLhttps://doi.org/10.48550/arXiv.2308.13387. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.),Advances in Neural Information Processing Systems 36: Annual Conference on Neura...

work page internal anchor Pith review doi:10.48550/arxiv 2023
[33]

Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Aky¨urek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. In Kevin Duh, He- lena G´omez-Adorno, and Steven Bethard (eds.),Proceedings of the 2024 Conference of the North American C...

work page 2024
[34]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.N

doi: 10.18653/V1/2024. NAACL-LONG.102. URLhttps://doi.org/10.18653/v1/2024.naacl-long.102. Zhen Xiang, Aliyah R. Hsu, Austin V . Zane, Aaron E. Kornblith, Margaret J. Lin-Martore, Jasmanpreet C. Kaur, Vasuda M. Dokiparthi, Bo Li, and Bin Yu. Cdr-agent: Intelligent selection and execution of clinical decision rules using large language model agents,

work page doi:10.18653/v1/2024 2024
[35]

16 Preprint Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, and Muhao Chen

URLhttps://arxiv.org/abs/2505.23055. 16 Preprint Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, and Muhao Chen. Cognitive overload: Jailbreaking large language models with overloaded logical thinking. In Findings of the Association for Computational Linguistics: NAACL 2024,

work page arXiv 2024
[36]

Qwen3 Technical Report

URLhttps://arxiv.org/abs/2505.09388. Zheng Xin Yong, Cristina Menghini, and Stephen Bach. Low-resource languages jailbreak GPT-4. InSocially Responsible Language Modelling Research,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

net/forum?id=pn83r8V2sv

URL https://openreview. net/forum?id=pn83r8V2sv. Huizi Yu, Lizhou Fan, Lingyao Li, Jiayan Zhou, Zihui Ma, Lu Xian, Wenyue Hua, Sijia He, Mingyu Jin, Yongfeng Zhang, Ashvin Gandhi, and Xin Ma. Large language models in biomedical and health informatics: A review with bibliometric analysis.Journal of Healthcare Informatics Research, 8(4):658–711, 12 2024a. I...

work page doi:10.1007/s41666-024-00171-8 2024
[38]

Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

doi: 10.18653/V1/2024.FINDINGS-EMNLP .79. URLhttps://doi.org/10.18653/v1/2024.findings-emnlp.79. Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, and Yanfang Ye. Food4all: A multi-agent framework for real-time free food discovery with integrated nutritional metadata.arXiv preprint arXiv:2510.18289,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.findings-emnlp 2024
[39]

Qwen3Guard Technical Report

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, Pengjun Xie, Qiaoyu Tang, Qin Zhu, Rong Zhang, Shibin Wu, Shuo Zhang, Tao He, Tianyi Tang, Tingyu Xia, Wei Liao, Wei...

work page internal anchor Pith review arXiv
[40]

Weak-to-strong jailbreaking on large language models

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. InForty- second International Conference on Machine Learning, 2025b. 17 Preprint Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Ch...

work page arXiv
[41]

Universal and Transferable Adversarial Attacks on Aligned Language Models

URL https://arxiv.org/abs/2307.15043. A Appendix This appendix contains additional details for the“Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models”. The appendix is shown as follows: • §BLLM Usage Statement • §CData Construction –C.1 Data Generation –C.2 Data Distribution • §DExperiment Setting –D.1 Attack Experiment...

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Based on this categorization, the final Fleiss’ Kappa among the three human experts is 0.4684

The three human experts were selected based on a pilot test using 10 random samples from the dataset. Based on this categorization, the final Fleiss’ Kappa among the three human experts is 0.4684. This demonstrates a moderate level of agreement among the three human experts. 19 Preprint Score Classification Description & Criteria 0Completely Invalid The m...

work page arXiv