Child Safety in Generative AI: An Expert-Guided and Incident-Grounded Evaluation Framework
Pith reviewed 2026-07-02 06:59 UTC · model grok-4.3
The pith
Llama Guard models struggle to detect unsafe prompts involving children in education settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that integrating expert-guided risk factors with real-world AI incident data produces hazard categories that, when turned into a synthetic test set, expose limitations in existing safety models. When this framework is applied to the education domain, three Llama Guard models demonstrate clear difficulty identifying unsafe user prompts that involve children. The paper concludes that future work should extend the same approach to more risk categories and bring domain experts into the evaluation process from the beginning.
What carries the argument
The expert-guided and incident-grounded evaluation framework that extracts hazard categories from guidelines and incident databases to construct synthetic test sets for model evaluation.
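The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `TestCase`, `merge_hazard_categories`, `build_test_set`, and `stub_generate` are hypothetical, the category strings are placeholders, and the stub stands in for whatever generator model produces the prompts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class TestCase:
    prompt: str
    hazard: Optional[str]  # None marks a safe control prompt
    label: str             # "unsafe" or "safe"


def merge_hazard_categories(guideline_hazards: Iterable[str],
                            incident_hazards: Iterable[str]) -> List[str]:
    """Union of both sources, normalized and de-duplicated, order preserved."""
    seen, merged = set(), []
    for name in list(guideline_hazards) + list(incident_hazards):
        key = name.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


def build_test_set(hazards: Iterable[str],
                   generate: Callable[[Optional[str]], List[str]]) -> List[TestCase]:
    """Label each prompt from the hazard it was generated for (assumed scheme)."""
    cases = []
    for hazard in hazards:
        for prompt in generate(hazard):
            cases.append(TestCase(prompt, hazard, "unsafe"))
    for prompt in generate(None):  # safe controls
        cases.append(TestCase(prompt, None, "safe"))
    return cases


def stub_generate(hazard: Optional[str]) -> List[str]:
    # Placeholder for a generator model; returns inert template strings only.
    if hazard is None:
        return ["<safe educational request>"]
    return [f"<unsafe request illustrating: {hazard}>"]


hazards = merge_hazard_categories(["Category A", "Category B"],
                                  ["category b", "Category C"])
test_set = build_test_set(hazards, stub_generate)
```

Assigning the label from the originating hazard keeps labeling reproducible, but it also means any mismatch between a generated prompt and its intended hazard goes undetected unless the set is reviewed afterwards.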
If this is right
- The same framework can be extended to additional risk categories beyond education.
- Incorporating domain experts throughout the evaluation pipeline can improve future safety assessments.
- Synthetic test sets built this way allow model evaluation without using real harmful content.
- Current models such as Llama Guard require targeted improvements to handle child-specific unsafe prompts in education contexts.
Where Pith is reading between the lines
- If adopted more widely, the framework could push developers to test safety classifiers against child-specific scenarios before release.
- The method could be applied to other safety models besides Llama Guard to check for similar blind spots.
- Linking incident databases directly to test-set creation may make safety evaluations more representative of documented harms.
Load-bearing premise
Hazard categories drawn from expert guidelines and AI incident databases accurately and comprehensively capture the child-specific risks that arise when generative AI is used in education and similar domains.
What would settle it
Running the three Llama Guard models on the constructed education-domain synthetic test set and finding that they correctly classify the large majority of unsafe prompts as unsafe would contradict the reported result.
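One way to make this test concrete is to compute, for each model, the fraction of unsafe prompts it flags as unsafe. The sketch below is illustrative only: the label strings, the function names, and especially the 0.9 cutoff for "large majority" are assumptions, since the review does not quantify that phrase, and the example labels are made up rather than taken from the paper.

```python
from typing import Sequence


def unsafe_recall(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of gold-unsafe prompts that the model also labeled unsafe."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    unsafe_total = sum(1 for g in gold if g == "unsafe")
    if unsafe_total == 0:
        raise ValueError("test set contains no unsafe prompts")
    caught = sum(1 for g, p in zip(gold, predicted)
                 if g == "unsafe" and p == "unsafe")
    return caught / unsafe_total


def contradicts_reported_result(recall: float, threshold: float = 0.9) -> bool:
    # threshold is an assumed stand-in for "the large majority"
    return recall >= threshold


# Made-up labels for illustration; not results from the paper.
gold = ["unsafe", "unsafe", "unsafe", "unsafe", "safe", "safe"]
pred = ["unsafe", "safe", "safe", "unsafe", "safe", "safe"]
recall = unsafe_recall(gold, pred)
```

Reporting recall on the unsafe class separately from accuracy matters here, because a model that labels everything safe can still score well on a test set dominated by safe prompts.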
Original abstract
As generative AI is increasingly used by children and adolescents, there is a growing need for risk evaluation frameworks that account for child-specific harms. However, most existing safety evaluation frameworks focus on general user populations, often overlooking risks unique to younger users. To address this gap, we propose an evaluation framework that integrates expert-guided risk factors with real-world AI incident data for child safety. The framework identifies hazard categories from expert guidelines and AI incident databases and uses this information to construct a synthetic test set for model evaluation. Particularly, we apply the framework to the education domain and evaluate three Llama Guard models on their ability to detect unsafe user prompts. Our results show that current Llama Guard models struggle to identify education-related unsafe user prompts. We conclude by discussing how future work can extend the evaluation to additional risk categories and incorporate domain experts throughout the evaluation pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an expert-guided and incident-grounded evaluation framework for assessing child safety risks in generative AI. It extracts hazard categories from expert guidelines and AI incident databases, uses them to construct a synthetic test set focused on the education domain, and evaluates three Llama Guard models on their ability to detect unsafe user prompts, concluding that current models struggle with education-related unsafe prompts.
Significance. If the synthetic test set is shown to be representative and correctly labeled, the framework could provide a useful template for domain-specific safety evaluation that incorporates child-specific risks, addressing a gap in existing general-purpose safety benchmarks.
major comments (2)
- Abstract: The headline claim that 'current Llama Guard models struggle to identify education-related unsafe user prompts' is presented without any description of test-set construction, sample size, scoring criteria, error analysis, or inter-rater reliability, rendering the result impossible to evaluate from the given text.
- Abstract / framework description: The paper relies on hazard categories drawn from expert guidelines and incident databases to generate the synthetic test set, yet reports no post-generation validation (e.g., expert review for realism, coverage of actual child/adolescent education interactions, or label accuracy). Without this step the observed performance gap cannot be confidently attributed to model limitations rather than artifacts of the generation process.
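The label-accuracy check requested in the second comment could, for example, compare the generated labels against an expert's independent labels using Cohen's kappa. This is a sketch under assumptions: the paper reports no such review, the function name is hypothetical, and the two label lists are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                     for k in set(counts_a) | set(counts_b))
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented labels for illustration; not data from the paper.
generated = ["unsafe", "unsafe", "safe", "safe"]
expert = ["unsafe", "safe", "safe", "safe"]
kappa = cohens_kappa(generated, expert)
```

In this toy case observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5. A low value on a real sample would suggest that part of the measured performance gap comes from label noise rather than from the models.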
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We respond to each major comment below and will make revisions to improve clarity and address the noted gaps.
- Referee (Abstract): The headline claim that 'current Llama Guard models struggle to identify education-related unsafe user prompts' is presented without any description of test-set construction, sample size, scoring criteria, error analysis, or inter-rater reliability, rendering the result impossible to evaluate from the given text.
Authors: We agree the abstract requires more supporting detail to stand alone. The main text describes test-set construction (Section 3), reports the sample size and scoring approach (Section 4), and includes error analysis (Section 5). Inter-rater reliability is not applicable because labels are assigned deterministically from the hazard categories. We will revise the abstract to include a brief statement on test-set size, construction method, and scoring criteria.
Revision: yes
- Referee (Abstract / framework description): The paper relies on hazard categories drawn from expert guidelines and incident databases to generate the synthetic test set, yet reports no post-generation validation (e.g., expert review for realism, coverage of actual child/adolescent education interactions, or label accuracy). Without this step the observed performance gap cannot be confidently attributed to model limitations rather than artifacts of the generation process.
Authors: The manuscript presents an initial application of the framework and does not include post-generation validation. We will revise the abstract and add an explicit limitations subsection in the discussion that acknowledges the absence of expert review for realism and label accuracy. The revision will also outline how such validation can be incorporated in future extensions of the framework.
Revision: yes
Circularity Check
No circularity: framework draws from external guidelines and databases; results are direct model evaluations
The paper constructs hazard categories from external expert guidelines and AI incident databases, builds a synthetic test set, and reports empirical performance of Llama Guard models on that set. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked. The central claim (models struggle on education-related prompts) is an independent measurement on the constructed set rather than a quantity forced by definition or prior self-work. The derivation chain is self-contained against the stated external sources.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert guidelines and AI incident databases accurately capture child-specific hazards in generative AI.
Appendix A: Prompts for Test Set Generation
The Mistral-7B-Instruct model was used to generate a test set consisting of both safe and unsafe user requests in educational contexts. To ...