Child Safety in Generative AI: An Expert-Guided and Incident-Grounded Evaluation Framework

Haein Kong

arxiv: 2607.00395 · v1 · pith:E4B57O4Jnew · submitted 2026-07-01 · 💻 cs.HC

Pith reviewed 2026-07-02 06:59 UTC · model grok-4.3

classification 💻 cs.HC

keywords child safety · generative AI · evaluation framework · Llama Guard · education domain · AI incidents · hazard categories · synthetic test set

The pith

Llama Guard models struggle to detect unsafe prompts involving children in education settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an evaluation framework for child safety in generative AI that draws hazard categories from expert guidelines and real-world AI incident databases. It uses those categories to build a synthetic test set focused on the education domain and runs three Llama Guard models against it. The results indicate that the models often fail to flag education-related unsafe user prompts. This matters because children and adolescents increasingly use generative AI, yet most existing safety evaluation frameworks target general user populations and overlook age-specific harms. The work shows how incident data and expert input can be combined to surface those gaps in current detection systems.

Core claim

The central claim is that integrating expert-guided risk factors with real-world AI incident data produces hazard categories that, when turned into a synthetic test set, expose limitations in existing safety models. When this framework is applied to the education domain, three Llama Guard models demonstrate clear difficulty identifying unsafe user prompts that involve children. The paper concludes that future work should extend the same approach to more risk categories and bring domain experts into the evaluation process from the beginning.

What carries the argument

The expert-guided and incident-grounded evaluation framework that extracts hazard categories from guidelines and incident databases to construct synthetic test sets for model evaluation.
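
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `HazardCategory`, `merge_categories`, and `build_test_specs` are hypothetical, and the category strings are placeholders rather than the categories the paper actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardCategory:
    name: str
    sources: frozenset  # which source types mention this category


def merge_categories(guideline_names, incident_names):
    """Union category names from both sources and record their provenance."""
    merged = {}
    for source, names in (
        ("expert_guideline", guideline_names),
        ("incident_database", incident_names),
    ):
        for raw in names:
            # Normalise case and whitespace so near-duplicates collapse.
            key = " ".join(raw.lower().split())
            merged.setdefault(key, set()).add(source)
    return [HazardCategory(n, frozenset(s)) for n, s in sorted(merged.items())]


def build_test_specs(categories, domain, per_label):
    """One generation spec per (category, label, index).

    The label travels with the spec, so every generated prompt inherits
    its label from the category it was generated for.
    """
    specs = []
    for cat in categories:
        for label in ("unsafe", "safe"):
            for i in range(per_label):
                specs.append(
                    {"domain": domain, "category": cat.name, "label": label, "index": i}
                )
    return specs


# Placeholder category names, purely for illustration.
cats = merge_categories(["Hazard A", "Hazard B"], ["hazard b ", "Hazard C"])
specs = build_test_specs(cats, domain="education", per_label=2)
print(len(cats), len(specs))
```

Recording provenance per category makes it easy to see which hazards are backed by both expert guidance and documented incidents, and which rest on a single source.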

If this is right

The same framework can be extended to additional risk categories beyond education.
Incorporating domain experts throughout the evaluation pipeline can improve future safety assessments.
Synthetic test sets built this way allow model evaluation without using real harmful content.
Current models such as Llama Guard require targeted improvements to handle child-specific unsafe prompts in education contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If adopted more widely, the framework could push developers to test safety classifiers against child-specific scenarios before release.
The method could be applied to other safety models besides Llama Guard to check for similar blind spots.
Linking incident databases directly to test-set creation may make safety evaluations more representative of documented harms.
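
As a rough sketch of how the same test set could be run against several safety classifiers, the harness below reports per-category detection rates for each model. Everything here is an assumption for illustration: the classifiers are stub functions standing in for real guard models, and the example prompts and category names are placeholders.

```python
from collections import defaultdict


def recall_by_category(examples, classify):
    """Share of unsafe prompts flagged as unsafe, broken down by category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prompt, category, label in examples:
        if label != "unsafe":
            continue
        totals[category] += 1
        if classify(prompt) == "unsafe":
            hits[category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


def compare_models(examples, models):
    """Run every classifier over the same examples."""
    return {name: recall_by_category(examples, fn) for name, fn in models.items()}


# Placeholder data: (prompt, category, label).
examples = [
    ("p1 flagme", "cat_a", "unsafe"),
    ("p2", "cat_a", "unsafe"),
    ("p3 flagme", "cat_b", "unsafe"),
    ("p4", "cat_b", "safe"),
]

# Hypothetical stand-ins for real safety models.
models = {
    "always_safe": lambda prompt: "safe",
    "keyword": lambda prompt: "unsafe" if "flagme" in prompt else "safe",
}

report = compare_models(examples, models)
for name, per_category in report.items():
    print(name, per_category)
```

Because each classifier is just a function from prompt to label, a real model could be wrapped behind the same interface without changing the scoring code.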

Load-bearing premise

Hazard categories drawn from expert guidelines and AI incident databases accurately and comprehensively capture the child-specific risks that arise when generative AI is used in education and similar domains.

What would settle it

Running the three Llama Guard models on the constructed education-domain synthetic test set and finding that they correctly classify the large majority of unsafe prompts as unsafe would contradict the reported result.
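
The check described above amounts to measuring the detection rate (recall) on the unsafe prompts. A minimal sketch, assuming each example is a `(prompt, label)` pair and the model is wrapped as a function returning `"safe"` or `"unsafe"`; the `keyword_stub` classifier and the prompts are hypothetical placeholders, not the paper's data or models.

```python
def unsafe_detection_rate(examples, classify):
    """Fraction of unsafe prompts that the classifier flags as unsafe."""
    flagged = 0
    total = 0
    for prompt, label in examples:
        if label != "unsafe":
            continue  # safe prompts do not count toward recall
        total += 1
        if classify(prompt) == "unsafe":
            flagged += 1
    if total == 0:
        raise ValueError("no unsafe examples to score")
    return flagged / total


def keyword_stub(prompt):
    """Hypothetical stand-in for a guard model."""
    return "unsafe" if "bypass" in prompt.lower() else "safe"


examples = [
    ("placeholder unsafe prompt 1 (bypass)", "unsafe"),
    ("placeholder unsafe prompt 2", "unsafe"),
    ("placeholder safe prompt", "safe"),
]

rate = unsafe_detection_rate(examples, keyword_stub)
print(f"unsafe detection rate: {rate:.2f}")
```

What counts as "the large majority" is a judgment call the paper's reader has to make; the sketch only produces the number to compare against whatever threshold is chosen.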

Figures

Figures reproduced from arXiv: 2607.00395 by Haein Kong.

Original abstract

As generative AI is increasingly used by children and adolescents, there is a growing need for risk evaluation frameworks that account for child-specific harms. However, most existing safety evaluation frameworks focus on general user populations, often overlooking risks unique to younger users. To address this gap, we propose an evaluation framework that integrates expert-guided risk factors with real-world AI incident data for child safety. The framework identifies hazard categories from expert guidelines and AI incident databases and uses this information to construct a synthetic test set for model evaluation. Particularly, we apply the framework to the education domain and evaluate three Llama Guard models on their ability to detect unsafe user prompts. Our results show that current Llama Guard models struggle to identify education-related unsafe user prompts. We conclude by discussing how future work can extend the evaluation to additional risk categories and incorporate domain experts throughout the evaluation pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The framework idea makes sense for child-specific safety testing, but the Llama Guard result rests on an unvalidated synthetic test set, with no construction details or realism checks reported.

The paper's main point is a framework that pulls hazard categories from expert guidelines and AI incident databases, turns them into synthetic prompts focused on the education domain, and then tests Llama Guard models on those prompts. The reported finding is that the models miss education-related unsafe prompts.

What is actually new is the explicit combination of those two external sources to target child and adolescent risks rather than generic adult ones. The approach is straightforward and avoids inventing categories from scratch, which is a small but useful step.

The work does a reasonable job naming the gap in existing safety evaluations. Applying the framework to a concrete domain like education and running it on three Llama Guard versions gives a specific observation that could be useful to people building guardrails for school tools.

The soft spot is the evaluation itself. The abstract supplies no sample size, no description of how the prompts were generated or labeled, and no post-generation validation that the synthetic examples match real child interactions or that the categories are complete. The validation concern holds: without expert review of the prompts for realism or coverage, the performance gap could be an artifact of the test set rather than a genuine model limitation. That leaves the central claim under-supported.
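
To see why the missing sample size matters, the sketch below computes a Wilson score interval for a detection rate. It is an illustration only: the counts are made up, and nothing here reflects numbers from the paper. The same observed rate is far less certain on a small test set than on a large one.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: the same 50% detection rate at two test-set sizes.
small = wilson_interval(5, 10)
large = wilson_interval(500, 1000)
print("n=10:   %.3f to %.3f" % small)
print("n=1000: %.3f to %.3f" % large)
```

With ten unsafe prompts the interval spans roughly half the unit range, so a claim that a model "struggles" would be hard to distinguish from noise; reporting the test-set size lets a reader make that judgment.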

This paper is for researchers working on AI safety for minors or education applications. A reader looking for category lists or a template for domain-specific testing might extract some practical ideas, but anyone wanting reproducible results would need the missing methods.

It deserves peer review once the authors add the test-set construction steps and a validation round, because the topic is timely and the basic approach has merit even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper proposes an expert-guided and incident-grounded evaluation framework for assessing child safety risks in generative AI. It extracts hazard categories from expert guidelines and AI incident databases, uses them to construct a synthetic test set focused on the education domain, and evaluates three Llama Guard models on their ability to detect unsafe user prompts, concluding that current models struggle with education-related unsafe prompts.

Significance. If the synthetic test set is shown to be representative and correctly labeled, the framework could provide a useful template for domain-specific safety evaluation that incorporates child-specific risks, addressing a gap in existing general-purpose safety benchmarks.

major comments (2)

Abstract: The headline claim that 'current Llama Guard models struggle to identify education-related unsafe user prompts' is presented without any description of test-set construction, sample size, scoring criteria, error analysis, or inter-rater reliability, rendering the result impossible to evaluate from the given text.
Abstract / framework description: The paper relies on hazard categories drawn from expert guidelines and incident databases to generate the synthetic test set, yet reports no post-generation validation (e.g., expert review for realism, coverage of actual child/adolescent education interactions, or label accuracy). Without this step the observed performance gap cannot be confidently attributed to model limitations rather than artifacts of the generation process.
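
One way the label-accuracy check requested above could be quantified is to have an expert relabel a sample of generated prompts and compute Cohen's kappa against the generator-assigned labels. This is a sketch under that assumption; the paper does not report such a step, and the label lists below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: generator-assigned vs. an expert's review.
generated = ["unsafe", "unsafe", "safe", "safe"]
expert = ["unsafe", "safe", "safe", "safe"]

kappa = cohens_kappa(generated, expert)
print(f"kappa = {kappa:.2f}")
```

A low kappa would suggest that prompts generated for an "unsafe" category do not always read as unsafe to a human reviewer, which is exactly the artifact the referee is worried about.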

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We respond to each major comment below and will make revisions to improve clarity and address the noted gaps.

Referee (Abstract): The headline claim that 'current Llama Guard models struggle to identify education-related unsafe user prompts' is presented without any description of test-set construction, sample size, scoring criteria, error analysis, or inter-rater reliability, rendering the result impossible to evaluate from the given text.

Authors: We agree the abstract requires more supporting detail to stand alone. The main text describes test-set construction (Section 3), reports the sample size and scoring approach (Section 4), and includes error analysis (Section 5). Inter-rater reliability is not applicable because labels are assigned deterministically from the hazard categories. We will revise the abstract to include a brief statement on test-set size, construction method, and scoring criteria.

Revision: yes

Referee (Abstract / framework description): The paper relies on hazard categories drawn from expert guidelines and incident databases to generate the synthetic test set, yet reports no post-generation validation (e.g., expert review for realism, coverage of actual child/adolescent education interactions, or label accuracy). Without this step the observed performance gap cannot be confidently attributed to model limitations rather than artifacts of the generation process.

Authors: The manuscript presents an initial application of the framework and does not include post-generation validation. We will revise the abstract and add an explicit limitations subsection in the discussion that acknowledges the absence of expert review for realism and label accuracy. The revision will also outline how such validation can be incorporated in future extensions of the framework.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework draws from external guidelines and databases; results are direct model evaluations

The paper constructs hazard categories from external expert guidelines and AI incident databases, builds a synthetic test set, and reports empirical performance of Llama Guard models on that set. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked. The central claim (models struggle on education-related prompts) is an independent measurement on the constructed set rather than a quantity forced by definition or prior self-work. The derivation chain is self-contained against the stated external sources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that expert guidelines and incident databases together yield hazard categories that are both complete and representative for child users; no free parameters or invented entities are visible in the abstract.

axioms (1)

Domain assumption: Expert guidelines and AI incident databases accurately capture child-specific hazards in generative AI.
The framework is built by identifying hazard categories from these two sources.

Reference graph

Works this paper leans on

19 extracted references · 7 canonical work pages · 4 internal anchors

[1] Gavin Abercrombie, Djalel Benbouzid, Paolo Giudici, Delaram Golpayegani, Julio Hernandez, Pierre Noro, Harshvardhan Pandit, Eva Paraschou, Charlie Pownall, Jyoti Prajapati, et al. 2024. A collaborative, human-centred taxonomy of AI, algorithmic, and automation harms. arXiv preprint arXiv:2407.01294 (2024)

[2] American Psychological Association. n.d. About APA. https://www.apa.org/about. Accessed: 2025-11-30

[3] Common Sense. 2025. Common Sense Media. https://www.commonsense.org/. Accessed August 11, 2025

[4] Digital Safety Research Institute. 2025. Dyff - AI Auditing Platform. https://dyff.io/. Accessed August 8, 2025

[5] Wiebke Hutiri, Orestis Papakyriakopoulos, and Alice Xiang. 2024. Not my voice! A taxonomy of ethical and safety harms of speech generators. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 359–376

[6] Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. 2025. Towards interactive evaluations for interaction harms in human-AI systems. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8. 1302–1310

[7] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 (2023)

[8] Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, and Amit Dhurandhar. 2025. Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions. arXiv preprint arXiv:2506.13510 (2025)

[9] Shaun Khoo, Gabriel Chua, and Rachel Shong. 2025. MinorBench: A hand-built benchmark for content-based risks for children. arXiv preprint arXiv:2503.10242 (2025)

[10] Pierre Le Jeune, Jiaen Liu, Luca Rossi, and Matteo Dora. 2025. RealHarm: A collection of real-world language model application failures. In Proceedings of the First Workshop on LLM Security (LLMSEC). 87–100

[11] Hao-Ping Lee, Yu-Ju Yang, Thomas Serban Von Davier, Jodi Forlizzi, and Sauvik Das. 2024. Deepfakes, phrenology, surveillance, and more! A taxonomy of AI privacy risks. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–19

[12] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3214–3252

[13] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)

[14] Michael B. Robb and Supreet Mann. 2025. Talk, Trust, and Trade-Offs: How and Why Teens Use AI Companions. Technical Report. Common Sense Media, San Francisco, CA

[15] Peter Slattery, Alexander K Saeri, Emily AC Grundy, Jess Graham, Michael Noetel, Risto Uuk, James Dao, Soroush Pour, Stephen Casper, and Neil Thompson. 2024. The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv preprint arXiv:2408.12622 (2024)

[16] The Safe AI For Children Alliance. 2025. About The Safe AI for Children Alliance. https://www.safeaiforchildren.org. Accessed August 11, 2025

[17–18] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. 2022. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 214–229

[19] Yaman Yu, Yiren Liu, Jacky Zhang, Yun Huang, and Yang Wang. 2025. Understanding Generative AI Risks for Youth: A Taxonomy Based on Empirical Data. arXiv preprint arXiv:2502.16383 (2025)

Running header: HEAL@CHI, April 2026, Barcelona. Haein Kong

A Prompts for Test Set Generation

Mistral-7B-Instruct model was used to generate a test set consisting of both safe and unsafe user requests in educational contexts. To ...
