SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users

Fengpei Yuan; Madhava Kalyan Gadiputi; Wenzheng Zhao

arXiv: 2604.03264 · v1 · submitted 2026-03-12 · 💻 cs.CV · cs.AI · cs.CR


Pith reviewed 2026-05-15 11:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CR

keywords safety-first screening · personalized video retrieval · vulnerable users · dementia care · multimodal analysis · LLM decision making · adaptive question generation


The pith

SafeScreen screens videos against each user's individual safety rules before any content is shown.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SafeScreen as a pipeline that extracts safety criteria from a user profile, generates adaptive questions about candidate videos, analyzes them multimodally, and lets an LLM approve or reject each video in sequence. This approach treats safety compliance as a hard prerequisite rather than a post-ranking filter. Standard platforms optimize for engagement and can surface unsuitable material for children or dementia patients; SafeScreen instead produces a shortlist that already satisfies the profile constraints. If the method works, open video repositories become usable in care and education settings without requiring pre-labeled safe content or manual review for every user.

Core claim

SafeScreen retrieves and presents personalized videos by first deriving individualized safety criteria from a user profile, then performing sequential approval through adaptive question generation, multimodal VideoRAG evidence collection, and LLM-based verification of safety, appropriateness, and relevance; the result is an explainable decision for each candidate that prioritizes constraint satisfaction over engagement signals.
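The sequential approval loop described in the core claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `screen_videos`, the `Question` and `Decision` types, and every `toy_*` function are hypothetical names, and simple keyword matching stands in for VideoRAG evidence collection and the LLM decision step.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class Question(NamedTuple):
    criterion: str  # the safety criterion this question probes
    text: str       # the question posed about the video


@dataclass
class Decision:
    video_id: str
    approved: bool
    reasons: List[str] = field(default_factory=list)


def screen_videos(profile, candidates, extract_criteria, make_questions,
                  answer_question, decide, shortlist_size=3):
    """Approve or reject candidates one at a time; stop once the shortlist is full."""
    criteria = extract_criteria(profile)
    shortlist, log = [], []
    for video_id in candidates:
        questions = make_questions(criteria, video_id)
        evidence = [(q, answer_question(video_id, q)) for q in questions]
        approved, reasons = decide(criteria, evidence)
        log.append(Decision(video_id, approved, reasons))
        if approved:
            shortlist.append(video_id)
            if len(shortlist) >= shortlist_size:
                break
    return shortlist, log


# Hypothetical stand-ins: text descriptions replace real video analysis.
TOY_VIDEOS = {
    "v1": "loud sirens and flashing lights at a car crash",
    "v2": "calm footage of a 1960s seaside town with soft music",
    "v3": "gardening tips with gentle narration",
}


def toy_extract_criteria(profile):
    # Assumes the profile lists trigger terms directly.
    return list(profile["triggers"])


def toy_make_questions(criteria, video_id):
    return [Question(c, f"Does the video show or mention {c}?") for c in criteria]


def toy_answer_question(video_id, question):
    return "yes" if question.criterion in TOY_VIDEOS[video_id] else "no"


def toy_decide(criteria, evidence):
    # Reject as soon as any criterion is violated; the violated questions are the explanation.
    hits = [q.text for q, answer in evidence if answer == "yes"]
    return (not hits, hits)


if __name__ == "__main__":
    shortlist, log = screen_videos(
        {"triggers": ["sirens", "crash"]}, ["v1", "v2", "v3"],
        toy_extract_criteria, toy_make_questions,
        toy_answer_question, toy_decide, shortlist_size=2,
    )
    print(shortlist)
    for decision in log:
        print(decision)
```

The point of the sketch is the control flow: no candidate reaches the shortlist without passing the decision step, and each rejection carries the questions that triggered it.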

What carries the argument

The sequential approval pipeline that extracts profile-driven safety criteria and verifies them via adaptive question generation plus multimodal video analysis before any exposure occurs.

If this is right

Candidate videos are approved or rejected one at a time rather than ranked by popularity or relevance.
The output list diverges from engagement-optimized rankings in the large majority of test cases.
Safety, sensibleness, and groundedness scores remain high when checked by both automated and human evaluators.
The method works on uncurated repositories without needing precomputed safety labels for each video.
The same pipeline supports different care contexts by swapping the profile criteria.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be applied to other domains such as educational video selection for young learners if new safety criteria are defined.
Real-time profile updates would allow the screening decisions to adapt as a user's needs or sensitivities change over time.
Integration into existing platforms would shift the default from engagement-first to constraint-first retrieval for designated vulnerable accounts.

Load-bearing premise

LLM-based decisions guided by adaptive questions and multimodal analysis will catch harmful content and avoid approving unsafe videos for the specific user profile.

What would settle it

A controlled test in which domain experts review a set of videos containing subtle risks and check whether the system approves any of those videos or rejects clearly safe ones that meet the stated profile criteria.
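A controlled test of this kind reduces to comparing system decisions against expert labels. The sketch below shows one way to tally the two error types; the function name `screening_errors` and the dictionary layout are assumptions made for illustration, and the labels in the usage example are invented, not results from the paper.

```python
def screening_errors(system_approved, expert_safe):
    """Compare system decisions with expert labels.

    Both arguments map video_id -> bool. A false approval is a video the
    system approved although experts marked it unsafe for the profile; a
    false rejection is a video the system rejected although experts
    marked it safe.
    """
    ids = list(system_approved)
    false_approvals = [v for v in ids if system_approved[v] and not expert_safe[v]]
    false_rejections = [v for v in ids if not system_approved[v] and expert_safe[v]]
    n_unsafe = sum(1 for v in ids if not expert_safe[v])
    n_safe = sum(1 for v in ids if expert_safe[v])
    return {
        "false_approvals": false_approvals,
        "false_rejections": false_rejections,
        # Rates are None when the denominator is empty.
        "false_approval_rate": len(false_approvals) / n_unsafe if n_unsafe else None,
        "false_rejection_rate": len(false_rejections) / n_safe if n_safe else None,
    }


if __name__ == "__main__":
    # Invented labels, for illustration only.
    system = {"a": True, "b": True, "c": False, "d": False}
    expert = {"a": True, "b": False, "c": False, "d": True}
    print(screening_errors(system, expert))
```

For vulnerable users the false approval rate is the number to watch, since showing harmful content is costlier than rejecting a safe video.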

Figures

Figures reproduced from arXiv: 2604.03264 by Fengpei Yuan, Madhava Kalyan Gadiputi, Wenzheng Zhao.

Figure 2: Complete SafeScreen framework overview showing the three-stage pipeline.

Figure 3: SafeScreen deployment contexts: clinical integration (left) and systematic evaluation (right).

Original abstract

Open-domain video platforms offer rich, personalized content that could support health, caregiving, and educational applications, but their engagement-optimized recommendation algorithms can expose vulnerable users to inappropriate or harmful material. These risks are especially acute in child-directed and care settings (e.g., dementia care), where content must satisfy individualized safety constraints before being shown. We introduce SafeScreen, a safety-first video screening framework that retrieves and presents personalized video while enforcing individualized safety constraints. Rather than ranking videos by relevance or popularity, SafeScreen treats safety as a prerequisite and performs sequential approval or rejection of candidate videos through an automated pipeline. SafeScreen integrates three key components: (i) profile-driven extraction of individualized safety criteria, (ii) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (iii) LLM-based decision-making that verifies safety, appropriateness, and relevance before content exposure. This design enables explainable, real-time screening of uncurated video repositories without relying on precomputed safety labels. We evaluate SafeScreen in a dementia-care reminiscence case study using 30 synthetic patient profiles and 90 test queries. Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube's engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SafeScreen gives a concrete pipeline for safety-first video screening in care settings, but its claims rest on synthetic data and LLM self-evaluation, which undercut their reliability.


The main thing here is a practical sequential pipeline that pulls individualized safety rules from a user profile, then uses adaptive questions plus VideoRAG to check candidate videos before any exposure. That setup is new enough in the vulnerable-user context and it does a clean job of making safety the gate rather than a post-hoc filter. The dementia-care case study shows the system diverging from YouTube rankings in most trials while keeping the internal safety and sensibleness scores high, which is useful to see even if the numbers are preliminary.

The evaluation stays limited to 30 synthetic profiles and 90 queries, with no real patients or caregivers involved and no external baseline beyond engagement divergence. The safety judgments themselves come from the same LLM family that runs the pipeline, so the high coverage and groundedness numbers risk confirming the model's own reasoning patterns instead of catching real misses. Domain-expert validation is mentioned but not detailed enough to offset that.

This work is aimed at people building recommendation systems for health or education domains who need a starting template for constraint-driven retrieval. It deserves a serious referee because the problem is real and the architecture is straightforward to test further, but any acceptance should require real-user data and independent safety labels before the results can be treated as evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces SafeScreen, a safety-first screening framework for personalized video retrieval aimed at vulnerable users (e.g., dementia care). It extracts individualized safety criteria from user profiles, performs evidence-grounded assessments via adaptive question generation and multimodal VideoRAG, and uses LLM-based decision-making to approve or reject candidate videos before exposure. Rather than optimizing for engagement, the system treats safety as a prerequisite. Evaluation on 30 synthetic patient profiles and 90 test queries reports 80-93% divergence from YouTube's engagement-optimized rankings while claiming high safety coverage, sensibleness, and groundedness, validated by LLM-as-judge metrics and domain experts.

Significance. If the core pipeline reliably enforces individualized constraints without missing harmful content, SafeScreen could enable safer deployment of open video platforms in caregiving and educational settings. The design's emphasis on explainable, real-time screening without precomputed labels is a constructive contribution, but the current evaluation's dependence on synthetic profiles and internal LLM judgments provides limited evidence that the approach generalizes to real individualized safety needs.

major comments (2)

[Evaluation] Evaluation section: safety coverage, sensibleness, and groundedness are defined and scored by the same LLM pipeline used in the screening system itself, creating circularity that does not independently measure missed harmful content or false approvals on the 90 test queries.
[Evaluation] Evaluation section: the headline claim of reliable individualized safety verification rests on 30 synthetic profiles and LLM/expert judgments with no real-user validation, no baseline comparisons to other safety filters, and no quantification of LLM judgment error, which is load-bearing for the assertion that the framework works in dementia-care settings.

minor comments (2)

[Abstract] Abstract and Evaluation: the 80-93% divergence range should be reported with per-profile or per-query breakdowns and confidence intervals rather than as a single aggregate.
[Evaluation] The manuscript should clarify the exact prompting strategy and model versions used for both the screening pipeline and the LLM-as-judge evaluation to allow reproducibility.
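The per-profile breakdown requested in the first minor comment could look like the sketch below. Two assumptions are mine rather than the paper's: divergence for a query is taken to mean that SafeScreen's top approved video differs from YouTube's top-ranked video, and the interval is a Wilson score interval for a proportion. The names `wilson_interval` and `divergence_by_profile` are hypothetical.

```python
import math
from collections import defaultdict


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def divergence_by_profile(records):
    """records: iterable of (profile_id, safescreen_top, youtube_top) tuples.

    Returns {profile_id: (divergence_rate, (ci_low, ci_high))}.
    """
    counts = defaultdict(lambda: [0, 0])  # profile_id -> [diverged, total]
    for profile_id, ours, theirs in records:
        counts[profile_id][1] += 1
        if ours != theirs:
            counts[profile_id][0] += 1
    return {
        pid: (diverged / total, wilson_interval(diverged, total))
        for pid, (diverged, total) in counts.items()
    }


if __name__ == "__main__":
    # Invented records, for illustration only.
    records = [
        ("p1", "vidA", "vidB"), ("p1", "vidC", "vidC"), ("p1", "vidD", "vidE"),
        ("p2", "vidF", "vidG"), ("p2", "vidH", "vidI"), ("p2", "vidJ", "vidK"),
    ]
    for pid, (rate, (low, high)) in divergence_by_profile(records).items():
        print(pid, round(rate, 3), round(low, 3), round(high, 3))
```

If the 90 queries are spread evenly over the 30 profiles (three per profile, which the abstract does not state), per-profile intervals will be very wide, which is itself worth reporting alongside the aggregate range.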

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the evaluation of SafeScreen. We address each major comment point by point below, with revisions incorporated where feasible to improve clarity and evidence.


Referee: [Evaluation] Evaluation section: safety coverage, sensibleness, and groundedness are defined and scored by the same LLM pipeline used in the screening system itself, creating circularity that does not independently measure missed harmful content or false approvals on the 90 test queries.

Authors: We acknowledge the risk of circularity when the same LLM pipeline contributes to both screening decisions and automated evaluation metrics. The original manuscript already includes independent validation by domain experts on a subset of the 90 queries, which we have now expanded in the revised version with a dedicated subsection detailing expert agreement rates, inter-rater reliability, and specific cases where expert review overrode or confirmed LLM outputs. This provides an external check on missed harmful content and false approvals. We have also added a limitations paragraph discussing LLM-as-judge biases.

Revision: partial

Referee: [Evaluation] Evaluation section: the headline claim of reliable individualized safety verification rests on 30 synthetic profiles and LLM/expert judgments with no real-user validation, no baseline comparisons to other safety filters, and no quantification of LLM judgment error, which is load-bearing for the assertion that the framework works in dementia-care settings.

Authors: We agree that reliance on 30 synthetic profiles constitutes a limitation for claims about real dementia-care deployment. The revised manuscript now includes explicit baseline comparisons against rule-based keyword filters and simple multimodal classifiers, with quantitative results showing SafeScreen's divergence and safety gains. We have also added quantification of LLM judgment error via agreement statistics with the domain experts (e.g., Cohen's kappa and disagreement cases). Real-user validation with vulnerable populations is not feasible within the scope of this work due to ethical and IRB constraints; we explicitly frame the current study as a controlled proof-of-concept and outline planned clinical trials as future work.

Revision: partial
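For readers unfamiliar with the agreement statistic named in the rebuttal, Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance. The sketch below computes it from scratch; the function name `cohens_kappa` and the example labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented decisions, for illustration only.
    llm = ["approve", "approve", "reject", "reject", "approve", "reject"]
    expert = ["approve", "reject", "reject", "reject", "approve", "approve"]
    print(round(cohens_kappa(llm, expert), 3))
```

In this invented example the raters agree on 4 of 6 items while chance agreement is 0.5, so kappa is (0.667 - 0.5) / (1 - 0.5), or about 0.333. Raw agreement alone would overstate how well the LLM tracks the expert.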

standing simulated objections not resolved

Real-user validation with actual vulnerable users (e.g., dementia patients) is still missing; the authors cite ethical and regulatory requirements as the obstacle.

Circularity Check

1 step flagged

LLM-based evaluation of safety decisions shares the same model class as the screening pipeline, risking circular overestimation of reliability

specific steps

other [Abstract (Results paragraph)]
"Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube's engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts."

Safety coverage, sensibleness, and groundedness are defined and scored via the same LLM-based decision-making and adaptive question generation used inside the SafeScreen pipeline itself; the evaluator therefore risks reproducing the pipeline's own reasoning patterns rather than providing an independent check on missed harmful content or false approvals.

full rationale

The paper's central results (80-93% divergence from YouTube plus high safety coverage/sensibleness/groundedness) rest on LLM-as-judge validation of the outputs produced by an LLM-driven pipeline (profile extraction, adaptive question generation, VideoRAG analysis, and decision-making). While divergence from YouTube rankings can be measured externally, the safety metrics are generated and scored inside the same LLM reasoning loop on synthetic profiles, creating partial circularity in the validation of individualized safety enforcement. No equations or self-citations reduce the derivation by construction, so the circularity is moderate rather than total.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the untested premise that current multimodal LLMs can perform reliable individualized safety verification; no free parameters or new entities are introduced, but the core decision step rests on domain assumptions about LLM capability.

axioms (1)

domain assumption: LLMs can generate accurate, grounded safety and appropriateness judgments from video content and user profiles
Invoked in the LLM-based decision-making component and evaluation validation without external calibration or error bounds.



Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

[1–2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems, Vol. 35. 23716–23736

[3] Jan Batzner, Volker Stocker, Bingjun Tang, Anusha Natarajan, Qinhao Chen, Stefan Schmid, and Gjergji Kasneci. 2025. Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency. In Proceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society. AAAI, 343–354

[4] Gary S. Collins, Karel G. M. Moons, Paula Dhiman, Richard D. Riley, Andrew L. Beam, Ben Van Calster, et al. 2024. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385 (2024), e078378. doi:10.1136/bmj-2024-078378

[5] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the Tenth ACM Conference on Recommender Systems. 191–198. doi:10.1145/2959100.2959190

[6] Norah L Crossnohere, Mohamed Elsaid, Jonathan Paskett, Seuli Bose-Brill, and John F P Bridges. 2022. Guidelines for artificial intelligence in medicine: literature review and content analysis of frameworks. Journal of Medical Internet Research 24, 8 (2022), e36823. doi:10.2196/36823

[7] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. 2010. The YouTube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems. 293–296. doi:10.1145/1864708.1864770

[8] Anne A H de Hond, Artuur M Leeuwenberg, Lotty Hooft, Ilse M J Kant, Steven W J Nijman, Hendrikus J A van Os, Jiska J Aardoom, Thomas P A Debray, Ewoud Schuit, Maarten van Smeden, Johannes B Reitsma, Ewout W Steyerberg, Niels H Chavannes, and Karel G M Moons. 2022. Guidelines and quality criteria for artificial intelligence-based prediction models in healt... doi:10.1038/s41746-021-00549-7

[9] Kathleen K. Fitzpatrick, Alison Darcy, and Molly Vierhile. 2017. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health 4, 2 (2017), e19. doi:10.2196/mental.7785

[10] Google LLC. 2015. YouTube Kids. https://www.youtubekids.com/. Accessed: 2025-01

[11] Robert Gorwa, Reuben Binns, and Christian Katzenbach. 2020. Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society 7, 1 (2020), 2053951719897945. doi:10.1177/2053951719897945

[12] Kilem L. Gwet. 2014. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters (4 ed.). Advanced Analytics, LLC, Gaithersburg, MD

[13] Becky Inkster, Shubham Sarda, and Vinod Subramanian. 2018. An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study. JMIR mHealth and uHealth 6, 11 (2018), e12106. doi:10.2196/mhealth.9785

[14] Rishabh Kaushal, Jacob van de Kerkhof, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi. 2024. Automated Transparency: A Legal and Empirical Analysis of the Digital Services Act Transparency Database. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 1121–1132. doi:10.1145/3630106.3658960

[15] Jean-Baptiste Lamy, Abdelmalek Mouazer, Romain Léguillon, Romain Lelong, Stéfan J Darmoni, Karima Sedki, Sophie Dubois, and Hector Falcoff. 2024. Adaptive questionnaires for facilitating patient data entry in clinical decision support systems: methods and application to STOPP/START v2. BMC Medical Informatics and Decision Making 24, 1 (2024), 326. doi:10.1186/s12911-024-02742-6

[16] J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174

[17] Amanda Lazar, Caroline Edasis, and Anne Marie Piper. 2017. A critical lens on dementia and design in HCI. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 2175–2188. doi:10.1145/3025453.3025638

[18] Hao Li, Shuai Wu, Haoran Zheng, Xiaobo Jiang, Bo Jiang, and Chao Zhao. 2024. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579 (2024)

[19] Yu-Ju Liao, Yu-Ling Jao, Marie Boltz, Olusegun T. Adekeye, Daniel Berish, Feng Yuan, and Xiaopeng Zhao. 2023. Use of a humanoid robot in supporting dementia care: A qualitative analysis. SAGE Open Nursing 9 (2023), 23779608231179528. doi:10.1177/23779608231179528

[20] Sonia Livingstone and Ellen J. Helsper. 2008. Parental mediation of children’s internet use. Journal of Broadcasting & Electronic Media 52, 4 (2008), 581–599. doi:10.1080/08838150802437396

[21] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1906–1919

[22–23] Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, and Sanghyuk Choi. Offsetbias: Leveraging debiased data for tuning evaluators. arXiv preprint arXiv:2407.06551 (2024)

[24–25] Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos. arXiv:2502.01549 [cs.IR] https://arxiv.org/abs/2502.01549

[26] Anna Riedmann, Philipp Schaper, and Birgit Lugrin. 2025. Reinforcement learning in education: A systematic literature review. International Journal of Artificial Intelligence in Education 35 (2025), 1–65. doi:10.1007/s40593-025-00494-6

[27] Landon Ring, Liyan Shi, Kayla Totzke, and Timothy Bickmore. 2015. Social support agents for older adults: Longitudinal affective computing in the home. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction. 551–557. doi:10.1109/ACII.2015.7344662

[28–30] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7463–. doi:10.1109/ICCV.2019.00757

[31] Zhulin Tao, Xiaohao Liu, Yewei Xia, Xiang Wang, Lifang Yang, Xianglin Huang, and Tat-Seng Chua. 2023. Self-supervised learning for multimedia recommendation. IEEE Transactions on Multimedia 25 (2023), 5107–5116. doi:10.1109/TMM.2022.3177882

[32] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vince...

[33] Jun Wang and Ying Zhao. 2022. Affective video content analysis and recommendation: A survey. IEEE Access 10 (2022), 126430–126447. doi:10.1109/ACCESS.2022.3195050

[34] Qifan Wang, Yinwei Wei, Jianhua Yin, Jianwei Wu, Xuemeng Song, and Liqiang Nie. 2023. DualGNN: Dual graph neural network for multimedia recommendation. IEEE Transactions on Multimedia 25 (2023), 1074–1084. doi:10.1109/TMM.2021.3138298

[35] Feng Yuan, Rui Zhang, Dania Bilal, and Xiaopeng Zhao. 2021. Learning-based strategy design for robot-assisted reminiscence therapy based on a developed model for people with dementia. In Proceedings of the International Conference on Social Robotics. 432–442. doi:10.1007/978-3-030-85717-1_42

[36–37] Hamed Zamani, Susan Dumais, Nick Craswell, Paul Bennett, and Gord Lueck. Generating clarifying questions for information retrieval. In Proceedings of The Web Conference 2020. ACM, 418–428. doi:10.1145/3366423.3380126

[38] Yongfeng Zhang and Xu Chen. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101. doi:10.1561/1500000071

[39] Wenzheng Zhao. 2026. An Edge–Host–Cloud Architecture for Robot-Agnostic, Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care. IEEE Transactions on Robotics (T-RO) (2026)

[40] Wenzheng Zhao, Kruthika Gangaraju, and Fengpei Yuan. 2025. Multimodal Perception-Driven Decision-Making for Human-Robot Interaction: A Survey. Frontiers in Robotics and AI 12 (2025), 1604472.

A Implementation and Execution Protocol. SafeScreen operates across multiple environments: GPT-4 API for profile extraction, risk detection, and question generation; N...

[41] avoids accuracy thresholds, acknowledging metrics vary by content type and harm severity; for vulnerable populations, false negatives (showing harmful content) carry greater risk than false positives (over-cautious rejection). B.2 Hybrid AI-Human Evaluation Approach. Following validation methodologies for LLM-as-a-judge frameworks [17, 21], we employ hyb...

[1] [1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al

work page

[2] [2]

InAdvances in Neural Information Processing Systems, Vol

Flamingo: A visual language model for few-shot learning. InAdvances in Neural Information Processing Systems, Vol. 35. 23716–23736

work page

[3] [3]

Jan Batzner, Volker Stocker, Bingjun Tang, Anusha Natarajan, Qinhao Chen, Stefan Schmid, and Gjergji Kasneci. 2025. Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency. InProceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society. AAAI, 343–354

work page 2025

[4] [4]

Collins, Karel G

Gary S. Collins, Karel G. M. Moons, Paula Dhiman, Richard D. Riley, Andrew L. Beam, Ben Van Calster, et al. 2024. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods.BMJ385 (2024), e078378. doi:10.1136/bmj-2024-078378

work page doi:10.1136/bmj-2024-078378 2024

[5] [5]

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. InProceedings of the Tenth ACM Conference on Recommender Systems. 191–198. doi:10.1145/2959100.2959190

work page doi:10.1145/2959100.2959190 2016

[6] [6]

Norah L Crossnohere, Mohamed Elsaid, Jonathan Paskett, Seuli Bose-Brill, and John F P Bridges. 2022. Guidelines for artificial intelligence in medicine: literature review and content analysis of frameworks.Journal of Medical Internet Research 24, 8 (2022), e36823. doi:10.2196/36823

work page doi:10.2196/36823 2022

[7] [7]

James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. 2010. The YouTube video recommendation system. InProceedings of the Fourth ACM Conference on Recommender Systems. 293–296. doi:10.1145/1864708. 1864770

work page doi:10.1145/1864708 2010

[8] [8]

Anne A H de Hond, Artuur M Leeuwenberg, Lotty Hooft, Ilse M J Kant, Steven W J Nijman, Hendrikus J A van Os, Jiska J Aardoom, Thomas P A Debray, Ewoud Schuit, Maarten van Smeden, Johannes B Reitsma, Ewout W Steyerberg, Niels H Chavannes, and Karel G M Moons. 2022. Guidelines and quality criteria for artificial intelligence-based prediction models in healt...

work page doi:10.1038/s41746-021-00549-7 2022

[9] [9]

JMIR Mental Health 4(2), e19 (2017).https://doi.org/10.2196/mental.7785 SLIP & ETHICS: Graduated Intervention for AI Emotional Companions 11

Kathleen K. Fitzpatrick, Alison Darcy, and Molly Vierhile. 2017. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial.JMIR Mental Health4, 2 (2017), e19. doi:10.2196/mental.7785

work page doi:10.2196/mental.7785 2017

[10] [10]

Google LLC. 2015. YouTube Kids. https://www.youtubekids.com/. Accessed: 2025-01

work page 2015

[11] [11]

Robert Gorwa, Reuben Binns, and Christian Katzenbach. 2020. Algorithmic content moderation: Technical and political challenges in the automation of platform governance.Big Data & Society7, 1 (2020), 2053951719897945. doi:10. 1177/2053951719897945

work page 2020

[12] [12]

Kilem L. Gwet. 2014.Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters(4 ed.). Advanced Analytics, LLC, Gaithersburg, MD

work page 2014

[13] [13]

Becky Inkster, Shubham Sarda, and Vinod Subramanian. 2018. An empathy- driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study.JMIR mHealth and uHealth6, 11 (2018), e12106. doi:10.2196/mhealth.9785

work page doi:10.2196/mhealth.9785 2018

[14] [14]

Rishabh Kaushal, Jacob van de Kerkhof, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi. 2024. Automated Transparency: A Legal and Empirical Analysis of the Digital Services Act Transparency Database. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 1121–1132. doi:10.1145/3630106.3658960

work page doi:10.1145/3630106.3658960 2024

[15] [15]

Jean-Baptiste Lamy, Abdelmalek Mouazer, Romain Léguillon, Romain Lelong, Stéfan J Darmoni, Karima Sedki, Sophie Dubois, and Hector Falcoff. 2024. Adap- tive questionnaires for facilitating patient data entry in clinical decision support systems: methods and application to STOPP/START v2.BMC Medical Informatics and Decision Making24, 1 (2024), 326. doi:10....

work page doi:10.1186/s12911-024-02742-6 2024

[16] [16]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agree- ment for categorical data.Biometrics33, 1 (1977), 159–174

work page 1977

[17] [17]

Amanda Lazar, Caroline Edasis, and Anne Marie Piper. 2017. A critical lens on dementia and design in HCI. InProceedings of the CHI Conference on Human Factors in Computing Systems. 2175–2188. doi:10.1145/3025453.3025638

work page doi:10.1145/3025453.3025638 2017

[18] [18]

Hao Li, Shuai Wu, Haoran Zheng, Xiaobo Jiang, Bo Jiang, and Chao Zhao. 2024. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Adekeye, Daniel Berish, Feng Yuan, and Xiaopeng Zhao

Yu-Ju Liao, Yu-Ling Jao, Marie Boltz, Olusegun T. Adekeye, Daniel Berish, Feng Yuan, and Xiaopeng Zhao. 2023. Use of a humanoid robot in supporting dementia care: A qualitative analysis.SAGE Open Nursing9 (2023), 23779608231179528. doi:10.1177/23779608231179528

work page doi:10.1177/23779608231179528 2023

[20] [20]

Sonia Livingstone and Ellen J. Helsper. 2008. Parental mediation of children’s internet use.Journal of Broadcasting & Electronic Media52, 4 (2008), 581–599. doi:10.1080/08838150802437396

[21]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1906–1919.

[22]

Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, and Sanghyuk Choi. 2024. OffsetBias: Leveraging debiased data for tuning evaluators. arXiv preprint arXiv:2407.06551 (2024).

[24]

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. 2025. VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos. arXiv:2502.01549 [cs.IR] https://arxiv.org/abs/2502.01549

[26]

Anna Riedmann, Philipp Schaper, and Birgit Lugrin. 2025. Reinforcement learning in education: A systematic literature review. International Journal of Artificial Intelligence in Education 35 (2025), 1–65. doi:10.1007/s40593-025-00494-6

[27]

Landon Ring, Liyan Shi, Kayla Totzke, and Timothy Bickmore. 2015. Social support agents for older adults: Longitudinal affective computing in the home. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction. 551–557. doi:10.1109/ACII.2015.7344662

[28]

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7463– doi:10.1109/ICCV.2019.00757

[31]

Zhulin Tao, Xiaohao Liu, Yewei Xia, Xiang Wang, Lifang Yang, Xianglin Huang, and Tat-Seng Chua. 2023. Self-supervised learning for multimedia recommendation. IEEE Transactions on Multimedia 25 (2023), 5107–5116. doi:10.1109/TMM.2022.3177882

[32]

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vince...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33]

Jun Wang and Ying Zhao. 2022. Affective video content analysis and recommendation: A survey. IEEE Access 10 (2022), 126430–126447. doi:10.1109/ACCESS.2022.3195050

[34]

Qifan Wang, Yinwei Wei, Jianhua Yin, Jianwei Wu, Xuemeng Song, and Liqiang Nie. 2023. DualGNN: Dual graph neural network for multimedia recommendation. IEEE Transactions on Multimedia 25 (2023), 1074–1084. doi:10.1109/TMM.2021.3138298

[35]

Feng Yuan, Rui Zhang, Dania Bilal, and Xiaopeng Zhao. 2021. Learning-based strategy design for robot-assisted reminiscence therapy based on a developed model for people with dementia. In Proceedings of the International Conference on Social Robotics. 432–442. doi:10.1007/978-3-030-85717-1_42

[36]

Hamed Zamani, Susan Dumais, Nick Craswell, Paul Bennett, and Gord Lueck. 2020. Generating clarifying questions for information retrieval. In Proceedings of The Web Conference 2020. ACM, 418–428. doi:10.1145/3366423.3380126

[38]

Yongfeng Zhang and Xu Chen. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101. doi:10.1561/1500000071

[39]

Wenzheng Zhao. 2026. An Edge–Host–Cloud Architecture for Robot-Agnostic, Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care. IEEE Transactions on Robotics (T-RO) (2026).

[40]

based on the specific clinical scenario

Wenzheng Zhao, Kruthika Gangaraju, and Fengpei Yuan. 2025. Multimodal Perception-Driven Decision-Making for Human-Robot Interaction: A Survey. Frontiers in Robotics and AI 12 (2025), 1604472.

A Implementation and Execution Protocol

SafeScreen operates across multiple environments: GPT-4 API for profile extraction, risk detection, and question generation; N...

[41]

car videos

avoids accuracy thresholds, acknowledging metrics vary by content type and harm severity; for vulnerable populations, false negatives (showing harmful content) carry greater risk than false positives (over-cautious rejection).

B.2 Hybrid AI-Human Evaluation Approach

Following validation methodologies for LLM-as-a-judge frameworks [17, 21], we employ hyb...
