Recognition: unknown
SenBen: Sensitive Scene Graphs for Explainable Content Moderation
Pith reviewed 2026-05-10 16:46 UTC · model grok-4.3
The pith
A compact distilled model generates sensitive scene graphs that explain unsafe images better than most large vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sensitive content can be represented through scene graphs that jointly capture objects, their attributes such as emotional and physical states, and the relations among them, together with explicit sensitivity tags across five categories; a student model distilled from a larger VLM using targeted losses for imbalance and multi-task prediction produces these grounded representations more accurately and efficiently than most frontier models or commercial APIs.
What carries the argument
The SenBen benchmark of movie-frame scene graphs paired with sensitivity tags, together with the distillation recipe that combines suffix-based object identity, Vocabulary-Aware Recall loss, and a decoupled asymmetric-loss tag head to produce both structured graphs and tags from a compact model.
If this is right
- Moderation pipelines can output not only a binary flag but the specific objects, attributes, and relations that triggered it.
- Real-time filtering becomes feasible on devices with limited compute because inference is 7.6 times faster and memory use drops by a factor of 16.
- Object detection and captioning quality improve as by-products when the same model is trained for grounded sensitive content.
- Safety APIs can be supplemented or replaced by transparent, spatially grounded alternatives that cite concrete visual evidence.
Where Pith is reading between the lines
- Scene-graph explanations could support post-hoc audits that check whether moderation decisions correlate with protected attributes.
- The same distillation approach might transfer to other structured vision outputs such as activity graphs or affordance maps.
- Collecting a parallel SenBen-style dataset from everyday photos would directly test whether the reported gains survive the shift away from cinematic framing.
Load-bearing premise
The SenBen annotations accurately capture real-world sensitive content and that the chosen evaluation metrics reflect practical moderation performance without domain shift from movies to user-generated images.
What would settle it
A test set of real user-generated photos labeled by human moderators for sensitivity where the student model's scene graphs and tags produce substantially lower recall or lower agreement with human explanations than on the movie-derived test set.
Figures
read the original abstract
Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SenBen, the first large-scale scene-graph benchmark for sensitive content, consisting of 13,999 frames from 157 movies annotated with Visual Genome-style graphs (25 object classes, 28 attributes including affective states, 14 predicates) and 16 sensitivity tags in 5 categories. It distills a frontier VLM into a 241M-parameter student model via multi-task training that combines suffix-based object identity, Vocabulary-Aware Recall (VAR) loss, and a decoupled Query2Label head with asymmetric loss to mitigate vocabulary imbalance in autoregressive scene-graph generation. The resulting model is reported to deliver a +6.4 pp gain in SenBen Recall over standard cross-entropy, to outperform all evaluated VLMs except Gemini models and all commercial safety APIs on grounded scene-graph metrics while also leading in object detection and captioning, and to run at 7.6× faster inference with 16× lower GPU memory.
Significance. If the performance claims hold, the work supplies a novel, grounded benchmark that moves content moderation beyond binary classification toward spatially localized, interpretable explanations of sensitive behavior. The multi-task distillation recipe demonstrates a concrete, parameter-efficient route to compact models that retain strong detection and captioning capability on the introduced benchmark. The reported speed and memory reductions are practically relevant for deployment. Significance for real-world moderation, however, hinges on untested generalization from curated movie frames to typical user-generated imagery.
major comments (2)
- [Abstract] Abstract: the headline claim that the 241 M student 'outperforms all evaluated VLMs except Gemini models and all commercial safety APIs' for explainable content moderation rests exclusively on evaluation on the 13,999 movie-derived frames; no transfer experiments on user-generated images are presented, leaving the covariate shift from scripted lighting, consistent framing, and narrative priors to noisy UGC unaddressed and load-bearing for the practical moderation application.
- [Results section] Results / experimental protocol (assumed §4–5): the reported +6.4 pp SenBen Recall improvement, outperformance on grounded scene-graph metrics, object detection, and captioning are stated without error bars, explicit train/validation/test splits, baseline implementation details, or ablation tables isolating the contributions of suffix identity, VAR loss, and the decoupled Query2Label head; this absence prevents independent verification of the central quantitative claims.
minor comments (2)
- [Abstract] The abstract states '16 sensitivity tags across 5 categories' but does not enumerate the categories or tags; adding an explicit list or table in the dataset description section would improve clarity.
- Figure captions for scene-graph visualizations could explicitly note which nodes/edges correspond to the sensitivity tags to aid reader interpretation.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major comment point by point below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that the 241 M student 'outperforms all evaluated VLMs except Gemini models and all commercial safety APIs' for explainable content moderation rests exclusively on evaluation on the 13,999 movie-derived frames; no transfer experiments on user-generated images are presented, leaving the covariate shift from scripted lighting, consistent framing, and narrative priors to noisy UGC unaddressed and load-bearing for the practical moderation application.
Authors: The abstract claims are confined to performance on the SenBen benchmark, which the manuscript explicitly defines as 13,999 frames from 157 movies. This controlled source enables consistent, high-quality scene-graph and sensitivity annotations that would be difficult to obtain at scale from raw UGC. We agree that domain shift to typical user-generated imagery remains untested and is relevant to deployment. In revision we will add a dedicated paragraph in the Discussion section that acknowledges this limitation, describes the expected covariate shifts, and outlines future transfer experiments, without modifying the reported SenBen results. revision: yes
-
Referee: [Results section] Results / experimental protocol (assumed §4–5): the reported +6.4 pp SenBen Recall improvement, outperformance on grounded scene-graph metrics, object detection, and captioning are stated without error bars, explicit train/validation/test splits, baseline implementation details, or ablation tables isolating the contributions of suffix identity, VAR loss, and the decoupled Query2Label head; this absence prevents independent verification of the central quantitative claims.
Authors: We appreciate the emphasis on reproducibility. The manuscript already specifies a 70/15/15 movie-stratified split in Section 3 and reports error bars from three random seeds in the primary tables. To fully satisfy the request we will expand the Experimental Setup subsection with explicit hyper-parameter tables for all baselines and add a new ablation table (main paper or supplementary) that isolates the contribution of suffix-based object identity, the VAR loss, and the decoupled Query2Label head with asymmetric loss. revision: yes
Circularity Check
No circularity; claims rest on empirical evaluation of a new benchmark against external models
full rationale
The paper introduces SenBen as a new dataset and evaluates a distilled student model using multi-task components (suffix identity, VAR loss, decoupled Query2Label head) against external VLMs and commercial APIs. No equations or metrics are shown to reduce to self-defined terms by construction. Performance numbers are direct measurements on the held-out SenBen frames rather than predictions derived from fitted parameters that encode the target result. No load-bearing self-citations or uniqueness theorems from prior author work are invoked to justify the architecture or metrics. The derivation chain from training recipe to reported recall, detection, and efficiency gains is therefore independent of the inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Fatih Cagatay Akyon and Alptekin Temizel. State-of-the- art in nudity classification: A comparative analysis.arXiv preprint arXiv:2312.16338, 2023
-
[2]
Nudenet: Neural nets for nudity clas- sification, detection and selective censoring.https:// github.com/notAI-tech/NudeNet, 2019
Praneeth Bedapudi. Nudenet: Neural nets for nudity clas- sification, detection and selective censoring.https:// github.com/notAI-tech/NudeNet, 2019
2019
-
[3]
Scheduled sampling for sequence prediction with recurrent neural networks
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InNeurIPS, pages 1171–1179, 2015
2015
-
[4]
Fleet, and Ge- offrey E
Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Ge- offrey E. Hinton. Pix2seq: A language modeling framework for object detection. InICLR, 2022
2022
-
[5]
Compile scene graphs with reinforcement learning, 2025
Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, and Chang Wen Chen. Compile scene graphs with reinforcement learning.arXiv preprint arXiv:2504.13617, 2025
-
[6]
Order-agnostic cross entropy for non-autoregressive machine translation
Cunxiao Du, Zhaopeng Tu, and Jing Jiang. Order-agnostic cross entropy for non-autoregressive machine translation. In ICML, 2021
2021
-
[7]
Movie violence/sex/profanity data.https: / / www
Benjamin Earl. Movie violence/sex/profanity data.https: / / www . kaggle . com / datasets / benjameeper / movie- violencesexprofanity- data, 2024. Ac- cessed: 2025-12-30
2024
- [8]
-
[9]
Gemini API documentation.https://ai
Google. Gemini API documentation.https://ai. google.dev/gemini- api/docs, 2025. Accessed: 2026-03-09
2025
-
[10]
Detecting safe search properties.https:// cloud.google.com/vision/docs/detecting- safe-search, 2024
Google Cloud. Detecting safe search properties.https:// cloud.google.com/vision/docs/detecting- safe-search, 2024. Accessed: 2026-03-09
2024
-
[11]
LlavaGuard: An open VLM- based framework for safeguarding vision datasets and mod- els
Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kerst- ing, and Patrick Schramowski. LlavaGuard: An open VLM- based framework for safeguarding vision datasets and mod- els. InICML, 2025
2025
-
[12]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Dis- tilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
MovieNet: A holistic dataset for movie under- standing
Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. MovieNet: A holistic dataset for movie under- standing. InECCV, 2020
2020
-
[14]
Skeleton recall loss for connectivity conserving and resource efficient segmentation of thin tubular structures
Yannick Kirchhoff, Maximilian R Rokuss, Saikat Roy, Balint Kovacs, Constantin Ulrich, Tassilo Wald, Maximil- ian Zenk, Philipp V ollmuth, Jens Kleesiek, Fabian Isensee, et al. Skeleton recall loss for connectivity conserving and resource efficient segmentation of thin tubular structures. In European Conference on Computer Vision, pages 218–234. Springer, 2024
2024
-
[15]
ABAW: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges
Dimitrios Kollias, Panagiotis Tzirakis, Alice Baird, Alan Cowen, and Stefanos Zafeiriou. ABAW: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges. InCVPR, pages 5888–5897, 2023
2023
-
[16]
Ppdl: Predicate probability distribution based loss for unbiased scene graph generation
Wei Li, Haiwei Zhang, Qijie Bai, Guoqing Zhao, Ning Jiang, and Xiaojie Yuan. Ppdl: Predicate probability distribution based loss for unbiased scene graph generation. InCVPR, pages 19447–19456, 2022
2022
-
[17]
FACTUAL: A benchmark for faithful and consistent textual scene graph parsing
Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gho- lamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A benchmark for faithful and consistent textual scene graph parsing. InFindings of ACL, 2023
2023
-
[18]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. InICCV, 2017
2017
-
[19]
Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. Query2label: A simple transformer way to multi-label clas- sification.arXiv preprint arXiv:2107.10834, 2021
-
[20]
Fine-grained predicates learning for scene graph generation
Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, Hao Huang, Heng Tao Shen, and Jingkuan Song. Fine-grained predicates learning for scene graph generation. InCVPR, pages 19470– 19479, 2022. Category Discriminating Loss (CDL)
2022
-
[21]
Ac- tions in context
Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. Ac- tions in context. InCVPR, 2009
2009
-
[22]
Azure AI Content Safety.https://learn
Microsoft. Azure AI Content Safety.https://learn. microsoft.com/en- us/azure/ai- services/ content-safety/, 2024. Accessed: 2026-03-09
2024
-
[23]
Pornography classification: The hidden clues in video space–time.Forensic science interna- tional, 268:46–61, 2016
Daniel Moreira, Sandra Avila, Mauricio Perez, Daniel Moraes, Vanessa Testoni, Eduardo Valle, Siome Golden- stein, and Anderson Rocha. Pornography classification: The hidden clues in video space–time.Forensic science interna- tional, 268:46–61, 2016
2016
-
[24]
Moderation — OpenAI API.https : / / platform
OpenAI. Moderation — OpenAI API.https : / / platform . openai . com / docs / guides / moderation, 2024. Accessed: 2026-03-09
2024
-
[25]
InECML PKDD, 2024
Arnaud Pannatier, Evann Courdier, and Franc ¸ois Fleuret.σ- GPTs: A new approach to autoregressive models. InECML PKDD, 2024
2024
-
[26]
Towards unbiased and robust spatio-temporal scene graph generation and anticipa- tion
Rohith Peddi, Saurabh Saurabh, Ayush Abhay Shrivastava, Parag Singla, and Vibhav Gogate. Towards unbiased and robust spatio-temporal scene graph generation and anticipa- tion. InCVPR, 2025
2025
-
[27]
Lspd: A large-scale pornographic dataset for detec- tion and classification.International Journal of Intelligent Engineering and Systems, 2022
Dinh Duy Phan, Thanh Thien Nguyen, Quang Huy Nguyen, et al. Lspd: A large-scale pornographic dataset for detec- tion and classification.International Journal of Intelligent Engineering and Systems, 2022. 8
2022
-
[28]
UnsafeBench: Benchmarking image safety classifiers on real-world and AI-generated im- ages
Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, and Yang Zhang. UnsafeBench: Benchmarking image safety classifiers on real-world and AI-generated im- ages. InACM CCS, 2025
2025
-
[29]
Asymmetric loss for multi-label classification
Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. InICCV, 2021
2021
-
[30]
Movie description.International Journal of Computer Vision, 123:94–120, 2017
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description.International Journal of Computer Vision, 123:94–120, 2017
2017
-
[31]
Automated detection of substance use-related social media posts based on image and text analysis
Arpita Roy, Anamika Paul, Hamed Pirsiavash, and Shimei Pan. Automated detection of substance use-related social media posts based on image and text analysis. In2017 IEEE 29th International Conference on Tools with Artificial Intel- ligence (ICTAI), pages 772–779. IEEE, 2017
2017
-
[32]
Vsd2014—a dataset for violent scenes detection in hollywood movies and web videos
Markus Schedl, Mats Sj ¨oberg, Ionut ¸ Mironic ˘a, Bogdan Ionescu, Lam Quang Vu, Yu-Gang Jiang, and Claire-Helene Demarty. Vsd2014—a dataset for violent scenes detection in hollywood movies and web videos. InCBMI, 2015
2015
-
[33]
Bharucha, Sukrit Venkatagiri, Mar- tin J
Miriah Steiger, Timir J. Bharucha, Sukrit Venkatagiri, Mar- tin J. Riedl, and Matthew Lease. The psychological well- being of content moderators. InCHI, 2021
2021
-
[34]
Unbiased scene graph generation from bi- ased training
Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from bi- ased training. InCVPR, 2020
2020
-
[35]
Striking the right balance: Recall loss for semantic segmentation
Junjiao Tian, Niluthpol Mithun, Zach Seymour, Han-Pang Chiu, and Zsolt Kira. Striking the right balance: Recall loss for semantic segmentation. InICRA, 2022
2022
-
[36]
MovieGraphs: Towards understanding human-centric situations from videos
Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. MovieGraphs: Towards understanding human-centric situations from videos. InCVPR, 2018
2018
-
[37]
Florence-2: Advancing a unified representation for a variety of vision tasks
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818– 4829, 2024
2024
-
[38]
Ra-sgg: Retrieval- augmented scene graph generation framework via multi- prototype learning
Kanghoon Yoon, Kibum Kim, Jaehyeong Jeon, Yeonjun In, Donghyun Kim, and Chanyoung Park. Ra-sgg: Retrieval- augmented scene graph generation framework via multi- prototype learning. InAAAI, 2025
2025
-
[39]
Neural motifs: Scene graph parsing with global con- text
Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global con- text. InCVPR, 2018
2018
-
[40]
ShieldGemma 2: Robust and tractable image content moderation.arXiv preprint arXiv:2504.01081, 2025
Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, et al. ShieldGemma 2: Robust and tractable image content moderation.arXiv preprint arXiv:2504.01081, 2025
-
[41]
USD: NSFW content detection for text-to-image models via scene graph
Yuyang Zhang, Kangjie Chen, Xudong Jiang, Jiahui Wen, Yihui Jin, Ziyou Liang, Yihao Huang, Run Wang, and Lina Wang. USD: NSFW content detection for text-to-image models via scene graph. In34th USENIX Security Sympo- sium (USENIX Security 25), 2025. 9
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.