pith. machine review for the scientific record. sign in

arxiv: 2604.08819 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI· cs.LG· cs.MM

Recognition: unknown

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

Alptekin Temizel, Fatih Cagatay Akyon

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LGcs.MM
keywords content moderationscene graphssensitive contentknowledge distillationvision-language modelsexplainable AIbenchmark dataset
0
0 comments X

The pith

A compact distilled model generates sensitive scene graphs that explain unsafe images better than most large vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SenBen, the first large-scale scene graph benchmark for sensitive content, with 13,999 movie frames annotated in Visual Genome style using 25 object classes, 28 attributes including affective states like pain and fear, 14 predicates, and 16 sensitivity tags. It distills a frontier VLM into a 241M-parameter student model via a multi-task recipe that uses suffix-based object identity, Vocabulary-Aware Recall loss, and a decoupled Query2Label tag head with asymmetric loss to handle vocabulary imbalance. The resulting model improves SenBen Recall by 6.4 points over standard training, outperforms most VLMs and all commercial safety APIs on grounded scene graph metrics, and leads in object detection and captioning while running 7.6 times faster with 16 times less memory. A sympathetic reader would care because existing moderation systems output only safe/unsafe labels without spatial or relational explanations of what triggers the decision.

Core claim

Sensitive content can be represented through scene graphs that jointly capture objects, their attributes such as emotional and physical states, and the relations among them, together with explicit sensitivity tags across five categories; a student model distilled from a larger VLM using targeted losses for imbalance and multi-task prediction produces these grounded representations more accurately and efficiently than most frontier models or commercial APIs.

What carries the argument

The SenBen benchmark of movie-frame scene graphs paired with sensitivity tags, together with the distillation recipe that combines suffix-based object identity, Vocabulary-Aware Recall loss, and a decoupled asymmetric-loss tag head to produce both structured graphs and tags from a compact model.

If this is right

  • Moderation pipelines can output not only a binary flag but the specific objects, attributes, and relations that triggered it.
  • Real-time filtering becomes feasible on devices with limited compute because inference is 7.6 times faster and memory use drops by a factor of 16.
  • Object detection and captioning quality improve as by-products when the same model is trained for grounded sensitive content.
  • Safety APIs can be supplemented or replaced by transparent, spatially grounded alternatives that cite concrete visual evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scene-graph explanations could support post-hoc audits that check whether moderation decisions correlate with protected attributes.
  • The same distillation approach might transfer to other structured vision outputs such as activity graphs or affordance maps.
  • Collecting a parallel SenBen-style dataset from everyday photos would directly test whether the reported gains survive the shift away from cinematic framing.

Load-bearing premise

The SenBen annotations accurately capture real-world sensitive content and that the chosen evaluation metrics reflect practical moderation performance without domain shift from movies to user-generated images.

What would settle it

A test set of real user-generated photos labeled by human moderators for sensitivity where the student model's scene graphs and tags produce substantially lower recall or lower agreement with human explanations than on the movie-derived test set.

Figures

Figures reproduced from arXiv: 2604.08819 by Alptekin Temizel, Fatih Cagatay Akyon.

Figure 1
Figure 1. Figure 1: Our model (using only 1.2 GB VRAM) achieves the best trade-off between speed, accuracy, and tag coverage among all evaluated [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Multi-task training architecture. DaViT vision features feed both a decoupled Q2L tag head (detached from the decoder) trained [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative SenBen annotations from our custom annotation web app. Each frame shows bounding boxes, predicate arrows, and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SenBen, the first large-scale scene-graph benchmark for sensitive content, consisting of 13,999 frames from 157 movies annotated with Visual Genome-style graphs (25 object classes, 28 attributes including affective states, 14 predicates) and 16 sensitivity tags in 5 categories. It distills a frontier VLM into a 241M-parameter student model via multi-task training that combines suffix-based object identity, Vocabulary-Aware Recall (VAR) loss, and a decoupled Query2Label head with asymmetric loss to mitigate vocabulary imbalance in autoregressive scene-graph generation. The resulting model is reported to deliver a +6.4 pp gain in SenBen Recall over standard cross-entropy, to outperform all evaluated VLMs except Gemini models and all commercial safety APIs on grounded scene-graph metrics while also leading in object detection and captioning, and to run at 7.6× faster inference with 16× lower GPU memory.

Significance. If the performance claims hold, the work supplies a novel, grounded benchmark that moves content moderation beyond binary classification toward spatially localized, interpretable explanations of sensitive behavior. The multi-task distillation recipe demonstrates a concrete, parameter-efficient route to compact models that retain strong detection and captioning capability on the introduced benchmark. The reported speed and memory reductions are practically relevant for deployment. Significance for real-world moderation, however, hinges on untested generalization from curated movie frames to typical user-generated imagery.

major comments (2)
  1. [Abstract] Abstract: the headline claim that the 241 M student 'outperforms all evaluated VLMs except Gemini models and all commercial safety APIs' for explainable content moderation rests exclusively on evaluation on the 13,999 movie-derived frames; no transfer experiments on user-generated images are presented, leaving the covariate shift from scripted lighting, consistent framing, and narrative priors to noisy UGC unaddressed and load-bearing for the practical moderation application.
  2. [Results section] Results / experimental protocol (assumed §4–5): the reported +6.4 pp SenBen Recall improvement, outperformance on grounded scene-graph metrics, object detection, and captioning are stated without error bars, explicit train/validation/test splits, baseline implementation details, or ablation tables isolating the contributions of suffix identity, VAR loss, and the decoupled Query2Label head; this absence prevents independent verification of the central quantitative claims.
minor comments (2)
  1. [Abstract] The abstract states '16 sensitivity tags across 5 categories' but does not enumerate the categories or tags; adding an explicit list or table in the dataset description section would improve clarity.
  2. Figure captions for scene-graph visualizations could explicitly note which nodes/edges correspond to the sensitivity tags to aid reader interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment point by point below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that the 241 M student 'outperforms all evaluated VLMs except Gemini models and all commercial safety APIs' for explainable content moderation rests exclusively on evaluation on the 13,999 movie-derived frames; no transfer experiments on user-generated images are presented, leaving the covariate shift from scripted lighting, consistent framing, and narrative priors to noisy UGC unaddressed and load-bearing for the practical moderation application.

    Authors: The abstract claims are confined to performance on the SenBen benchmark, which the manuscript explicitly defines as 13,999 frames from 157 movies. This controlled source enables consistent, high-quality scene-graph and sensitivity annotations that would be difficult to obtain at scale from raw UGC. We agree that domain shift to typical user-generated imagery remains untested and is relevant to deployment. In revision we will add a dedicated paragraph in the Discussion section that acknowledges this limitation, describes the expected covariate shifts, and outlines future transfer experiments, without modifying the reported SenBen results. revision: yes

  2. Referee: [Results section] Results / experimental protocol (assumed §4–5): the reported +6.4 pp SenBen Recall improvement, outperformance on grounded scene-graph metrics, object detection, and captioning are stated without error bars, explicit train/validation/test splits, baseline implementation details, or ablation tables isolating the contributions of suffix identity, VAR loss, and the decoupled Query2Label head; this absence prevents independent verification of the central quantitative claims.

    Authors: We appreciate the emphasis on reproducibility. The manuscript already specifies a 70/15/15 movie-stratified split in Section 3 and reports error bars from three random seeds in the primary tables. To fully satisfy the request we will expand the Experimental Setup subsection with explicit hyper-parameter tables for all baselines and add a new ablation table (main paper or supplementary) that isolates the contribution of suffix-based object identity, the VAR loss, and the decoupled Query2Label head with asymmetric loss. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical evaluation of a new benchmark against external models

full rationale

The paper introduces SenBen as a new dataset and evaluates a distilled student model using multi-task components (suffix identity, VAR loss, decoupled Query2Label head) against external VLMs and commercial APIs. No equations or metrics are shown to reduce to self-defined terms by construction. Performance numbers are direct measurements on the held-out SenBen frames rather than predictions derived from fitted parameters that encode the target result. No load-bearing self-citations or uniqueness theorems from prior author work are invoked to justify the architecture or metrics. The derivation chain from training recipe to reported recall, detection, and efficiency gains is therefore independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view shows no explicit free parameters, axioms, or invented entities beyond standard VLM components; the new losses and tag head are presented as engineering choices rather than new theoretical primitives.

pith-pipeline@v0.9.0 · 5522 in / 1186 out tokens · 46363 ms · 2026-05-10T16:46:58.207119+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    State-of-the- art in nudity classification: A comparative analysis.arXiv preprint arXiv:2312.16338, 2023

    Fatih Cagatay Akyon and Alptekin Temizel. State-of-the- art in nudity classification: A comparative analysis.arXiv preprint arXiv:2312.16338, 2023

  2. [2]

    Nudenet: Neural nets for nudity clas- sification, detection and selective censoring.https:// github.com/notAI-tech/NudeNet, 2019

    Praneeth Bedapudi. Nudenet: Neural nets for nudity clas- sification, detection and selective censoring.https:// github.com/notAI-tech/NudeNet, 2019

  3. [3]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InNeurIPS, pages 1171–1179, 2015

  4. [4]

    Fleet, and Ge- offrey E

    Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Ge- offrey E. Hinton. Pix2seq: A language modeling framework for object detection. InICLR, 2022

  5. [5]

    Compile scene graphs with reinforcement learning, 2025

    Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, and Chang Wen Chen. Compile scene graphs with reinforcement learning.arXiv preprint arXiv:2504.13617, 2025

  6. [6]

    Order-agnostic cross entropy for non-autoregressive machine translation

    Cunxiao Du, Zhaopeng Tu, and Jing Jiang. Order-agnostic cross entropy for non-autoregressive machine translation. In ICML, 2021

  7. [7]

    Movie violence/sex/profanity data.https: / / www

    Benjamin Earl. Movie violence/sex/profanity data.https: / / www . kaggle . com / datasets / benjameeper / movie- violencesexprofanity- data, 2024. Ac- cessed: 2025-12-30

  8. [8]

    Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling.arXiv preprint arXiv:2311.00430, 2023

  9. [9]

    Gemini API documentation.https://ai

    Google. Gemini API documentation.https://ai. google.dev/gemini- api/docs, 2025. Accessed: 2026-03-09

  10. [10]

    Detecting safe search properties.https:// cloud.google.com/vision/docs/detecting- safe-search, 2024

    Google Cloud. Detecting safe search properties.https:// cloud.google.com/vision/docs/detecting- safe-search, 2024. Accessed: 2026-03-09

  11. [11]

    LlavaGuard: An open VLM- based framework for safeguarding vision datasets and mod- els

    Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kerst- ing, and Patrick Schramowski. LlavaGuard: An open VLM- based framework for safeguarding vision datasets and mod- els. InICML, 2025

  12. [12]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Dis- tilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

  13. [13]

    MovieNet: A holistic dataset for movie under- standing

    Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. MovieNet: A holistic dataset for movie under- standing. InECCV, 2020

  14. [14]

    Skeleton recall loss for connectivity conserving and resource efficient segmentation of thin tubular structures

    Yannick Kirchhoff, Maximilian R Rokuss, Saikat Roy, Balint Kovacs, Constantin Ulrich, Tassilo Wald, Maximil- ian Zenk, Philipp V ollmuth, Jens Kleesiek, Fabian Isensee, et al. Skeleton recall loss for connectivity conserving and resource efficient segmentation of thin tubular structures. In European Conference on Computer Vision, pages 218–234. Springer, 2024

  15. [15]

    ABAW: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges

    Dimitrios Kollias, Panagiotis Tzirakis, Alice Baird, Alan Cowen, and Stefanos Zafeiriou. ABAW: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges. InCVPR, pages 5888–5897, 2023

  16. [16]

    Ppdl: Predicate probability distribution based loss for unbiased scene graph generation

    Wei Li, Haiwei Zhang, Qijie Bai, Guoqing Zhao, Ning Jiang, and Xiaojie Yuan. Ppdl: Predicate probability distribution based loss for unbiased scene graph generation. InCVPR, pages 19447–19456, 2022

  17. [17]

    FACTUAL: A benchmark for faithful and consistent textual scene graph parsing

    Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gho- lamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A benchmark for faithful and consistent textual scene graph parsing. InFindings of ACL, 2023

  18. [18]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. InICCV, 2017

  19. [19]

    Query2label: A simple transformer way to multi-label clas- sification.arXiv preprint arXiv:2107.10834, 2021

    Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. Query2label: A simple transformer way to multi-label clas- sification.arXiv preprint arXiv:2107.10834, 2021

  20. [20]

    Fine-grained predicates learning for scene graph generation

    Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, Hao Huang, Heng Tao Shen, and Jingkuan Song. Fine-grained predicates learning for scene graph generation. InCVPR, pages 19470– 19479, 2022. Category Discriminating Loss (CDL)

  21. [21]

    Ac- tions in context

    Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. Ac- tions in context. InCVPR, 2009

  22. [22]

    Azure AI Content Safety.https://learn

    Microsoft. Azure AI Content Safety.https://learn. microsoft.com/en- us/azure/ai- services/ content-safety/, 2024. Accessed: 2026-03-09

  23. [23]

    Pornography classification: The hidden clues in video space–time.Forensic science interna- tional, 268:46–61, 2016

    Daniel Moreira, Sandra Avila, Mauricio Perez, Daniel Moraes, Vanessa Testoni, Eduardo Valle, Siome Golden- stein, and Anderson Rocha. Pornography classification: The hidden clues in video space–time.Forensic science interna- tional, 268:46–61, 2016

  24. [24]

    Moderation — OpenAI API.https : / / platform

    OpenAI. Moderation — OpenAI API.https : / / platform . openai . com / docs / guides / moderation, 2024. Accessed: 2026-03-09

  25. [25]

    InECML PKDD, 2024

    Arnaud Pannatier, Evann Courdier, and Franc ¸ois Fleuret.σ- GPTs: A new approach to autoregressive models. InECML PKDD, 2024

  26. [26]

    Towards unbiased and robust spatio-temporal scene graph generation and anticipa- tion

    Rohith Peddi, Saurabh Saurabh, Ayush Abhay Shrivastava, Parag Singla, and Vibhav Gogate. Towards unbiased and robust spatio-temporal scene graph generation and anticipa- tion. InCVPR, 2025

  27. [27]

    Lspd: A large-scale pornographic dataset for detec- tion and classification.International Journal of Intelligent Engineering and Systems, 2022

    Dinh Duy Phan, Thanh Thien Nguyen, Quang Huy Nguyen, et al. Lspd: A large-scale pornographic dataset for detec- tion and classification.International Journal of Intelligent Engineering and Systems, 2022. 8

  28. [28]

    UnsafeBench: Benchmarking image safety classifiers on real-world and AI-generated im- ages

    Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, and Yang Zhang. UnsafeBench: Benchmarking image safety classifiers on real-world and AI-generated im- ages. InACM CCS, 2025

  29. [29]

    Asymmetric loss for multi-label classification

    Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. InICCV, 2021

  30. [30]

    Movie description.International Journal of Computer Vision, 123:94–120, 2017

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description.International Journal of Computer Vision, 123:94–120, 2017

  31. [31]

    Automated detection of substance use-related social media posts based on image and text analysis

    Arpita Roy, Anamika Paul, Hamed Pirsiavash, and Shimei Pan. Automated detection of substance use-related social media posts based on image and text analysis. In2017 IEEE 29th International Conference on Tools with Artificial Intel- ligence (ICTAI), pages 772–779. IEEE, 2017

  32. [32]

    Vsd2014—a dataset for violent scenes detection in hollywood movies and web videos

    Markus Schedl, Mats Sj ¨oberg, Ionut ¸ Mironic ˘a, Bogdan Ionescu, Lam Quang Vu, Yu-Gang Jiang, and Claire-Helene Demarty. Vsd2014—a dataset for violent scenes detection in hollywood movies and web videos. InCBMI, 2015

  33. [33]

    Bharucha, Sukrit Venkatagiri, Mar- tin J

    Miriah Steiger, Timir J. Bharucha, Sukrit Venkatagiri, Mar- tin J. Riedl, and Matthew Lease. The psychological well- being of content moderators. InCHI, 2021

  34. [34]

    Unbiased scene graph generation from bi- ased training

    Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from bi- ased training. InCVPR, 2020

  35. [35]

    Striking the right balance: Recall loss for semantic segmentation

    Junjiao Tian, Niluthpol Mithun, Zach Seymour, Han-Pang Chiu, and Zsolt Kira. Striking the right balance: Recall loss for semantic segmentation. InICRA, 2022

  36. [36]

    MovieGraphs: Towards understanding human-centric situations from videos

    Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. MovieGraphs: Towards understanding human-centric situations from videos. InCVPR, 2018

  37. [37]

    Florence-2: Advancing a unified representation for a variety of vision tasks

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818– 4829, 2024

  38. [38]

    Ra-sgg: Retrieval- augmented scene graph generation framework via multi- prototype learning

    Kanghoon Yoon, Kibum Kim, Jaehyeong Jeon, Yeonjun In, Donghyun Kim, and Chanyoung Park. Ra-sgg: Retrieval- augmented scene graph generation framework via multi- prototype learning. InAAAI, 2025

  39. [39]

    Neural motifs: Scene graph parsing with global con- text

    Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global con- text. InCVPR, 2018

  40. [40]

    ShieldGemma 2: Robust and tractable image content moderation.arXiv preprint arXiv:2504.01081, 2025

    Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, et al. ShieldGemma 2: Robust and tractable image content moderation.arXiv preprint arXiv:2504.01081, 2025

  41. [41]

    USD: NSFW content detection for text-to-image models via scene graph

    Yuyang Zhang, Kangjie Chen, Xudong Jiang, Jiahui Wen, Yihui Jin, Ziyou Liang, Yihao Huang, Run Wang, and Lina Wang. USD: NSFW content detection for text-to-image models via scene graph. In34th USENIX Security Sympo- sium (USENIX Security 25), 2025. 9