arxiv: 2605.20584 · v1 · pith:PBWPTGM6new · submitted 2026-05-20 · 💻 cs.CV

QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

Pith reviewed 2026-05-21 06:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords content rating descriptors · vision-language models · multimodal classification · preference optimization · mobile app marketplaces · content moderation · supervised fine-tuning

The pith

QwenSafe outperforms existing vision-language models at classifying content rating descriptors by using preference alignment on multimodal app data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops QwenSafe to automatically identify standardized content rating descriptors required for mobile apps by processing both app metadata and screenshots together. The challenge is that these descriptors must accurately reflect potentially sensitive content, yet manual verification does not scale well for large marketplaces. The authors create a pipeline called metadata2CRD to build training data that links app materials to specific descriptor definitions and then apply supervised fine-tuning plus direct preference optimization to make the model prefer answers supported by evidence. Evaluation across twelve descriptors shows consistent gains in correctly spotting when a descriptor applies.

Core claim

QwenSafe, obtained by adapting Qwen3-VL-8B on data synthesized with the metadata2CRD pipeline (supervised fine-tuning followed by direct preference optimization), achieves higher accuracy in binary classification of Apple content rating descriptors than the base model and other leading vision-language models. The improvements are most notable in positive-class recall: following the order in which the abstract lists the baselines, 111.8% over Qwen3-VL, 36.1% over LLaVA-1.6, and 2.1% over Gemini-2.5-Flash. This establishes that aligning model predictions to descriptor-specific multimodal evidence enhances automated content rating tasks.
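The recall figures are easiest to read as relative gains. The following is a minimal sketch, assuming the reported percentages are relative improvements in positive-class recall (a gain above 100% cannot be an absolute difference between two recall values). The confusion counts are invented purely for illustration and are not taken from the paper.

```python
def positive_recall(tp: int, fn: int) -> float:
    """Recall on the positive class: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive examples")
    return tp / (tp + fn)


def relative_gain_pct(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    if base <= 0:
        raise ValueError("baseline must be positive")
    return (new - base) / base * 100.0


# Hypothetical counts, NOT the paper's data.
baseline = positive_recall(tp=40, fn=60)    # 40 of 100 positives found
finetuned = positive_recall(tp=80, fn=20)   # 80 of 100 positives found
gain = relative_gain_pct(finetuned, baseline)
print(f"{baseline:.2f} -> {finetuned:.2f}: {gain:.1f}% relative gain")
```

Under this reading, a small relative gain over a strong baseline and a large relative gain over a weak one can correspond to similar absolute recall, which is why the absolute values and sample counts matter when interpreting the three percentages.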

What carries the argument

The metadata2CRD pipeline, which creates descriptor-aligned question-answer pairs, combined with direct preference optimization, which aligns the VLM's outputs with the visual and textual evidence for each content rating descriptor.
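For readers unfamiliar with direct preference optimization, the per-pair objective from Rafailov et al. [37] can be sketched in plain Python. This is an illustration of the general DPO loss, not the authors' training code: the function name, the scalar log-probability inputs, and beta = 0.1 are assumptions, and the paper's actual hyperparameters are not reported.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """DPO loss for a single (chosen, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under
    either the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy matches the reference, the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the chosen answer's likelihood relative to the rejected one lowers it.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In this setting the chosen answer would be the evidence-supported descriptor judgment and the rejected answer an unsupported one; how those pairs are built is specific to the paper's DPO dataset generation stage.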

Load-bearing premise

The metadata2CRD pipeline produces high-quality question-answer pairs that represent real app content and let the model generalize, free of biases introduced by synthesis or image interpretation.

What would settle it

A large-scale evaluation against human expert labels on actual submitted apps, checking whether the reported recall improvements persist outside the synthetic training distribution.

Figures

Figures reproduced from arXiv: 2605.20584 by Aruna Seneviratne, Dishanika Denipitiyage, Suranga Seneviratne.

Figure 1: Comparison of age ratings across major authorities (USK, PEGI, ESRB, IARC, ACB, and Apple) app content rating in Australia, as this research was conducted by setting the geographical location as Australia. Children need constant vigilance and effort to protect their personal data, as they often do not fully understand the risks of how their data is collected and used. According to COPPA §312.4 [35], applic…
Figure 2: Content descriptors of The Sims™ FreePlay and Netflix apps across App Store and Play Store. fine granular second layer compared to Apple. For example, ACB divides violence into 12 sub-categories whereas Apple has only four sub-categories. Compared to Android, Apple recently introduces Parental Controls and Age Assurance as a protection layer for children under 16 years. 2.2 iOS and Android Content Rati…
Figure 3: (a) Content descriptor taxonomy and (b) mapping to 12 different Apple content rating descriptors. This process yields a unified, hierarchical taxonomy that fully covers all 12 content descriptors defined in the iOS ecosystem (cf
Figure 4: The overview of the QwenSafe pipeline. The pipeline is a four-stage process: a) metadata2CRD: constructing QwenSafe training data from Apple app metadata, b) supervised fine-tuning of the Qwen3-VL model, c) DPO dataset generation and d) DPO optimisation. information, as their large user base increases the likelihood of detecting and reporting violations, thereby enhancing the reliability of their content rati…
Figure 5: Example illustrating model behaviour on the Mature/Suggestive Themes descriptor. QwenSafe recognises the subtle cues present in both the screenshot and description and accurately labels the impact as mild, demonstrating improved sensitivity to low-intensity content. Evaluation Metrics: The goal of QwenSafe is to reliably detect the presence of specific content rating descriptors in mobile app metadata. T…
Figure 6: Analysis of non-disclosed CRDs identified by QwenSafe. (i) Distribution of applications containing non-disclosed CRDs across Apple age rating categories (4+, 9+, 12+, and 17+) and descriptor types. (ii) Representative examples of applications where QwenSafe detects CRDs that are not declared in the app metadata. restricted web access, and contests, as Apple does not provide severity annotations). Full per…
Original abstract

Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlight the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents QwenSafe, a VLM based on Qwen3-VL-8B that is fine-tuned with SFT followed by DPO to identify the presence of 12 Apple-defined content rating descriptors (CRDs) from joint app metadata and screenshots. It introduces the metadata2CRD pipeline to synthesize descriptor-aligned QA pairs from descriptions, screenshots, and formal definitions. The central empirical claim is that QwenSafe outperforms baselines (Qwen3-VL, LLaVA-1.6, Gemini-2.5-Flash) in binary CRD classification, with positive-class recall gains of 111.8%, 36.1%, and 2.1% respectively.

Significance. If the reported gains reflect genuine multimodal generalization rather than pipeline artifacts, the work could support more scalable and consistent automated content rating for app marketplaces. The combination of descriptor-specific definitions with DPO for evidence alignment is a sensible technical choice for this safety-oriented task. However, the absence of dataset statistics, split details, and external validation substantially weakens the strength of the conclusions.

major comments (2)
  1. [§4.2 and Table 1] §4.2 and Table 1: The manuscript reports large positive-class recall improvements but provides no information on evaluation dataset size, per-descriptor sample counts, train-test split ratios, or statistical significance testing. This information is required to assess whether the 111.8%, 36.1%, and 2.1% gains are reliable or could arise from variance or imbalance.
  2. [§3.1] §3.1 (metadata2CRD pipeline): The evaluation set is generated by the same synthesis procedure used for training data, with no experiments or analysis addressing possible data leakage, keyword injection, or distribution shift. The central claim that QwenSafe performs robust joint metadata+screenshot reasoning therefore requires external validation on human-annotated real-world apps, which is not reported.
minor comments (2)
  1. [Abstract] Abstract: The sentence reporting recall improvements lists three percentages but does not explicitly map them to the three named baselines; adding this mapping would improve clarity.
  2. [§2] §2 (Related Work): The discussion of prior VLM safety and content moderation work is brief; adding references to recent multimodal safety benchmarks would better situate the contribution.
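One way to make the leakage concern in major comment 2 checkable is to split at the app level, so that no app contributes examples to both training and evaluation. The sketch below is an assumption about how such a check could look, not the authors' procedure: the `app_id` field, the descriptor names, and the 20% test fraction are all hypothetical.

```python
import hashlib


def assign_split(app_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a whole app to 'train' or 'test' by hashing its ID."""
    digest = hashlib.sha256(app_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def leaked_apps(train_examples, test_examples):
    """Return the app IDs that appear on both sides of the split."""
    train_ids = {ex["app_id"] for ex in train_examples}
    test_ids = {ex["app_id"] for ex in test_examples}
    return train_ids & test_ids


# Hypothetical QA examples: several descriptor questions per app.
examples = [
    {"app_id": f"app-{i}", "descriptor": d}
    for i in range(50)
    for d in ("violence", "gambling")
]
train = [ex for ex in examples if assign_split(ex["app_id"]) == "train"]
test = [ex for ex in examples if assign_split(ex["app_id"]) == "test"]
print(len(train), len(test), leaked_apps(train, test))
```

Because the assignment depends only on the app ID, every question generated for one app lands in the same split. This addresses direct leakage only; it says nothing about keyword injection or distribution shift, which need separate ablations.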

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below, indicating the revisions we will incorporate to address the concerns raised.

  1. Referee: [§4.2 and Table 1] §4.2 and Table 1: The manuscript reports large positive-class recall improvements but provides no information on evaluation dataset size, per-descriptor sample counts, train-test split ratios, or statistical significance testing. This information is required to assess whether the 111.8%, 36.1%, and 2.1% gains are reliable or could arise from variance or imbalance.

    Authors: We agree that these details are necessary to properly evaluate the reliability of the reported gains. The current manuscript provides only high-level dataset descriptions in §4. In the revised version we will expand §4.2 and Table 1 to report the total size of the evaluation set, the number of positive and negative samples per descriptor, the train-test split ratios employed, and the results of statistical significance tests (e.g., McNemar’s test) comparing QwenSafe against the baselines. These additions will allow readers to assess whether the observed improvements are robust to variance and class imbalance. revision: yes

  2. Referee: [§3.1] §3.1 (metadata2CRD pipeline): The evaluation set is generated by the same synthesis procedure used for training data, with no experiments or analysis addressing possible data leakage, keyword injection, or distribution shift. The central claim that QwenSafe performs robust joint metadata+screenshot reasoning therefore requires external validation on human-annotated real-world apps, which is not reported.

    Authors: We acknowledge the validity of this concern. The metadata2CRD pipeline relies on formal descriptor definitions rather than surface-level keywords, and we used disjoint app sets for training and evaluation to reduce direct leakage. Nevertheless, we did not include explicit ablation studies on keyword injection or distribution shift. In the revision we will add such analysis (e.g., performance after removing obvious keyword cues) and clarify the steps taken to ensure separation between splits. We agree that external validation on independently human-annotated real-world apps would provide stronger evidence of generalization beyond the synthetic distribution; we will explicitly note this as a limitation and outline it as future work. revision: partial
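The first response names McNemar's test as a candidate significance test. A minimal sketch of the exact two-sided version for comparing two classifiers on the same evaluation examples follows; the helper names and the toy labels are illustrative assumptions, not data from the paper.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count examples where exactly one of the two models is correct."""
    b = c = 0
    for y, a, m in zip(y_true, pred_a, pred_b):
        if a == y and m != y:
            b += 1  # model A right, model B wrong
        elif a != y and m == y:
            c += 1  # model B right, model A wrong
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    Under the null hypothesis, the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Toy labels and predictions, NOT the paper's data.
y_true = [1, 1, 1, 0, 0, 1]
pred_a = [1, 1, 1, 0, 0, 1]
pred_b = [0, 1, 0, 0, 1, 0]
b, c = discordant_counts(y_true, pred_a, pred_b)
print(b, c, mcnemar_exact(b, c))
```

The test uses only the examples on which the two models disagree, so it is well suited to paired comparisons on a shared evaluation set. With 12 descriptors, a per-descriptor analysis would also need a multiple-comparison correction.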

standing simulated objections not resolved
  • External validation on human-annotated real-world apps is not available in the current study and would require new data collection outside the scope of this work.

Circularity Check

0 steps flagged

No circularity in empirical evaluation of VLM fine-tuning pipeline

The paper presents a standard empirical ML workflow: it introduces a data synthesis pipeline (metadata2CRD) to generate training pairs from app metadata, screenshots, and descriptor definitions, performs supervised fine-tuning followed by DPO on Qwen3-VL-8B, and reports direct performance metrics (positive-class recall improvements) against external baselines on a binary classification task for 12 descriptors. These metrics are measured outcomes on an evaluation set and do not reduce by any equations or definitions to quantities that are tautologically equivalent to the training inputs or fitted parameters. No mathematical derivations, self-citations, uniqueness theorems, or ansatzes are present in the provided text that would create a load-bearing circular chain. The central claims rest on observable model outputs rather than self-referential constructions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the quality and representativeness of the synthesized training data and on the assumption that joint multimodal reasoning over metadata and screenshots is sufficient to determine CRD presence.

free parameters (1)
  • DPO and SFT hyperparameters
    The supervised fine-tuning and direct preference optimization stages involve multiple tunable parameters whose specific values are not reported.
axioms (1)
  • domain assumption: App metadata and screenshots jointly contain sufficient evidence to determine the presence or absence of each CRD.
    The model is trained and evaluated under the premise that these two modalities are adequate inputs for the classification task.

pith-pipeline@v0.9.0 · 5807 in / 1413 out tokens · 42811 ms · 2026-05-21T06:06:26.466604+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

  1. [1] Apple Newsroom: Apple expands tools to help parents protect kids and teens online (2025), https://www.apple.com/au/newsroom/2025/06/apple-expands-tools-to-help-parents-protect-kids-and-teens-online/#:~:text=12%20June%202025-,Apple%20expands%20tools%20to%20help%20parents%20protect%20kids%20and%20teens,they%20set%20up%20their%20device
  2. [2] 42matters: Google play statistics and trends 2025 (2025), https://42matters.com/google-play-statistics-and-trends
  3. [3] 42matters: ios apple app store statistics and trends 2025 (2025), https://42matters.com/ios-apple-app-store-statistics-and-trends
  4. [4] Apple Inc.: Age ratings values and definitions (2025), https://developer.apple.com/help/app-store-connect/reference/age-ratings-values-and-definitions
  5. [5] Apple Inc.: Choosing a category. https://developer.apple.com/app-store/categories/ (2025)
  6. [6] Apple Inc.: Set an app age rating. https://developer.apple.com/help/app-store-connect/manage-app-information/set-an-app-age-rating (2025)
  7. [7] Australasian Legal Information Institute: Online content regulation. https://www.austlii.edu.au/cgi-bin/viewdb/au/legis/cth/consol_act/bsa1992214/ (1992)
  8. [8] Australasian Legal Information Institute: ONLINE SAFETY ACT 2021 - SECT 105. https://www.austlii.edu.au/cgi-bin/viewdoc/au/legis/cth/consol_act/osa2021154/s105.html (2021)
  9. [9] Bai, S., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
  10. [10] Board, E.S.R.: Rating guide (1994), https://www.esrb.org/ratings-guide/
  11. [11] Canadian centre for child protection: Reviewing the enforcement of app age ratings in apple’s app store and google play. https://content.c3p.ca/pdfs/C3P_AppAgeRatingReport_en.pdf (2022)
  12. [12] Cao, R., Hee, M.S., Kuek, A., Chong, W.H., Lee, R.K.W., Jiang, J.: Pro-cap: Leveraging a frozen vision-language model for hateful meme detection. In: Proceedings of the 31st ACM international conference on multimedia. pp. 5244–5252 (2023)
  13. [13] Carter, M., Zhangshao, T., Hardwick, T., Egliston, B., Xiao, L.Y.: Investigating mobile games’ compliance with australia’s 2024 mandatory minimum age classifications scheme for gambling-like mechanics. Available at SSRN (2025)
  14. [14] Chen, Y., Xu, H., Zhou, Y., Zhu, S.: Is this app safe for children? a comparison study of maturity ratings on Android and iOS applications. In: Proceedings of the 22nd international conference on World Wide Web. pp. 201–212 (2013)
  15. [15] Chiu, K.L., Collins, A., Alexander, R.: Detecting hate speech with gpt-3. arXiv preprint arXiv:2103.12407 (2021)
  16. [16] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  17. [17] Denipitiyage, D., Silva, B., Seneviratne, S., Seneviratne, A., Chawla, S.: Detecting content rating violations in android applications: A vision-language approach. arXiv preprint arXiv:2502.15739 (2025)
  18. [18] eSafety commissioner, Australia: Illegal and restricted online content. https://www.esafety.gov.au/key-topics/Illegal-restricted-content (2024)
  19. [19] eSafety commissioner, Australia: Illegal and restricted online content (2024)
  20. [20] European General Data Protection Regulation: General data protection regulation gdpr. (2016), https://gdpr-info.eu/
  21. [21] Google: App Discovery with Google Play, Part 3: Machine Learning to Fight Spam and Abuse at Scale. https://research.google/blog/app-discovery-with-google-play-part-3-machine-learning-to-fight-spa/m-and-abuse-at-scale/ (Mar 2015)
  22. [22] Google: Keeping google play safe for users and developers: June 29, 2023 (2023), https://support.google.com/googleplay/android-developer/answer/13721042?hl=en
  23. [23] Google: Apps & games content ratings on google play (2025), https://support.google.com/googleplay/answer/6209544?hl=en
  24. [24] Guo, K., Hu, A., Mu, J., Shi, Z., Zhao, Z., Vishwamitra, N., Hu, H.: An investigation of large language models for real-world hate speech detection. In: 2023 International Conference on Machine Learning and Applications (ICMLA). pp. 1568–1573. IEEE (2023)
  25. [25] Haotian Liu, Chunyuan Li, Y.L., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
  26. [26] Hu, B., Liu, B., Gong, N.Z., Kong, D., Jin, H.: Protecting your children from inappropriate content in mobile apps: An automatic maturity rating framework. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 1111–1120 (2015)
  27. [27] Ibrahim, H.: Google play review times: Expectations and tips to streamline approval (Nov 2024), https://median.co/blog/google-play-review-times-what-to-expect-and-how-to-streamline-approval
  28. [28] Interactive Software Federation of Europe (ISFE): Pegi-pan-european game information. http://www.pegi.info/en/index/id/952 (2003)
  29. [29] International Age Rating Coalition: How iarc works. (2025), https://www.globalratings.com/how-iarc-works.aspx
  30. [30] Jensen, K.T.: 11 iphone apps that got banned and why (2022), https://au.pcmag.com/iphone-apps/95993/11-iphone-apps-that-got-banned-and-why
  31. [31] Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., Testuggine, D.: The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems 33, 2611–2624 (2020)
  32. [32] Liu, M., Wang, H., Guo, Y., Hong, J.: Identifying and analyzing the privacy of apps for kids. In: Proceedings of the 17th International Workshop on Mobile Computing Systems and Applications. pp. 105–110 (2016)
  33. [33] Mathew, B., Saha, P., Yimam, S.M., Biemann, C., Goyal, P., Mukherjee, A.: Hatexplain: A benchmark dataset for explainable hate speech detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 14867–14875 (2021)
  34. [34] Motion Picture Association: Film rating (1968), https://www.motionpictures.org/film-ratings/
  35. [35] National Archives: Children’s online privacy protection rule (2022), https://www.ecfr.gov/current/title-16/chapter-I/subchapter-C/part-312
  36. [36] Play Store: Creating apps and games for children and families. https://play.google.com/console/about/programs/families/ (2015)
  37. [37] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36, 53728–53741 (2023)
  38. [38] Regional Development Department of Infrastructure, Transport and Communication: The advisory categories for films and computer games. (2025), https://www.classification.gov.au/classification-ratings/what-are-ratings
  39. [39] Sun, R., Xue, M., Tyson, G., Wang, S., Camtepe, S., Nepal, S.: Not seen, not heard in the digital world! measuring privacy practices in children’s apps. In: Proceedings of the ACM Web Conference 2023. pp. 2166–2177 (2023)
  40. [40] Unterhaltungssoftware Selbstkontrolle: SK age categories. (2025), https://usk.de/en/the-usk/faqs/age-categories/
  41. [41] Wang, H., Tan, R.Y., Lee, R.K.W.: Cross-modal transfer from memes to videos: Addressing data scarcity in hateful video detection. In: Proceedings of the ACM on Web Conference 2025. pp. 5255–5263 (2025)
  42. [42] Xiao, L.Y., Lund, M.L.: Non-compliance with and non-enforcement of uk loot box industry self-regulation on the apple app store: a longitudinal study on poor implementation. Royal Society Open Science 12(5), 250704 (2025)
  43. [43] Zhou, C., Zhan, X., Li, L., Liu, Y.: Automatic maturity rating for Android apps. In: Proceedings of the 13th Asia-Pacific Symposium on Internetware. pp. 16–27 (2022)

A Appendix

Due to space constraints, we provide the complete multi-class classification results across all descriptors in Table 3. This table reports mild and strong precision and recall for all methods...