Who Decides What Is Harmful? Content Moderation Policy Through A Multi-Agent Personalised Inference Framework
Pith reviewed 2026-05-09 18:15 UTC · model grok-4.3
The pith
A multi-agent LLM framework personalises content moderation to individual user sensitivity profiles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that their LLM-based multi-agent personalised inference framework, which integrates domain-specific Expert Agents, a Manager Agent for orchestration, and a Ghost Profile Agent for simulating user perspectives, produces moderation decisions that better match individual perceptions of harm than non-personalised systems.
What carries the argument
The multi-agent personalised inference framework combining domain-specific Expert Agents, a Manager Agent for orchestrating analysis and selection, and a Ghost Profile Agent for simulating user perspectives to inform moderation decisions.
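The page gives no implementation detail, so the following is a rough sketch only of how such an orchestration loop might be wired. Every class name and the call_llm helper are hypothetical stand-ins, not the authors' code:

```python
# Hypothetical sketch of the three-agent pipeline; all names are
# illustrative, not taken from the paper.
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    # Stand-in for any chat-completion API; returns a canned reply so the
    # sketch runs end to end without network access.
    return f"[LLM reply to: {prompt[:50]}...]"


@dataclass
class ExpertAgent:
    domain: str  # e.g. "violence", "misinformation"

    def analyze(self, content: str) -> str:
        return call_llm(f"As a {self.domain} expert, assess: {content}")


@dataclass
class GhostProfileAgent:
    profile: dict  # per-user sensitivities, e.g. {"violence": "high"}

    def simulate(self, content: str, reports: list) -> str:
        return call_llm(
            f"Acting as a user with sensitivities {self.profile}, given the "
            f"expert reports {reports}, say whether this feels harmful: {content}"
        )


class ManagerAgent:
    def __init__(self, experts):
        self.experts = experts

    def select(self, content: str):
        # Route content to relevant domains; a trivial keyword stub here.
        hits = [e for e in self.experts if e.domain in content.lower()]
        return hits or self.experts

    def moderate(self, content: str, ghost: GhostProfileAgent) -> str:
        reports = [e.analyze(content) for e in self.select(content)]
        reaction = ghost.simulate(content, reports)
        return call_llm(
            f"Expert reports: {reports}\nSimulated reaction: {reaction}\n"
            "Decide: allow, warn, or filter."
        )


manager = ManagerAgent([ExpertAgent("violence"), ExpertAgent("misinformation")])
ghost = GhostProfileAgent({"violence": "high", "profanity": "low"})
print(manager.moderate("A clip showing a street fight.", ghost))
```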
If this is right
- Moderation policies can scale while accommodating subjective differences in harm perception across users.
- Platforms receive a concrete method to reconcile centralized rules with individual digital rights and autonomy.
- The architecture supplies policy insights for governance that balance societal standards and personal sensitivities.
- Accuracy improvements of up to 32 percent over non-personalised baselines become possible through agent-based personalisation.
Where Pith is reading between the lines
- This method could reduce widespread user complaints about over-moderation or under-moderation by adapting per profile.
- Similar agent structures might apply to other subjective platform decisions such as recommendation or privacy controls.
- Implementation would require safeguards to prevent the simulation from encoding biased or incomplete user models.
- Testing across different cultural or demographic groups would reveal whether the personalization generalizes beyond the evaluated cases.
Load-bearing premise
That the Ghost Profile Agent can reliably simulate individual users' subjective perceptions of harm, and that this simulation produces moderation decisions that genuinely align with real user sensitivities.
What would settle it
A direct comparison study measuring how often the system's moderation outputs match real users' own judgments on identical content samples.
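Such a study reduces to a straightforward agreement computation. A minimal sketch with invented labels, reporting chance-corrected agreement alongside raw accuracy:

```python
# Hypothetical settling experiment: compare the system's per-user
# moderation decisions with the same users' own judgments on identical
# content. All labels below are illustrative.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# One entry per (user, content item) pair: 1 = judged harmful.
user_judgments   = [1, 0, 1, 1, 0, 0, 1, 0]  # real users' own calls
system_decisions = [1, 0, 1, 0, 0, 1, 1, 0]  # personalised system output

print("raw agreement:", accuracy_score(user_judgments, system_decisions))
# Kappa corrects for chance agreement, which matters when harm labels
# are imbalanced across the sample.
print("Cohen's kappa:", cohen_kappa_score(user_judgments, system_decisions))
```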
read the original abstract
The increasing scale and complexity of online platforms raises critical policy questions around harmful content, digital well-being, and user autonomy. Traditional content moderation systems rely on centralised, top-down rules, often failing to accommodate the subjective nature of harm perception. This paper proposes an LLM-based multi-agent personalised inference framework that filters content based on unique sensitivity profiles of individual users. Our architecture combines domain-specific Expert Agents, a Manager Agent for orchestrating content analysis and agent selection, and a Ghost Profile Agent for simulating user perspectives, to inform moderation decisions. Evaluated against a range of non-personalised baselines, the system demonstrates up to a 32% improvement in accuracy, showing increased alignment with individual user sensitivities. Beyond technical performance, our framework provides policy-relevant insights for platform governance, providing a scalable way to reconcile moderation policies with societal and individual digital rights.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an LLM-based multi-agent framework for personalized content moderation. It combines domain-specific Expert Agents, a Manager Agent for orchestration and agent selection, and a Ghost Profile Agent to simulate individual users' subjective harm perceptions. The system is evaluated against non-personalized baselines and claims up to 32% accuracy improvement with better alignment to user sensitivities, while also offering policy insights for platform governance and digital rights.
Significance. If the evaluation were properly grounded, the framework could offer a scalable technical approach to reconciling centralized moderation policies with individual differences in harm perception, which is a persistent challenge in content moderation research. The multi-agent design provides a concrete architecture for personalization that could inform governance discussions, but the current lack of validation against human data limits its contribution to either technical or policy literature.
major comments (2)
- [Abstract] The central claim of 'up to a 32% improvement in accuracy' is presented without any information on the evaluation dataset, baseline definitions, evaluation protocol, statistical tests, or error analysis. This information is required to determine whether the reported gain reflects genuine alignment with user sensitivities or internal consistency within the LLM agents.
- [Framework and Evaluation sections] The Ghost Profile Agent is described as simulating user perspectives to produce personalized moderation decisions, yet no correlation, user study, or human-labeled validation is reported between the simulated profiles and actual participant ratings of harm. Without this grounding, the accuracy metric risks measuring LLM self-consistency rather than external validity, which directly undermines the alignment claim.
minor comments (1)
- [Abstract] The phrase 'increased alignment with individual user sensitivities' is used without defining the alignment metric or how it was quantified.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which identifies key areas where additional transparency will strengthen the manuscript. We address each major comment below and commit to revisions that clarify the evaluation details and explicitly discuss the simulation-based approach.
read point-by-point responses
- Referee: [Abstract] The central claim of 'up to a 32% improvement in accuracy' is presented without any information on the evaluation dataset, baseline definitions, evaluation protocol, statistical tests, or error analysis. This information is required to determine whether the reported gain reflects genuine alignment with user sensitivities or internal consistency within the LLM agents.
  Authors: We agree that the abstract requires more supporting context for the central claim. In the revised manuscript, we will expand the abstract to briefly specify the evaluation dataset (content samples paired with diverse simulated sensitivity profiles), the non-personalised baselines (standard LLM classifiers and heuristic moderation), the protocol (comparative accuracy against profile-aligned decisions), and references to statistical tests and error analysis detailed in the Evaluation section (a sketch of one such paired test follows these responses). This will help readers assess whether the 32% gain reflects alignment with simulated sensitivities. Revision: yes.
- Referee: [Framework and Evaluation sections] The Ghost Profile Agent is described as simulating user perspectives to produce personalized moderation decisions, yet no correlation, user study, or human-labeled validation is reported between the simulated profiles and actual participant ratings of harm. Without this grounding, the accuracy metric risks measuring LLM self-consistency rather than external validity, which directly undermines the alignment claim.
  Authors: This observation is correct and highlights a genuine limitation of the current work. The evaluation measures alignment between the multi-agent outputs and decisions generated from the Ghost Profile Agent's simulated sensitivities, without human participant data or correlation studies. We will revise the Framework and Evaluation sections to state this limitation explicitly, elaborate on the profile simulation method (including prompt design for consistency), and add a dedicated future-work subsection outlining plans for human validation studies (a sketch of such a correlation check also appears below). The framework remains a technical contribution for scalable personalisation, but we agree external human grounding is needed to fully support the alignment claims. Revision: partial.
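The first response promises statistical tests without naming them. One conventional choice for comparing two classifiers on the same items is McNemar's test; a sketch with hypothetical counts:

```python
# McNemar's test asks whether the personalised system and a baseline
# disagree asymmetrically on identical content samples. Counts are
# invented for illustration.
from statsmodels.stats.contingency_tables import mcnemar

# Disagreement table over identical items:
#                    baseline correct   baseline wrong
#  system correct          412                95
#  system wrong             31                62
table = [[412, 95],
         [31,  62]]

result = mcnemar(table, exact=False, correction=True)
print(f"chi2 = {result.statistic:.2f}, p = {result.pvalue:.4g}")
# A small p-value means the accuracy gap is unlikely to be sampling noise.
```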
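The second response defers human validation to future work; once that data exists, the check is a one-line correlation. A sketch with invented ratings:

```python
# Hypothetical version of the human-grounding check: correlate the Ghost
# Profile Agent's simulated harm scores with real participants' ratings
# of the same items (all data illustrative).
from scipy.stats import spearmanr

simulated = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]  # Ghost Profile Agent scores
human     = [0.8, 0.3, 0.6, 0.5, 0.9, 0.2]  # participant ratings

rho, p = spearmanr(simulated, human)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# High rho would ground the alignment claim; low rho would suggest the
# accuracy metric measures LLM self-consistency, as the referee warns.
```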
Circularity Check
No significant circularity in empirical evaluation of multi-agent framework
full rationale
The paper presents an LLM-based multi-agent architecture (Expert Agents, Manager Agent, Ghost Profile Agent) for personalised content moderation and reports an empirical result: up to 32% accuracy improvement over non-personalised baselines. No equations, derivations, or load-bearing steps are described that reduce this claim to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. The accuracy metric is positioned as an external comparison, not something internally forced by the framework's own logic or assumptions. This is a standard, non-circular empirical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can effectively simulate individual user perspectives on harmful content via a Ghost Profile Agent (a minimal prompt sketch follows the ledger).
invented entities (1)
- Ghost Profile Agent: no independent evidence
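The ledger's single axiom and its single invented entity both rest on prompting an LLM with a user profile. A minimal, hypothetical sketch of what such a Ghost Profile prompt could look like; the fields and wording are invented, not taken from the paper:

```python
# Hypothetical Ghost Profile prompt construction. The axiom above
# presupposes that a prompt like this can elicit faithful simulation of
# one user's harm perception.
profile = {
    "age_band": "25-34",
    "sensitivities": {"graphic violence": "high", "profanity": "low"},
    "history": ["flagged two violent videos last month"],
}

def ghost_profile_prompt(profile: dict, content: str) -> str:
    return (
        "You are simulating a specific platform user.\n"
        f"Profile: {profile}\n"
        f"Content under review: {content!r}\n"
        "Answer as this user would: is this content harmful to you? "
        "Reply 'harmful' or 'not harmful' with a one-sentence reason."
    )

print(ghost_profile_prompt(profile, "A news clip showing a street fight."))
```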