Polarization by Default: Auditing Recommendation Bias in LLM-Based Content Curation
Pith reviewed 2026-05-10 07:18 UTC · model grok-4.3
The pith
LLM-based content curators amplify polarization across all providers and prompt strategies
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 540,000 simulated top-10 selections from pools of 100 posts, spanning 54 conditions on Twitter/X, Bluesky, and Reddit data, polarization is amplified in every configuration, toxicity handling inverts sharply between engagement- and information-focused prompts, and sentiment biases are predominantly negative. GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. On Twitter/X, left-leaning authors are systematically over-represented even though right-leaning authors form the pool plurality, and this pattern largely persists across prompts.
What carries the argument
Controlled simulation of top-10 selections from fixed pools of 100 posts using six prompting strategies (general, popular, engaging, informative, controversial, neutral) across three LLM providers, which isolates structural biases from prompt effects.
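For concreteness, here is a minimal sketch of the audit loop this design implies. The prompt wordings and the call_llm wrapper are hypothetical stand-ins, not the paper's actual harness.

```python
import itertools
import json

PROVIDERS = ["openai", "anthropic", "google"]   # GPT-4o Mini, Claude, Gemini

# Hypothetical prompt wordings; the paper's exact templates are not shown here.
STRATEGIES = {
    "general":       "Select the 10 posts you would show a typical user.",
    "popular":       "Select the 10 posts most likely to be widely liked.",
    "engaging":      "Select the 10 posts most likely to drive engagement.",
    "informative":   "Select the 10 most informative posts.",
    "controversial": "Select the 10 most controversial posts.",
    "neutral":       "Select 10 posts as neutrally as possible.",
}

def call_llm(provider: str, prompt: str) -> str:
    """Hypothetical provider wrapper; swap in the relevant SDK call."""
    raise NotImplementedError

def curate_top10(provider: str, instruction: str, pool: list[dict]) -> list[int]:
    """Ask one model to pick 10 post IDs from a fixed pool of 100 posts."""
    numbered = "\n".join(f"[{p['id']}] {p['text']}" for p in pool)
    prompt = f"{instruction}\nReturn a JSON list of 10 post IDs.\n\n{numbered}"
    return json.loads(call_llm(provider, prompt))[:10]

def run_audit(pools: list[list[dict]]) -> list[dict]:
    """Cross every provider with every strategy over every fixed pool."""
    return [
        {"provider": provider, "strategy": name,
         "picked_ids": curate_top10(provider, instruction, pool)}
        for provider, (name, instruction) in itertools.product(
            PROVIDERS, STRATEGIES.items())
        for pool in pools
    ]
```

Holding the pool constant while varying only provider and prompt is what lets the study attribute selection biases to the models rather than to retrieval.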
If this is right
- Polarization increases in LLM-curated feeds even when prompts are designed to be neutral or informative.
- Toxicity selection can be inverted by shifting from engagement to information prompts, but polarization remains elevated.
- Sentiment bias stays negative across most prompt and provider combinations.
- Political leaning bias favors left-leaning authors on Twitter/X data regardless of prompt wording.
- Different providers trade off consistency against adaptivity, so the choice of provider affects which biases dominate.
Where Pith is reading between the lines
- Real deployed systems may produce even stronger polarization once user history and engagement loops are added to the selection process.
- Auditing standards for LLM recommenders could focus on measuring polarization deltas rather than absolute toxicity scores (a sketch of such a delta metric follows this list).
- Prompt engineering alone is unlikely to eliminate demographic skews such as the over-representation of left-leaning authors.
- Testing the same simulation on newer model releases would reveal whether the observed provider trade-offs persist or shift.
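To make the polarization-delta idea above concrete, here is a minimal sketch, assuming each post carries a scalar polarization score from some external classifier; the paper's own metric may be defined differently.

```python
from statistics import mean

def polarization_delta(pool_scores: list[float],
                       picked_scores: list[float]) -> float:
    """Pool-relative amplification: positive means the curated top-10 is
    more polarized than the pool it was drawn from; zero means none."""
    return mean(picked_scores) - mean(pool_scores)

# Illustrative numbers only: a pool averaging 0.40 and a selection
# averaging 0.55 yields a delta of +0.15 (amplification).
pool = [0.1, 0.3, 0.5, 0.7] * 25             # 100 posts
picked = [0.5, 0.55, 0.6] * 3 + [0.55]       # 10 selected posts
print(round(polarization_delta(pool, picked), 3))   # 0.15
```

A per-condition delta like this would let an auditor compare amplification across providers and prompts on a common scale, which absolute toxicity scores do not.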
Load-bearing premise
The simulation of selections from fixed post pools with static prompts accurately reflects the biases that appear when LLMs curate content in live, dynamic user contexts on real platforms.
What would settle it
Running the same top-10 selection task on a live platform feed and measuring whether polarization metrics rise by the same amount as in the fixed-pool simulations.
Original abstract
Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remain poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (general, popular, engaging, informative, controversial, neutral). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. On Twitter/X, where author demographics can be inferred from profile bios, political leaning bias is the clearest demographic signal: left-leaning authors are systematically over-represented despite right-leaning authors forming the pool plurality in the dataset, and this pattern largely persists across prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a controlled simulation study of biases in LLM-based content curation across three providers (OpenAI, Anthropic, Google) and six prompting strategies (general, popular, engaging, informative, controversial, neutral). Using 540,000 top-10 selections from fixed 100-post pools drawn from Twitter/X, Bluesky, and Reddit datasets, it reports that polarization is amplified in all configurations, toxicity handling inverts between engagement- and information-focused prompts, sentiment biases are predominantly negative, GPT-4o Mini is most consistent while Claude and Gemini show high adaptivity, Gemini has the strongest negative sentiment preference, and left-leaning authors are over-represented on Twitter/X despite pool plurality of right-leaning authors.
Significance. If the directional patterns hold beyond the simulation, the work would be significant for social information systems research by providing a comparative audit of biases in LLM recommenders and identifying prompt-sensitive versus structural effects. The scale of the experiment and cross-provider analysis offer a useful empirical baseline for future auditing and mitigation efforts in AI-driven content platforms.
major comments (1)
- [Methods / Experimental Setup] The central claims about polarization amplification, toxicity inversion, and sentiment biases rest on top-10 selections from static pools of 100 posts using six fixed, hand-crafted prompts without user history, engagement signals, or large-scale dynamic retrieval. This controlled setup is load-bearing for generalizing the findings to deployed LLM curation systems, which typically operate over millions of items with personalization and platform-specific fine-tuning; the observed effects could be artifacts of pool size and lack of context rather than robust LLM properties.
minor comments (1)
- [Abstract / Results] The abstract and results sections would benefit from explicit reporting of the statistical tests or confidence intervals used to establish the directional findings (e.g., over-representation of left-leaning authors) rather than relying solely on descriptive patterns.
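As one way to operationalize this suggestion, here is a hedged sketch: a one-sided binomial test of whether left-leaning authors appear among selections more often than their pool share predicts. All counts below are invented for illustration, and a bootstrap over pools would better respect within-pool correlation.

```python
# All counts are invented for illustration; they are not the paper's data.
from scipy.stats import binomtest

pool_share_left = 0.35     # hypothetical fraction of left-leaning authors per pool
selected_left = 4200       # hypothetical left-leaning authors among selected posts
selected_total = 9000      # hypothetical selected posts with inferable leaning

result = binomtest(selected_left, selected_total,
                   p=pool_share_left, alternative="greater")
print(f"observed share: {selected_left / selected_total:.3f}")
print(f"one-sided p-value vs. pool share {pool_share_left}: {result.pvalue:.1e}")
```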
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for acknowledging the potential significance of our comparative audit of LLM curation biases. We address the major comment on the experimental setup below, clarifying our design rationale while acknowledging its limitations for generalization.
Point-by-point responses
Referee: The central claims about polarization amplification, toxicity inversion, and sentiment biases rest on top-10 selections from static pools of 100 posts using six fixed, hand-crafted prompts without user history, engagement signals, or large-scale dynamic retrieval. This controlled setup is load-bearing for generalizing the findings to deployed LLM curation systems, which typically operate over millions of items with personalization and platform-specific fine-tuning; the observed effects could be artifacts of pool size and lack of context rather than robust LLM properties.
Authors: We agree that the controlled simulation uses static pools of 100 posts and fixed prompts without user history or dynamic retrieval, and that this design limits direct generalization to production systems with millions of items and personalization. The setup was chosen deliberately to isolate LLM and prompt effects by holding the input pool constant across 54 conditions and 540,000 selections, enabling attribution of polarization amplification, toxicity inversion, and sentiment biases to the models themselves rather than to retrieval confounders. The consistent polarization increase across all providers, prompts, and datasets suggests a structural pattern rather than a pool-size artifact. We do not claim equivalence to deployed systems. In revision we will expand the Limitations section to explicitly discuss these constraints, note that real-world interactions with personalization remain untested, and outline future work on dynamic and personalized setups. This partially addresses the concern while preserving the value of the current baseline audit.
Revision status: partial
Circularity Check
Empirical simulation with no circular derivations or self-referential reductions
Full rationale
The paper performs a controlled empirical audit by generating 540,000 top-10 selections from static 100-post pools using six fixed prompting strategies across three LLM providers. All reported patterns (polarization amplification, toxicity inversion, negative sentiment bias, provider differences) are direct observational outputs from these simulations rather than outputs of any mathematical derivation, fitted parameter, or equation that reduces to the input data by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the methodology is self-contained and externally falsifiable via replication of the described simulation protocol. This matches the default verdict for an observational study: an honest non-finding of circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: the six prompting strategies (general, popular, engaging, informative, controversial, neutral) represent distinct and meaningful real-world use cases for content curation.
- Domain assumption: top-10 selections from fixed pools of 100 posts simulate the core selection mechanism of LLM-based recommendation systems.