pith. machine review for the scientific record.

arxiv: 2604.02669 · v1 · submitted 2026-04-03 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords task-dependent bias · LLM stereotyping · safety alignment limits · implicit association · caste bias · representational harm · benchmark evaluation · explicit vs implicit tasks

The pith

Bias in LLMs is task-dependent, with models suppressing stereotypes in explicit questions but reproducing them in implicit association tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that apparent bias levels in language models change sharply based on the probing method. Models refuse to assign negative stereotypes when asked directly but reliably link them to groups in fill-in-the-blank or indirect association prompts. A new taxonomy of nine bias types, tested across seven tasks and roughly 45,000 prompts on seven models, uncovers consistent patterns including large score gaps between task types. This matters because single-benchmark checks can falsely suggest that alignment has removed bias when it has only redirected it. The work highlights stronger effects on less-studied axes such as caste, linguistic, and geographic bias.

Core claim

The central claim is that bias is task-dependent: models counter stereotypes on explicit probes but reproduce them on implicit ones, with Stereotype Score divergences up to 0.43 between task types for the same model and identity groups. Safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups, but freely associate positive traits with privileged ones. Under-studied bias axes show the strongest stereotyping across all models, suggesting alignment effort tracks benchmark coverage rather than harm severity.

What carries the argument

Hierarchical taxonomy of nine bias types operationalized through seven evaluation tasks spanning explicit decision-making to implicit association, quantified via Stereotype Score.
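The page does not reproduce the paper's formal definition of the Stereotype Score (the referee notes it is introduced without an explicit equation). A minimal sketch of one plausible operationalization, assuming SS is the fraction of non-refusal responses that pick the stereotyped association; the function name and label scheme are hypothetical, not the paper's:

```python
def stereotype_score(responses):
    """Hypothetical Stereotype Score: fraction of answered prompts
    that select the stereotyped association.

    `responses` is a list of labels per prompt:
    'stereotype', 'anti-stereotype', or 'refusal'.
    Refusals are excluded from the denominator, so a model that
    refuses negative-trait prompts can still score high on
    implicit-association prompts it does answer.
    """
    answered = [r for r in responses if r != "refusal"]
    if not answered:
        return None  # model refused every prompt in this slice
    return sum(r == "stereotype" for r in answered) / len(answered)


# A score of 0.5 would indicate no directional preference;
# the paper reports per-axis averages as high as 0.838 (partisan).
```

Under this reading, the reported task-type gaps (up to 0.43) are differences between such per-task scores for the same model and identity groups.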

If this is right

  • Single-task benchmarks systematically mischaracterize the full bias profile of any given model.
  • Current alignment techniques redirect rather than eliminate representational harm.
  • Under-studied bias axes such as caste receive less mitigation and exhibit higher stereotyping.
  • Audits limited to explicit decision tasks will understate bias for the same model and groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future alignment training may need explicit coverage of implicit association formats to close the observed gaps.
  • Bias measurement protocols should standardize a range of task types to prevent over-optimistic safety claims.
  • The asymmetry in positive versus negative trait assignment points to surface-level output filters rather than changed internal representations.

Load-bearing premise

The seven tasks and nine-type taxonomy isolate distinct stereotyping forms without introducing artifacts from prompt wording or task framing.

What would settle it

Finding Stereotype Scores that remain consistent across all seven tasks with no gaps near 0.43, or symmetric refusal of both positive and negative trait associations, would undermine the task-dependence claim.

Figures

Figures reproduced from arXiv: 2604.02669 by Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Prashanth Harshangi, Sahil Agarwal, Tanay Baswa.

Figure 1. Three-level hierarchical taxonomy: bias types define evaluation axes, themes identify social contexts, and topics anchor prompt generation.
Figure 2. Stereotype Score (SS) per model and task, with tasks ordered left to right.
Figure 3. Models are 4–10× more likely to refuse assigning a harmful trait to a marginalised group than to assign a positive trait to a privileged one. Race, partisan, and caste show the largest gaps; SES and linguistic show near-zero asymmetry, receiving no directional protection.

  Bias Type    Neg. refusal  Pos. refusal  Δ
  Race         0.136         0.034         +0.102
  Partisan     0.135         0.035         +0.100
  Caste        0.087         0.010         +0.077
  Geographic   0.057         0.00…         …
Figure 4. Under-studied axes (orange, ≤2 benchmarks) show higher Stereotype Scores than well-studied axes (blue, ≥4 benchmarks) for every model in our study. Bubble size proportional to total prompts. The pattern holds regardless of axis size or model family.

  Bias Type    Avg SS  Refusal  Benchmarks
  Partisan     0.838   0.096    2
  SES          0.758   0.123    2
  Caste        0.727   0.174    1
  Health       0.724   0.069    2
  Linguistic   0.711   0.126    1
  Gender       0.654   0.1…     …
Figure 5. SES (69%) and caste (55%) show the highest stereotype-present rates on Sentence […]
Original abstract

How biased is a language model? The answer depends on how you ask. A model that refuses to choose between castes for a leadership role will, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with lack of hygiene. Single-task benchmarks miss this because they capture only one slice of a model's bias profile. We introduce a hierarchical taxonomy covering 9 bias types, including under-studied axes like caste, linguistic, and geographic bias, operationalized through 7 evaluation tasks that span explicit decision-making to implicit association. Auditing 7 commercial and open-weight LLMs with ~45K prompts, we find three systematic patterns. First, bias is task-dependent: models counter stereotypes on explicit probes but reproduce them on implicit ones, with Stereotype Score divergences up to 0.43 between task types for the same model and identity groups. Second, safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups, but freely associate positive traits with privileged ones. Third, under-studied bias axes show the strongest stereotyping across all models, suggesting alignment effort tracks benchmark coverage rather than harm severity. These results demonstrate that single-benchmark audits systematically mischaracterize LLM bias and that current alignment practices mask representational harm rather than mitigating it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper audits seven LLMs (~45K prompts) using a hierarchical taxonomy of nine bias types (including caste, linguistic, and geographic) operationalized via seven tasks spanning explicit decision-making to implicit association. It claims three patterns: (1) bias is task-dependent, with models countering stereotypes on explicit probes but reproducing them on implicit ones (Stereotype Score divergences up to 0.43); (2) safety alignment is asymmetric, refusing negative traits for marginalized groups while freely associating positive traits with privileged ones; (3) under-studied bias axes exhibit the strongest stereotyping, implying alignment effort tracks benchmark coverage rather than harm severity. The central conclusion is that single-benchmark audits mischaracterize LLM bias and that current alignment practices mask rather than mitigate representational harm.

Significance. If the task-dependence and asymmetry results hold after addressing prompt-construction details, the work provides a useful empirical demonstration that alignment techniques are brittle across elicitation formats. The inclusion of under-studied axes (caste, linguistic, geographic) and the scale of the audit (~45K prompts) strengthen the case that representational harms are systematically under-measured by existing single-task benchmarks. The paper does not ship machine-checked proofs or parameter-free derivations, but the reproducible prompt set and multi-model comparison constitute a concrete contribution to bias-evaluation methodology.

major comments (2)
  1. [§3.2] Task Operationalization: The seven tasks differ systematically in prompt length, presence of refusal-triggering language, use of few-shot examples, and output format (forced choice vs. open completion). These surface differences are not controlled for in the reported Stereotype Score calculations, so the observed divergences of up to 0.43 could be driven by framing artifacts rather than by an explicit-vs-implicit distinction. A concrete test would be to re-run the implicit tasks with the lexical and structural features of the explicit tasks (or vice versa) and report whether the gap persists.
  2. [§4] Results, and Table 4: No error bars, confidence intervals, or statistical significance tests are reported for the Stereotype Score differences across task types or models. Given that the headline claim rests on these numerical divergences, the absence of uncertainty quantification makes it impossible to assess whether the 0.43 gap is robust or within the range of prompt-sampling variability.
minor comments (2)
  1. [Abstract] The abstract states the audit covers “~45K prompts” but does not specify the exact breakdown per task or per bias axis; adding this table (or a supplementary count) would improve reproducibility.
  2. [§3.3] Notation for the Stereotype Score is introduced without an explicit equation; a short formal definition (e.g., Eq. (1) in §3.3) would clarify how positive/negative trait associations are aggregated.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the empirical robustness of our claims about task-dependent stereotyping. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] Task Operationalization: The seven tasks differ systematically in prompt length, presence of refusal-triggering language, use of few-shot examples, and output format (forced choice vs. open completion). These surface differences are not controlled for in the reported Stereotype Score calculations, so the observed divergences of up to 0.43 could be driven by framing artifacts rather than by an explicit-vs-implicit distinction. A concrete test would be to re-run the implicit tasks with the lexical and structural features of the explicit tasks (or vice versa) and report whether the gap persists.

    Authors: We agree that surface-level differences in prompt construction exist across tasks and were not explicitly controlled in the original analysis. These differences are partly inherent to the explicit (decision-making with potential refusals) versus implicit (free association) paradigms we aimed to contrast. To address the concern directly, we will add a controlled ablation in the revision: we will re-run a subset of implicit tasks after standardizing prompt length, removing refusal-triggering phrases, eliminating few-shot examples, and enforcing consistent output formats with the explicit tasks. We will report the resulting Stereotype Score divergences to determine whether the task-dependence pattern holds after controlling for these framing factors. revision: yes
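The promised ablation standardizes surface features across tasks. A minimal sketch of such a prompt normalizer, assuming few-shot examples are marked with an "Example:" prefix and using a hypothetical list of refusal-triggering phrases (both conventions are illustrative, not from the paper):

```python
# Hypothetical refusal-triggering phrases to strip; the paper's
# actual list, if any, is not given on this page.
REFUSAL_TRIGGERS = ["as an AI", "I cannot", "it would be inappropriate"]


def standardize(prompt, max_words=40):
    """Normalize a prompt so implicit and explicit tasks share
    surface features: drop few-shot example lines, remove
    refusal-triggering phrases, and enforce a fixed word budget.
    """
    # Drop few-shot examples (assumed to start with 'Example:').
    lines = [ln for ln in prompt.splitlines()
             if not ln.lstrip().startswith("Example:")]
    text = " ".join(lines)
    # Remove phrases that disproportionately trigger refusals.
    for trigger in REFUSAL_TRIGGERS:
        text = text.replace(trigger, "")
    # Truncate to a shared length budget.
    return " ".join(text.split()[:max_words])
```

Re-running a subset of implicit tasks through such a normalizer, then recomputing per-task scores, tests whether the explicit-vs-implicit gap survives once framing is held constant.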

  2. Referee: [§4] Results, and Table 4: No error bars, confidence intervals, or statistical significance tests are reported for the Stereotype Score differences across task types or models. Given that the headline claim rests on these numerical divergences, the absence of uncertainty quantification makes it impossible to assess whether the 0.43 gap is robust or within the range of prompt-sampling variability.

    Authors: We acknowledge the lack of uncertainty quantification in the reported results. In the revised manuscript, we will add bootstrap confidence intervals (e.g., 95% CI over 1000 resamples of the prompt set) for all Stereotype Score values and differences. We will also include paired statistical tests (Wilcoxon signed-rank or t-tests, as appropriate) comparing scores across task types within each model and bias axis, with p-values and effect sizes. These additions will allow readers to assess whether the observed divergences, including the maximum of 0.43, exceed sampling variability. revision: yes
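The bootstrap procedure the authors commit to can be sketched with the standard library alone; the pairing of per-prompt scores across two tasks is an assumption about how the data would be organized, not a detail from the paper:

```python
import random


def bootstrap_ci(scores_a, scores_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean Stereotype Score gap
    between two tasks, resampling paired per-prompt scores.

    `scores_a` and `scores_b` are equal-length lists of per-prompt
    scores for the same prompts under task A and task B.
    Returns the (lower, upper) bounds of the (1 - alpha) CI.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample prompt indices with replacement (paired resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the interval for a model/axis pair excludes zero, the task-type gap exceeds prompt-sampling variability; the paired Wilcoxon or t-tests the authors mention would complement this with p-values and effect sizes.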

Circularity Check

0 steps flagged

No circularity: purely empirical audit with direct measurements

full rationale

This paper performs an empirical evaluation by constructing 7 tasks, a 9-type bias taxonomy, and ~45K prompts, then measuring Stereotype Scores directly from LLM responses across 7 models. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The reported divergences (e.g., up to 0.43) are computed outputs from the task responses, not self-referential loops. Self-citations, if any, are not load-bearing for the central claims, which rest on the new prompt-based measurements rather than prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; full definitions of Stereotype Score, task operationalizations, and taxonomy details unavailable.

axioms (1)
  • domain assumption The seven tasks validly distinguish explicit decision-making from implicit association without confounding effects from prompt wording.
    Central to interpreting task-dependent differences as evidence of redirected bias.

pith-pipeline@v0.9.0 · 5555 in / 1113 out tokens · 46381 ms · 2026-05-13T20:01:09.086077+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
