pith. machine review for the scientific record.

arxiv: 2604.05224 · v1 · submitted 2026-04-06 · 💻 cs.AI

Recognition: 2 Lean theorem links

Attribution Bias in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords quote attribution, attribution bias, large language models, demographic bias, suppression, representational fairness, AttriBench, information retrieval

The pith

Large language models display systematic disparities in quote attribution accuracy across race, gender, and intersectional groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces AttriBench, the first quote attribution benchmark dataset balanced for author fame and demographics, to enable controlled tests of bias in how LLMs credit sources. Evaluations of 11 models across prompt settings show large accuracy gaps by race, gender, and their combinations, plus a distinct failure mode called suppression where models omit attribution even when they have the information. Suppression rates also vary unevenly by group, meaning overall accuracy numbers hide additional fairness problems. A reader would care because LLMs increasingly power search and information retrieval, so biased attribution can distort credit for ideas and content.
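A minimal sketch of the evaluation loop this implies, assuming a hypothetical query_model call and AttriBench-style records with quote, author, and demographic group fields; the prompt wording and string-match scoring are illustrative stand-ins, not the paper's protocol:

```python
from collections import defaultdict

def query_model(model: str, prompt: str) -> str:
    """Hypothetical LLM call; returns the model's free-text response."""
    raise NotImplementedError

def subgroup_accuracy(records, model: str) -> dict:
    """Attribution accuracy per demographic subgroup for one model."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        # Direct prompting: explicitly ask who said the quote.
        answer = query_model(model, f'Who said the following quote? "{r["quote"]}"')
        total[r["group"]] += 1
        # Lenient substring match as a stand-in for the paper's scoring rule.
        if r["author"].lower() in answer.lower():
            correct[r["group"]] += 1
    return {g: correct[g] / total[g] for g in total}
```

Gaps between the returned per-group rates are the disparities at issue; the indirect-prompting and suppression variants would differ only in the prompt wording and in counting omissions rather than errors.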

Core claim

AttriBench is constructed as a fame- and demographically-balanced quote attribution benchmark to isolate demographic effects. Testing reveals large and systematic disparities in attribution accuracy between race, gender, and intersectional groups. Suppression, a distinct failure mode in which models omit attribution entirely despite access to authorship information, proves widespread and unevenly distributed across demographic groups, exposing systematic biases not captured by standard accuracy metrics.

What carries the argument

AttriBench, a quote attribution benchmark dataset balanced by author fame and demographics to support controlled investigation of bias effects on accuracy and suppression.
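One way such balancing could be realized is stratified sampling over fame-by-subgroup cells. A sketch under assumed record fields (fame as a Google-Search-hits proxy, group as a demographic label); the bin edges and per-cell quota are illustrative, not the paper's construction pipeline:

```python
import random

def balance(candidates, fame_bins, per_cell: int, seed: int = 0) -> list:
    """Sample an equal number of quotes per (fame bin, subgroup) cell.

    fame_bins is a list of [lo, hi) ranges that must cover every
    candidate's fame value; cells with fewer than per_cell members
    are downsampled to what is available.
    """
    rng = random.Random(seed)
    cells = {}
    for q in candidates:
        b = next(i for i, (lo, hi) in enumerate(fame_bins)
                 if lo <= q["fame"] < hi)
        cells.setdefault((b, q["group"]), []).append(q)
    return [q for members in cells.values()
            for q in rng.sample(members, min(per_cell, len(members)))]
```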

If this is right

  • Quote attribution accuracy is not uniform but shows large systematic disparities by author race, gender, and intersections.
  • Suppression occurs frequently and varies across groups, so accuracy alone understates the fairness issue.
  • These patterns hold across different prompt settings in frontier models.
  • Quote attribution can serve as a benchmark for measuring representational fairness beyond overall performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If these patterns hold in deployed systems, LLMs used for research or news summarization could systematically under-credit authors from certain demographic groups.
  • Auditing new models with AttriBench-style balanced data could help identify and reduce both accuracy gaps and suppression.
  • Similar uneven suppression might occur in related tasks such as fact-checking or source citation in generated text.

Load-bearing premise

The fame- and demographically-balanced construction of AttriBench successfully isolates demographic effects on attribution without introducing new confounding variables from quote selection, balancing methods, or prompt variations.
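The premise is testable: after balancing, quote-level covariates should be compared across subgroups. A sketch assuming SciPy, with word count as the example feature; each feature the referee report below names (length, syntactic complexity, topic, domain) would get the same treatment:

```python
from itertools import combinations
from scipy.stats import ks_2samp

def covariate_balance(records, feature=lambda r: len(r["quote"].split())) -> dict:
    """Pairwise two-sample KS tests of a quote-level feature across subgroups."""
    by_group = {}
    for r in records:
        by_group.setdefault(r["group"], []).append(feature(r))
    report = {}
    for g1, g2 in combinations(sorted(by_group), 2):
        res = ks_2samp(by_group[g1], by_group[g2])
        # A small p-value flags a covariate imbalance between the two groups.
        report[(g1, g2)] = {"ks": res.statistic, "p": res.pvalue}
    return report
```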

What would settle it

Repeating the evaluations on an independently built quote attribution dataset that is also balanced for fame and demographics but uses different quote sources and selection methods, then checking whether the same accuracy disparities and uneven suppression rates appear.

Figures

Figures reproduced from arXiv: 2604.05224 by Bella Chang, Daniel B. Neill, Eliza Berman, Emily Black.

Figure 1: Example of suppression in quote attribution. GPT-5.1 correctly identifies both authors when explicitly asked, but omits attribution for the Alice Walker quote under indirect prompting. Both authors have similar fame, as measured by Google Search hits.

Figure 2: Overview of the attribution evaluation framework. We compare direct and …

Figure 3: Dataset construction pipeline. From a corpus of 500K quotes, we first filter for …

Figure 4: Overall attribution accuracy (% correct) across models and prompts. …

Figure 5: Subgroup-level quote attribution accuracy (% correct author) across models.

Figure 6: Omission suppression S_omit: probability of producing no author under indirect prompting without evidence. Cells show mean suppression (%), with color indicating deviation from the model mean (blue = lower, red = higher). Bold denotes the lowest-suppression subgroup per model; * denotes that it is statistically significantly lower than all other groups (p < .05). Across models, suppression is consistently lowest …

Figure 7: Evidence-conditioned suppression S_evid: probability of failing to produce the correct author under indirect prompting when the correct author is explicitly present in the input. Cells show mean suppression (%), with color indicating deviation from the model mean (blue = lower, red = higher). Bold marks the lowest-suppression subgroup per model; * denotes that it is statistically significantly lower than all other groups (p < .05).

Figure 8: Mean attribution accuracy by author fame, measured in Google Search hits (binned …)

Figure 9: Race and gender distribution of the original JSTET dataset, showing substantial …

Figure 10: Kernel density estimates of log-scaled Google Search hits, used as a proxy for author fame.

Figure 11: Overall attribution accuracy (% correct) across models and prompt types in the …

Figure 12: Subgroup-level quote attribution accuracy (% correct author) across models for …

Figure 13: Subgroup accuracy (% correct author) across models under indirect overt prompting …
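
Read together, the Figure 6 and Figure 7 captions pin down the two suppression metrics. Transcribed into notation (the symbols follow the captions; the conditional form is our paraphrase):

```latex
\begin{align*}
S_{\mathrm{omit}} &= \Pr\bigl[\text{no author produced} \mid \text{indirect prompt, no evidence in input}\bigr] \\
S_{\mathrm{evid}} &= \Pr\bigl[\text{correct author not produced} \mid \text{indirect prompt, correct author present in input}\bigr]
\end{align*}
```
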
read the original abstract

As Large Language Models (LLMs) are increasingly used to support search and information retrieval, it is critical that they accurately attribute content to its original authors. In this work, we introduce AttriBench, the first fame- and demographically-balanced quote attribution benchmark dataset. Through explicitly balancing author fame and demographics, AttriBench enables controlled investigation of demographic bias in quote attribution. Using this dataset, we evaluate 11 widely used LLMs across different prompt settings and find that quote attribution remains a challenging task even for frontier models. We observe large and systematic disparities in attribution accuracy between race, gender, and intersectional groups. We further introduce and investigate suppression, a distinct failure mode in which models omit attribution entirely, even when the model has access to authorship information. We find that suppression is widespread and unevenly distributed across demographic groups, revealing systematic biases not captured by standard accuracy metrics. Our results position quote attribution as a benchmark for representational fairness in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces AttriBench, a fame- and demographically-balanced quote attribution benchmark dataset, and uses it to evaluate 11 LLMs across prompt settings. It reports large systematic disparities in attribution accuracy across race, gender, and intersectional groups, and introduces 'suppression' (omission of attribution despite available information) as a distinct failure mode that is widespread and unevenly distributed.

Significance. The work supplies a new controlled benchmark for studying representational fairness in a practical LLM task (quote attribution for search/retrieval). The empirical scale (11 models, multiple prompt settings) and the separation of suppression from accuracy metrics are useful contributions; if the disparities survive controls for quote-intrinsic features, the results would strengthen evidence that current LLMs exhibit systematic demographic biases in content attribution.

major comments (1)
  1. [Dataset construction] The central claim that 'explicitly balancing author fame and demographics' isolates demographic effects on attribution accuracy rests on the assumption that quote-intrinsic covariates (length, syntactic complexity, topic, cultural specificity, domain) are also balanced across groups. No post-balancing statistics, matching tables, or covariate checks on these features are reported. If such imbalances exist, the headline disparities could be artifacts of quote selection rather than model bias.
minor comments (2)
  1. [Experimental setup] Experimental details on the exact prompt templates, temperature settings, and how 'different prompt settings' were varied should be moved from the appendix into the main text or a dedicated table for reproducibility.
  2. [Results] The results section would benefit from reporting statistical significance (e.g., p-values or confidence intervals) for the group-wise accuracy differences rather than relying solely on raw percentages; a minimal sketch of such a test follows this list.
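
A minimal sketch of the test that comment asks for, as a two-sided two-proportion z-test on a pair of subgroup accuracies; the pooled normal approximation is our assumption, and bootstrap confidence intervals would serve equally well:

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Two-sided z-test for a difference in attribution accuracy between
    two subgroups, given counts of correct answers out of n trials each."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```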

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comment raises an important point about potential confounds in the dataset construction, and we address it directly below. We are prepared to incorporate additional analyses in a revised version.

read point-by-point responses
  1. Referee: The central claim that 'explicitly balancing author fame and demographics' isolates demographic effects on attribution accuracy rests on the assumption that quote-intrinsic covariates (length, syntactic complexity, topic, cultural specificity, domain) are also balanced across groups. No post-balancing statistics, matching tables, or covariate checks on these features are reported. If such imbalances exist, the headline disparities could be artifacts of quote selection rather than model bias.

    Authors: We agree that the manuscript does not report post-balancing statistics or covariate checks for quote-intrinsic features such as length, syntactic complexity, topic, cultural specificity, or domain. Our balancing procedure was performed at the author level for fame and demographics, but we did not systematically verify or document balance on these quote-level covariates. This is a valid limitation of the current version. In the revised manuscript we will add matching tables, descriptive statistics, and balance checks (e.g., mean and distribution comparisons across demographic groups) for the listed covariates to allow readers to assess whether the observed attribution disparities are attributable to demographic factors or to quote selection artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation with independent dataset construction and model testing

full rationale

The paper introduces AttriBench as a new fame- and demographically-balanced dataset and reports direct empirical evaluations of 11 LLMs on quote attribution accuracy and suppression rates across demographic groups. No mathematical derivations, equations, fitted parameters, or self-referential definitions are present that would reduce the observed disparities to the inputs by construction. Claims rest on external model outputs evaluated against the constructed test set rather than any self-citation chain or renaming of prior results, rendering the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work is empirical and introduces a new dataset plus an observational category; it relies on standard assumptions about LLM prompting and the validity of demographic balancing for bias measurement.

axioms (2)
  • domain assumption · Quote attribution accuracy can be reliably measured through prompted LLM responses on a curated dataset.
    Underpins the entire evaluation of 11 models across prompt settings.
  • ad hoc to paper · Explicit balancing of author fame and demographics isolates bias effects from confounding variables.
    Central premise enabling the claim of controlled investigation of demographic bias.
invented entities (1)
  • suppression · no independent evidence
    purpose: A distinct failure mode where models omit attribution entirely despite access to authorship information.
    Newly defined based on model behavior observations to capture a bias not reflected in accuracy metrics.

pith-pipeline@v0.9.0 · 5459 in / 1395 out tokens · 64421 ms · 2026-05-10T18:38:50.652889+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
