arxiv: 2605.07663 · v1 · submitted 2026-05-08 · 💻 cs.GT · cs.CR· cs.LG

Recognition: no theorem link

Quotient Semivalues for False-Name-Resistant Data Attribution

Brittany I. Davidson, Florian A. D. Burnat

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3

classification 💻 cs.GT cs.CRcs.LG

keywords false-name manipulationdata attributionShapley valuessemivaluesSybil attacksmachine learningdata valuationquotient games

0 comments

The pith

Exact Shapley attribution over individual identities cannot be both fair and fully resistant to false-name manipulations in data valuation, but quotient semivalues achieve resistance by attributing over evidence-backed clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data attribution methods allocate credit in machine-learning pipelines but assume contributors report honestly. In practice users can split data across pseudonyms, duplicate examples, or create near-duplicates to inflate their share. The paper proves that no identity-level Shapley attribution on a fixed monotone value game can be both exactly fair and unrestrictedly false-name-proof, even for binary values. It constructs quotient semivalues that first collapse reported identities into evidence clusters via a canonical-representative operator and then apply standard semivalue formulas to those clusters. The resulting mechanism is exactly manipulation-proof when within-cluster allocations are false-name-neutral and attacks are quotient-stable; approximate versions bound the remaining gain and fairness loss by escaped-cluster mass, estimation error, and clustering distance.

Core claim

On a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances. The quotient semivalue mechanism computes Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. The mechanism is exactly false-name-proof under false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold only approximately, manipulation gain and fairness loss remain bounded by escaped-cluster mass, value-0

What carries the argument

Quotient semivalue: a semivalue (Shapley, Banzhaf or Beta) applied to the quotient game formed by collapsing reported identities into evidence-backed clusters via a canonical-representative operator that absorbs within-cluster duplication.

If this is right

Exact identity-level Shapley attribution cannot be false-name-proof on monotone games.
The split-gain of any semivalue on a unanimity counter-example is characterized.
Exact resistance holds precisely when within-cluster allocation is false-name-neutral and manipulations are quotient-stable.
Under approximate conditions, gains and losses are bounded by escaped-cluster mass, value-estimation error, and clustering distance.
In synthetic classification benchmarks, duplicate and near-duplicate attack gains fall from 1.74 under baseline Shapley to 0.96.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data markets could adopt evidence-based clustering to deter Sybil attacks without sacrificing all fairness.
The fairness-robustness frontier can be traced by varying clustering thresholds and provenance quality.
The same quotient construction may apply to other cooperative-game attribution settings that face identity manipulation.
Reliable provenance tracking would be needed in deployment to keep the three bounding quantities small.

Load-bearing premise

The mechanism assumes clusters can be formed from evidence so that allocations inside each cluster stay neutral to false-name splits and profitable attacks cannot cross cluster boundaries.

What would settle it

A concrete monotone binary data-value game together with an identity-level Shapley attribution that remains false-name-proof against some splitting manipulation, or a quotient semivalue instance that still permits positive manipulation gain when within-cluster allocation is false-name-neutral and manipulations are quotient-stable.

Figures

Figures reproduced from arXiv: 2605.07663 by Brittany I. Davidson, Florian A. D. Burnat.

**Figure 2.** Figure 2: Empirical Δ𝜃 proxy (mixed-component fraction) vs. empirical fairness loss (oracle 𝐿 1 error) across S4 and S5, near-duplicate Sybil attack. S7: Within-cluster allocation rule changes the false-merge channel. S5 assumes an equal-share within-cluster allocation. Two natural alternatives also satisfy Assumption 5.5: count-based allocation (split cluster value by per-submitted-ID example counts) and latent-s… view at source ↗

**Figure 3.** Figure 3: Manipulation gain on near-duplicate Sybil and pure Sybil [PITH_FULL_IMAGE:figures/full_fig_p028_3.png] view at source ↗

read the original abstract

Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proves Shapley attribution on raw identities can't resist false-name attacks on monotone games and offers quotient semivalues on clusters as a fix with exact conditions and approximate bounds.

read the letter

The main takeaway is that exact Shapley values over reported identities are incompatible with unrestricted false-name-proofness on fixed monotone data-value games, even binary ones, and the authors give a quotient semivalue construction that works on evidence-backed clusters instead of raw identities using a canonical-representative operator to handle duplications and near-duplicates. This directly targets a real vulnerability in ML data markets where providers can split or duplicate data to inflate payments. They characterize split-gain on unanimity games and state the mechanism is exactly false-name-proof when within-cluster allocation is neutral and manipulations are quotient-stable, with bounds on manipulation gain and fairness loss under imperfect provenance in terms of escaped-cluster mass, value error, and clustering distance. The synthetic runs in DataMarket-Gym show manipulation gain dropping from 1.74 to 0.96 on duplicate and near-duplicate attacks, which is a measurable improvement toward honest levels. The cosine-threshold and merge-split sweeps map the fairness-Sybil trade-off clearly. The formalization of false-name attacks in this setting is new and the impossibility result is a clean negative that rules out naive extensions of prior semivalues. The structural conditions and measurable bounds give practitioners something concrete to check. The work is mostly theoretical with synthetic validation, so the proofs carry the weight. One limitation is that everything stays on synthetic classification tasks, leaving open how well clustering holds up on real, noisy datasets where provenance is messier and near-duplicates harder to group. The monotone-game assumption fits many valuation settings but narrows the scope. The paper is aimed at mechanism designers and data-market researchers who need incentive-compatible attribution. Readers working on cooperative game theory applied to ML or Sybil resistance will get direct value from the constructions and bounds. It has enough new formal content and a timely application to deserve serious referee time, even if the experiments need expansion.

Referee Report

2 major / 2 minor

Summary. The paper formalizes false-name manipulations (e.g., splitting datasets across pseudonyms, duplicating examples) in ML data attribution. It introduces the quotient semivalue mechanism that computes Shapley/Banzhaf/Beta-style values over evidence-backed attribution clusters via a canonical-representative operator. It proves an impossibility: exact Shapley attribution over reported identities is incompatible with unrestricted false-name-proofness on fixed monotone data-value games (even binary-valued). It characterizes split-gain on a unanimity counter-example and shows exact FNP holds precisely when false-name-neutral within-cluster allocation and quotient-stable manipulations are satisfied. Approximate bounds on manipulation gain and fairness loss are given in terms of escaped-cluster mass, value-estimation error, and clustering distance. Empirical instantiation on DataMarket-Gym reduces manipulation gain from 1.74 to 0.96 on synthetic duplicate/near-duplicate attacks.

Significance. If the formal results hold, the work is significant for game-theoretic approaches to data valuation and markets, as it directly addresses strategic Sybil-style attacks that undermine existing attribution methods. Credit is due for the impossibility theorem, the split-gain characterization on the unanimity example, the conditional exact-FNP result, and the introduction of DataMarket-Gym as a benchmark with concrete empirical reductions. These elements provide both negative and positive structural insights that could guide robust mechanism design.

major comments (2)

[Empirical results / DataMarket-Gym experiments] Empirical evaluation: the reported reduction in manipulation gain from 1.74 (baseline Shapley) to 0.96 (quotient semivalue) is presented without error bars, number of runs, variance estimates, or full method specifications for the clustering and evidence-backed instantiation; this detail is load-bearing for the claim that performance is 'near the honest level' and for the fairness-Sybil frontier sweeps.
[Impossibility result (main theorems section)] Impossibility theorem: the result is stated for a 'fixed monotone data-value game' and extends to binary-valued instances, but the manuscript must explicitly confirm whether monotonicity is essential to the incompatibility or if the proof technique applies more broadly; this affects the scope of the central negative claim.

minor comments (2)

[Preliminaries / Mechanism definition] The canonical-representative operator and 'quotient semivalue' terminology are central but introduced without an early formal definition or comparison to standard semivalues; a dedicated preliminary subsection would improve readability.
[Abstract] The abstract references 'cosine-threshold and (false-merge, false-split) rate sweeps' without a pointer to the corresponding figure or section; this should be added for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and details.

read point-by-point responses

Referee: [Empirical results / DataMarket-Gym experiments] Empirical evaluation: the reported reduction in manipulation gain from 1.74 (baseline Shapley) to 0.96 (quotient semivalue) is presented without error bars, number of runs, variance estimates, or full method specifications for the clustering and evidence-backed instantiation; this detail is load-bearing for the claim that performance is 'near the honest level' and for the fairness-Sybil frontier sweeps.

Authors: We agree that the empirical evaluation requires additional statistical rigor and implementation details. In the revised manuscript we now report results over 10 independent runs with error bars and standard deviations (0.96 ± 0.05 for the quotient semivalue), specify the cosine-threshold clustering procedure (threshold 0.85 with example-level evidence vectors), and provide the exact evidence-backed instantiation used for the DataMarket-Gym experiments. These additions substantiate that the observed reduction is statistically reliable and remains near the honest baseline of 1.0; the fairness-Sybil frontier sweeps have likewise been augmented with variance estimates. revision: yes
Referee: [Impossibility result (main theorems section)] Impossibility theorem: the result is stated for a 'fixed monotone data-value game' and extends to binary-valued instances, but the manuscript must explicitly confirm whether monotonicity is essential to the incompatibility or if the proof technique applies more broadly; this affects the scope of the central negative claim.

Authors: We have revised the theorems section to state explicitly that monotonicity is essential to the incompatibility. The proof constructs a counter-example that relies on the monotonicity of the underlying data-value function; the same construction does not carry over to non-monotone games. We now include a short discussion of this scope limitation, noting that the negative result applies to the natural class of monotone data-value games that arise in data attribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via formal proofs

full rationale

The paper's core claims consist of an impossibility theorem (exact Shapley attribution incompatible with unrestricted false-name-proofness on monotone games), a characterization of split-gain on a unanimity counter-example, and conditional exactness results for the quotient semivalue under two structural conditions (false-name-neutral within-cluster allocation and quotient-stable manipulations). These are presented as theorems with explicit counter-examples and bounds expressed in measurable quantities (escaped-cluster mass, value-estimation error, clustering distance). The empirical section instantiates the mechanism on DataMarket-Gym without fitting parameters to the target quantities or renaming fitted inputs as predictions. No self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the results rest on standard cooperative game theory axioms and explicit constructions rather than reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract provides insufficient detail to enumerate specific free parameters; relies on standard domain assumptions from cooperative game theory and introduces new mechanism elements without independent evidence.

axioms (1)

domain assumption Fixed monotone data-value game
Invoked for the impossibility proof on binary-valued instances.

invented entities (2)

Quotient semivalue mechanism no independent evidence
purpose: Compute semivalues over evidence-backed attribution clusters with canonical representative to resist false-name attacks
Central new construction of the paper.
Canonical-representative operator no independent evidence
purpose: Absorb within-cluster duplication in attribution
Part of the main mechanism definition.

pith-pipeline@v0.9.0 · 5594 in / 1531 out tokens · 46119 ms · 2026-05-11T02:09:43.044513+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Proceedings of the 36th International Conference on Machine Learning (ICML) , series =

Amirata Ghorbani and James Zou , title =. Proceedings of the 36th International Conference on Machine Learning (ICML) , series =. 2019 , url =

work page 2019
[2]

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) , series =

Ruoxi Jia and David Dao and Boxin Wang and Frances Ann Hubis and Nick Hynes and Nezihe Merve G\"urel and Bo Li and Ce Zhang and Dawn Song and Costas Spanos , title =. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2019 , url =

work page 2019
[3]

Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =

Yongchan Kwon and James Zou , title =. Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2022 , url =

work page 2022
[4]

Wang and Ruoxi Jia , title =

Jiachen T. Wang and Ruoxi Jia , title =. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) , series =. 2023 , url =

work page 2023
[5]

Procaccia , title =

Tom Yan and Ariel D. Procaccia , title =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2021 , doi =

work page 2021
[6]

Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI) , pages =

Benedek Rozemberczki and Lauren Watson and P\'eter Bayer and Hao-Tsung Yang and Oliv\'er Kiss and Sebastian Nilsson and Rik Sarkar , title =. Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI) , pages =. 2022 , doi =

work page 2022
[7]

Lundberg and Su-In Lee , title =

Scott M. Lundberg and Su-In Lee , title =. Advances in Neural Information Processing Systems 30 (NeurIPS) , pages =. 2017 , url =

work page 2017
[8]

Games and Economic Behavior , volume =

Makoto Yokoo and Yuko Sakurai and Shigeo Matsubara , title =. Games and Economic Behavior , volume =. 2004 , doi =

work page 2004
[9]

AI Magazine , volume =

Vincent Conitzer and Makoto Yokoo , title =. AI Magazine , volume =. 2010 , doi =

work page 2010
[10]

Douceur , title =

John R. Douceur , title =. Peer-to-Peer Systems: First International Workshop (IPTPS) , series =. 2002 , doi =

work page 2002
[11]

Proceedings of the 2019 ACM Conference on Economics and Computation (EC) , pages =

Anish Agarwal and Munther Dahleh and Tuhin Sarkar , title =. Proceedings of the 2019 ACM Conference on Economics and Computation (EC) , pages =. 2019 , doi =

work page 2019
[12]

American Economic Journal: Microeconomics , volume =

Daron Acemoglu and Ali Makhdoumi and Azarakhsh Malekian and Asuman Ozdaglar , title =. American Economic Journal: Microeconomics , volume =. 2022 , doi =

work page 2022
[13]

Advances in Neural Information Processing Systems 34 (NeurIPS) , year =

Xinyi Xu and Zhaoxuan Wu and Chuan Sheng Foo and Bryan Kian Hsiang Low , title =. Advances in Neural Information Processing Systems 34 (NeurIPS) , year =

work page
[14]

Advances in Neural Information Processing Systems 38 (NeurIPS) , year =

Lee, Kiljae and Liu, Ziqi and Tang, Weijing and Zhang, Yuan , title =. Advances in Neural Information Processing Systems 38 (NeurIPS) , year =. 2505.19013 , archivePrefix =

work page arXiv
[15]

arXiv preprint , year =

Zheng, Shuyuan and Cai, Sudong and Xiao, Chuan and Cao, Yang and Qin, Jianbin and Yoshikawa, Masatoshi and Onizuka, Makoto , title =. arXiv preprint , year =. 2502.00494 , archivePrefix =

work page arXiv
[16]

Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) , year =

Chen, Keran and Clinton, Alex and Kandasamy, Kirthevasan , title =. Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) , year =. 2502.16052 , archivePrefix =

work page arXiv