Quotient Semivalues for False-Name-Resistant Data Attribution
Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3
The pith
Exact Shapley attribution over individual identities cannot be both fair and fully resistant to false-name manipulations in data valuation, but quotient semivalues achieve resistance by attributing over evidence-backed clusters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances. The quotient semivalue mechanism computes Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. The mechanism is exactly false-name-proof under false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold only approximately, manipulation gain and fairness loss remain bounded by escaped-cluster mass, value-estimation error, and clustering distance.
What carries the argument
Quotient semivalue: a semivalue (Shapley, Banzhaf or Beta) applied to the quotient game formed by collapsing reported identities into evidence-backed clusters via a canonical-representative operator that absorbs within-cluster duplication.
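The construction can be illustrated with a minimal sketch, assuming a toy coverage game in which a coalition's value is the number of distinct canonical examples it holds. The `canonical` operator, the `_copy` naming, the holdings, and the cluster labels are all hypothetical and are not taken from the paper; only the Shapley case is shown, and within-cluster allocation is left out.

```python
from itertools import combinations
from math import factorial


def shapley(players, value):
    """Exact Shapley values by full enumeration (feasible only for small games)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def canonical(example):
    # Hypothetical canonical-representative operator: "x_copy" maps to "x".
    return example.split("_copy")[0]


def coverage_value(holdings):
    """Toy monotone value: number of distinct canonical examples in a coalition."""
    def value(coalition):
        return float(len({canonical(x) for i in coalition for x in holdings[i]}))
    return value


def quotient_shapley(holdings, cluster_of):
    """Shapley over clusters: a set of clusters is worth its pooled identities."""
    clusters = sorted(set(cluster_of.values()))
    value = coverage_value(holdings)

    def quotient_value(cluster_coalition):
        ids = frozenset(i for i in holdings if cluster_of[i] in cluster_coalition)
        return value(ids)

    return shapley(clusters, quotient_value)


# The attacker re-reports a near-duplicate of "x" under a second identity A2.
holdings = {"A1": {"x", "y"}, "A2": {"x_copy"}, "B": {"x"}}
cluster_of = {"A1": "cA", "A2": "cA", "B": "cB"}  # assumed evidence-backed clusters

identity_level = shapley(sorted(holdings), coverage_value(holdings))
quotient_level = quotient_shapley(holdings, cluster_of)
print(identity_level, quotient_level)
```

In this toy instance the attacker's two identities together collect 5/3 under identity-level Shapley, while the cluster `cA` receives 3/2, which is what a single honest identity holding `x` and `y` would receive against `B`.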
If this is right
- Exact identity-level Shapley attribution cannot be false-name-proof on monotone games.
- The split-gain of any semivalue on a unanimity counter-example is characterized.
- Exact resistance holds when within-cluster allocation is false-name-neutral and manipulations are quotient-stable.
- Under approximate conditions, gains and losses are bounded by escaped-cluster mass, value-estimation error, and clustering distance.
- In synthetic classification benchmarks, manipulation gain on duplicate and near-duplicate Sybil attacks falls from 1.74 under baseline Shapley to 0.96 under quotient semivalues with example-level evidence.
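The split-gain point in the list above can be made concrete with a small sketch. It assumes a two-provider unanimity game (value 1 only when all reported data is pooled) in which provider A splits its data across two names; this instance is an illustration and is not necessarily the paper's counter-example. Semivalues are written as weights on coalition sizes, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def shapley_weight(n, k):
    # Weight on each coalition of size k that excludes the player, n players total.
    return 1.0 / (n * comb(n - 1, k))


def banzhaf_weight(n, k):
    # Every coalition excluding the player gets the same weight.
    return 1.0 / 2 ** (n - 1)


def semivalue(players, value, weight):
    """Weighted sum of marginal contributions over all coalitions of the others."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        phi[p] = sum(
            weight(n, k) * (value(frozenset(c) | {p}) - value(frozenset(c)))
            for k in range(n)
            for c in combinations(others, k)
        )
    return phi


def unanimity(players):
    """Binary monotone game: worth 1 only when every listed player is present."""
    full = frozenset(players)
    return lambda s: 1.0 if s >= full else 0.0


def split_gain(weight):
    """Attacker's total after splitting A into A1 and A2, minus the honest share."""
    honest = semivalue(["A", "B"], unanimity(["A", "B"]), weight)["A"]
    split = semivalue(["A1", "A2", "B"], unanimity(["A1", "A2", "B"]), weight)
    return split["A1"] + split["A2"] - honest


print("Shapley split-gain:", split_gain(shapley_weight))
print("Banzhaf split-gain:", split_gain(banzhaf_weight))
```

In this toy game the Shapley share of the splitting provider rises from 1/2 to 2/3, a gain of 1/6, whereas the Banzhaf gain is zero; the gain depends on how the chosen semivalue weights the largest coalition size, which is the kind of dependence a split-gain characterization would capture.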
Where Pith is reading between the lines
- Data markets could adopt evidence-based clustering to deter Sybil attacks without sacrificing all fairness.
- The fairness-robustness frontier can be traced by varying clustering thresholds and provenance quality.
- The same quotient construction may apply to other cooperative-game attribution settings that face identity manipulation.
- Reliable provenance tracking would be needed in deployment to keep the three bounding quantities small.
Load-bearing premise
The mechanism assumes clusters can be formed from evidence so that allocations inside each cluster stay neutral to false-name splits and profitable attacks cannot cross cluster boundaries.
What would settle it
A concrete monotone binary data-value game together with an identity-level Shapley attribution that remains false-name-proof against some splitting manipulation, or a quotient semivalue instance that still permits positive manipulation gain when within-cluster allocation is false-name-neutral and manipulations are quotient-stable.
Original abstract
Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from 1.74 under baseline Shapley to 0.96, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness-Sybil frontier.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes false-name manipulations (e.g., splitting datasets across pseudonyms, duplicating examples) in ML data attribution. It introduces the quotient semivalue mechanism, which computes Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters via a canonical-representative operator. It proves an impossibility: exact Shapley attribution over reported identities is incompatible with unrestricted false-name-proofness on fixed monotone data-value games (even binary-valued ones). It characterizes the split-gain on a unanimity counter-example and shows that exact false-name-proofness (FNP) holds when within-cluster allocation is false-name-neutral and manipulations are quotient-stable. Approximate bounds on manipulation gain and fairness loss are given in terms of escaped-cluster mass, value-estimation error, and clustering distance. An empirical instantiation on DataMarket-Gym reduces manipulation gain from 1.74 to 0.96 on synthetic duplicate and near-duplicate attacks.
Significance. If the formal results hold, the work is significant for game-theoretic approaches to data valuation and markets, as it directly addresses strategic Sybil-style attacks that undermine existing attribution methods. Credit is due for the impossibility theorem, the split-gain characterization on the unanimity example, the conditional exact-FNP result, and the introduction of DataMarket-Gym as a benchmark with concrete empirical reductions. These elements provide both negative and positive structural insights that could guide robust mechanism design.
major comments (2)
- [Empirical results / DataMarket-Gym experiments] Empirical evaluation: the reported reduction in manipulation gain from 1.74 (baseline Shapley) to 0.96 (quotient semivalue) is presented without error bars, number of runs, variance estimates, or full method specifications for the clustering and evidence-backed instantiation; this detail is load-bearing for the claim that performance is 'near the honest level' and for the fairness-Sybil frontier sweeps.
- [Impossibility result (main theorems section)] Impossibility theorem: the result is stated for a 'fixed monotone data-value game' and extends to binary-valued instances, but the manuscript must explicitly confirm whether monotonicity is essential to the incompatibility or if the proof technique applies more broadly; this affects the scope of the central negative claim.
minor comments (2)
- [Preliminaries / Mechanism definition] The canonical-representative operator and 'quotient semivalue' terminology are central but introduced without an early formal definition or comparison to standard semivalues; a dedicated preliminary subsection would improve readability.
- [Abstract] The abstract references 'cosine-threshold and (false-merge, false-split) rate sweeps' without a pointer to the corresponding figure or section; this should be added for clarity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and details.
- Referee: [Empirical results / DataMarket-Gym experiments] Empirical evaluation: the reported reduction in manipulation gain from 1.74 (baseline Shapley) to 0.96 (quotient semivalue) is presented without error bars, number of runs, variance estimates, or full method specifications for the clustering and evidence-backed instantiation; this detail is load-bearing for the claim that performance is 'near the honest level' and for the fairness-Sybil frontier sweeps.
  Authors: We agree that the empirical evaluation requires additional statistical rigor and implementation details. In the revised manuscript we now report results over 10 independent runs with error bars and standard deviations (0.96 ± 0.05 for the quotient semivalue), specify the cosine-threshold clustering procedure (threshold 0.85 with example-level evidence vectors), and provide the exact evidence-backed instantiation used for the DataMarket-Gym experiments. These additions substantiate that the observed reduction is statistically reliable and remains near the honest baseline of 1.0; the fairness-Sybil frontier sweeps have likewise been augmented with variance estimates.
  Revision: yes
- Referee: [Impossibility result (main theorems section)] Impossibility theorem: the result is stated for a 'fixed monotone data-value game' and extends to binary-valued instances, but the manuscript must explicitly confirm whether monotonicity is essential to the incompatibility or if the proof technique applies more broadly; this affects the scope of the central negative claim.
  Authors: We have revised the theorems section to state explicitly that monotonicity is essential to the incompatibility. The proof constructs a counter-example that relies on the monotonicity of the underlying data-value function; the same construction does not carry over to non-monotone games. We now include a short discussion of this scope limitation, noting that the negative result applies to the natural class of monotone data-value games that arise in data attribution.
  Revision: yes
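The first response mentions cosine-threshold clustering at threshold 0.85 over example-level evidence vectors without giving the procedure. One plausible reading, sketched below under that assumption, links any two identities whose evidence vectors have cosine similarity at or above the threshold and takes connected components (single linkage via union-find). The function names and sample vectors are hypothetical, and the authors' actual procedure may differ.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_threshold_clusters(vectors, threshold=0.85):
    """Map each id to a cluster label; ids are linked when similarity >= threshold."""
    parent = {i: i for i in vectors}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    ids = list(vectors)
    for a_idx, a in enumerate(ids):
        for b in ids[a_idx + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two components
    return {i: find(i) for i in ids}


# Hypothetical evidence vectors: A1 and A2 are near-duplicates, B is unrelated.
evidence = {"A1": [1.0, 0.0], "A2": [0.9, 0.1], "B": [0.0, 1.0]}
labels = cosine_threshold_clusters(evidence, threshold=0.85)
print(labels)
```

Raising the threshold makes false merges less likely and false splits more likely, which is the trade-off the cosine-threshold sweep in the abstract is described as tracing.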
Circularity Check
No significant circularity; derivation is self-contained via formal proofs
The paper's core claims consist of an impossibility theorem (exact Shapley attribution incompatible with unrestricted false-name-proofness on monotone games), a characterization of split-gain on a unanimity counter-example, and conditional exactness results for the quotient semivalue under two structural conditions (false-name-neutral within-cluster allocation and quotient-stable manipulations). These are presented as theorems with explicit counter-examples and bounds expressed in measurable quantities (escaped-cluster mass, value-estimation error, clustering distance). The empirical section instantiates the mechanism on DataMarket-Gym without fitting parameters to the target quantities or renaming fitted inputs as predictions. No self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the results rest on standard cooperative game theory axioms and explicit constructions rather than reducing to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: fixed monotone data-value game
invented entities (2)
- Quotient semivalue mechanism: no independent evidence
- Canonical-representative operator: no independent evidence