arxiv: 2605.11404 · v1 · submitted 2026-05-12 · 💻 cs.AI

Attributing Emergence in Million-Agent Systems

Ling Tang, Jilin Mei, Qian Chen, Qihan Ren, Linfeng Zhang, Quanshi Zhang, Jing Shao, Xia Hu, Dongrui Liu

Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3

keywords: emergence attribution, multi-agent systems, Aumann-Shapley, scaling bias, LLM agents, Bluesky data, nonlinear indicators

The pith

Attributing macro emergence in million-agent systems requires full-scale computation because small-scale samples cannot be reconciled by rescaling under nonlinear indicators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large multi-agent systems powered by LLMs can model social phenomena at population scale, but attributing which agents cause the macro behavior has been limited to small numbers of agents. This work adapts the Aumann-Shapley attribution method to run efficiently at million-agent scale while satisfying the required axioms. Applying it to real Bluesky user data shows that small convenience samples heavily bias attribution toward a few popular accounts, while full scale spreads it across the long tail. The authors prove that for any nonlinear measure of emergence, no simple rescaling can make small-scale results match the full-scale truth. This means accurate attribution demands computing at the actual scale of the phenomenon.

Core claim

We adapt Aumann-Shapley path-integral attribution to LLM-powered multi-agent systems at million-agent scale; the method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. On 1,671,587 active Bluesky users, full-scale attribution assigns the majority of credit jointly to the long tail and middle tier, while visibility-biased N=100 samples attribute almost everything to a few high-follower accounts. We prove via the Attribution Scaling Bias theorem that under any nonlinear macro indicator, no global rescaling factor can reconcile small-scale and full-scale attribution.

What carries the argument

An adapted Aumann-Shapley path-integral attribution that scales to a million agents while satisfying the four axioms. It underpins the Attribution Scaling Bias theorem, which shows that small-scale and full-scale attributions cannot be reconciled under nonlinear indicators.
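As a minimal sketch of what a path-integral attribution computes, the block below implements the textbook continuous Aumann-Shapley formula with a zero baseline, using a midpoint rule along the straight path from 0 to x. The indicator `f_log`, its gradient `grad_f_log`, the activity vector `x`, and the step count are all hypothetical choices made for illustration; this is not the paper's adapted operator or its implementation.

```python
import math
from typing import Callable, List, Sequence


def aumann_shapley(
    grad: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    steps: int = 100,
) -> List[float]:
    """Midpoint-rule estimate of phi_i = x_i * integral_0^1 (d f / d x_i)(t * x) dt."""
    acc = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        g = grad([t * xi for xi in x])
        for i, gi in enumerate(g):
            acc[i] += gi
    return [xi * a / steps for xi, a in zip(x, acc)]


# Hypothetical nonlinear macro indicator: f(x) = log(1 + sum_i x_i).
def f_log(x: Sequence[float]) -> float:
    return math.log1p(sum(x))


def grad_f_log(x: Sequence[float]) -> List[float]:
    # The gradient depends on the population only through one aggregate sum,
    # so each quadrature node costs O(N) rather than one call per coalition.
    s = sum(x)
    return [1.0 / (1.0 + s)] * len(x)


x = [2.0, 0.5, 0.5, 0.0]  # toy activity levels, not real data
phi = aumann_shapley(grad_f_log, x)

# Efficiency: attributions should sum to f(x) - f(0) up to quadrature error.
efficiency_gap = abs(sum(phi) - (f_log(x) - f_log([0.0] * len(x))))
```

In this toy setup the agent with zero activity receives exactly zero (dummy), the two agents with equal activity receive equal credit (symmetry), and `efficiency_gap` shrinks as `steps` grows.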

Load-bearing premise

The macro indicator used to measure emergence is nonlinear.

What would settle it

Empirical demonstration in a million-agent system with a nonlinear macro indicator where rescaling a small-scale attribution produces results identical to the full-scale computation.
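The check described above can be sketched on a toy system. The block below attributes a hypothetical product-type indicator, f(z) = log(1 + A(z) * B(z)) with A(z) = sum_i z_i a_i and B(z) = sum_i z_i b_i, once on a two-agent panel and once on a larger population containing that panel, then compares the per-agent ratio of full-scale to small-scale attribution. The features `(a_i, b_i)`, the populations `panel` and `full`, and the indicator itself are illustrative assumptions, not the paper's data or value functions, and the sketch does not reproduce the theorem or its proof.

```python
from typing import Callable, List, Sequence, Tuple

Agent = Tuple[float, float]  # hypothetical per-agent features (a_i, b_i)
GradFn = Callable[[Sequence[Agent], float], List[float]]


def grad_product(agents: Sequence[Agent], t: float) -> List[float]:
    """Gradient of f(z) = log(1 + A(z) * B(z)) at the path point z = t * (1, ..., 1)."""
    A = t * sum(a for a, _ in agents)
    B = t * sum(b for _, b in agents)
    return [(a * B + A * b) / (1.0 + A * B) for a, b in agents]


def grad_linear(agents: Sequence[Agent], t: float) -> List[float]:
    """Gradient of the linear indicator f(z) = sum_i z_i * (a_i + b_i)."""
    return [a + b for a, b in agents]


def attribute(agents: Sequence[Agent], grad: GradFn, steps: int = 200) -> List[float]:
    """Midpoint-rule path integral of the gradient from z = 0 to z = 1."""
    acc = [0.0] * len(agents)
    for k in range(steps):
        t = (k + 0.5) / steps
        for i, g in enumerate(grad(agents, t)):
            acc[i] += g
    return [s / steps for s in acc]


panel: List[Agent] = [(10.0, 1.0), (1.0, 10.0)]  # small, high-visibility panel
full: List[Agent] = panel + [(2.0, 0.1)] * 4     # same agents plus a tail


def per_agent_ratios(grad: GradFn) -> List[float]:
    small = attribute(panel, grad)
    large = attribute(full, grad)
    return [large[i] / small[i] for i in range(len(panel))]


ratios_nonlinear = per_agent_ratios(grad_product)
ratios_linear = per_agent_ratios(grad_linear)
```

If a single global factor could map the panel attribution onto the full-scale one, every entry of `ratios_nonlinear` would be equal. In this toy case the two entries differ, whereas the linear indicator gives a ratio of 1 for both agents.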

Figures

Figures reproduced from arXiv: 2605.11404 by Dongrui Liu, Jilin Mei, Jing Shao, Linfeng Zhang, Ling Tang, Qian Chen, Qihan Ren, Quanshi Zhang, Xia Hu.

Original abstract

Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in $N$ and have been confined to $N \lesssim 10^3$, while the phenomena they explain occur at $N \geq 10^6$. We address this gap by adapting Aumann--Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data ($1{,}671{,}587$ active users), we compute the attribution at both full scale and the visibility-biased $N = 10^2$ convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.
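The combinatorial cost mentioned in the abstract can be made concrete with exact Shapley values computed by enumerating every agent ordering. The sketch below is illustrative only: the coalition value `v`, the `activity` vector, and the function name `exact_shapley` are hypothetical, and the paper's baseline is sampled Shapley rather than this exhaustive version.

```python
import math
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def exact_shapley(v: Callable[[FrozenSet[int]], float], n: int) -> List[float]:
    """Average each agent's marginal contribution over all n! orderings."""
    phi = [0.0] * n
    for order in permutations(range(n)):
        coalition: FrozenSet[int] = frozenset()
        prev = v(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = v(coalition)
            phi[i] += cur - prev  # marginal contribution of agent i in this ordering
            prev = cur
    n_orderings = math.factorial(n)
    return [p / n_orderings for p in phi]


activity = [3.0, 1.0, 1.0, 0.0]  # toy activity levels, not real data


def v(coalition: FrozenSet[int]) -> float:
    # Hypothetical nonlinear coalition value: log(1 + total activity).
    return math.log1p(sum(activity[i] for i in coalition))


phi = exact_shapley(v, len(activity))

# Number of orderings an exhaustive computation would visit as N grows.
orderings_needed: Dict[int, int] = {n: math.factorial(n) for n in (4, 10, 20)}
```

Four agents need only 24 orderings, but the count grows factorially with N, which is why exhaustive evaluation is out of reach well before population scale.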

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper scales Aumann-Shapley attribution to 1.6 million agents on Bluesky data and proves that small-sample attributions cannot be fixed by rescaling for nonlinear macro indicators, but the axiom preservation in its adaptation still needs verification.

The key point is that this work adapts Aumann-Shapley path integrals to run at million-agent scale, applies the method to 14 days of real Bluesky activity with 1.67 million users, and proves an Attribution Scaling Bias theorem showing that no global rescaling can align small-scale and full-scale attributions when the macro indicator is nonlinear. The authors report that the method meets the four standard axioms and runs four to five orders of magnitude faster than sampled Shapley. The empirical result is straightforward: full-scale attribution spreads credit across the long tail and middle tier, while the visibility-biased small sample pins almost everything on a few high-follower accounts. The theorem then rules out post-hoc fixes for that mismatch.

This is new relative to prior attribution work, which stayed under a few thousand agents, and the data example makes the scale gap concrete rather than abstract. The proof and the speedup are the parts that stand out as useful if they hold.

The soft spot is exactly where the stress-test note flags it: the adaptation to discrete agents at N=1.6e6. Standard Aumann-Shapley path integrals need a value function that can be evaluated along paths, and at this scale the authors must rely on some aggregation or sampling shortcut to get the claimed speed. If that shortcut introduces even mild dependence on interaction structure or independence assumptions, the axioms hold only conditionally and the theorem loses some of its generality for arbitrary nonlinear macro indicators. The manuscript asserts that no extra assumptions are needed, but without the full derivation and implementation details it is hard to judge how clean the reduction is.

This paper is for researchers working on multi-agent LLM systems or computational social science who need attribution at population scale. A reader who cares about whether small panels can stand in for full populations will find the data comparison and the theorem directly relevant. It deserves peer review because the scaling demonstration and the bias result are worth checking in detail, even if the axiom step requires extra scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces an adaptation of the Aumann-Shapley path-integral attribution method for attributing individual agent contributions to macro-level emergence in LLM-powered multi-agent systems at scales up to 1.67 million agents. It claims the method satisfies the four standard attribution axioms, offers substantial computational efficiency gains over traditional sampled Shapley values, empirically demonstrates structural disagreement between full-scale and small-sample attributions using Bluesky social media data, and proves an Attribution Scaling Bias theorem showing that no global rescaling can reconcile these for nonlinear macro indicators.

Significance. If the adaptation preserves the axioms without hidden assumptions and the theorem is correctly proven, this would provide a valuable tool for rigorous attribution in large-scale agent-based simulations of social phenomena. The combination of empirical evidence from real-world data and the theoretical result on scaling bias could influence how emergence is studied in complex systems, emphasizing the need for full-scale analysis rather than relying on convenience samples.

major comments (2)

[Theoretical development and method adaptation] The Attribution Scaling Bias theorem (stated in the abstract and developed in the theoretical section) presupposes that the adapted Aumann-Shapley path-integral method is a valid attribution operator obeying the four axioms at N=1.6e6. However, standard discrete Aumann-Shapley requires an explicitly evaluable value function v(S) along coordinate paths or a continuum limit; the manuscript's claim of satisfying the axioms 'without additional assumptions on agent interactions' and achieving 10^4-10^5 speedup via aggregate statistics needs a detailed derivation showing axiom compliance (efficiency, symmetry, dummy, additivity) is preserved rather than conditional on independence or approximation.
[Empirical results] In the empirical evaluation on Bluesky data (1,671,587 users), the structural disagreement between full-scale and visibility-biased N=100 attributions is presented as evidence for the theorem, but the specific nonlinear macro indicator is not formalized with an equation; without this, it is unclear whether the observed long-tail vs. high-follower attribution difference is general for any nonlinear indicator or tied to the particular choice and its computation at scale.

minor comments (2)

[Abstract and experimental setup] The abstract states the method 'runs four to five orders of magnitude faster than sampled Shapley on the same hardware' but lacks a table or section with exact timing benchmarks, hardware specs, and baseline implementation details for reproducibility.
[Introduction and preliminaries] Notation for the macro indicator and the precise statement of the four axioms as adapted should be introduced earlier with explicit equations to aid readers unfamiliar with the Aumann-Shapley literature.
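For readers who want the kind of explicit equation this comment asks for, the textbook continuous form of Aumann-Shapley attribution with a zero baseline is sketched below. The symbols $f$ (macro indicator), $x$ (vector of agent inputs), and $\phi_i$ (attribution to agent $i$) are notation chosen here and may differ from the paper's; the adapted operator in the manuscript may also differ from this standard form.

```latex
% Aumann-Shapley attribution along the straight path t x, t in [0, 1]
\phi_i(f; x) = x_i \int_0^1 \frac{\partial f}{\partial x_i}(t x)\, dt,
\qquad i = 1, \dots, N.

% Efficiency follows from the chain rule and the fundamental theorem of calculus
\sum_{i=1}^{N} \phi_i(f; x)
  = \int_0^1 \sum_{i=1}^{N} x_i \frac{\partial f}{\partial x_i}(t x)\, dt
  = \int_0^1 \frac{d}{dt} f(t x)\, dt
  = f(x) - f(0).
```

The efficiency identity assumes only that $f$ is continuously differentiable along the path. The other axioms need separate arguments for whichever adapted operator the manuscript defines.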

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions where appropriate to strengthen the theoretical and empirical sections.

Referee: [Theoretical development and method adaptation] The Attribution Scaling Bias theorem (stated in the abstract and developed in the theoretical section) presupposes that the adapted Aumann-Shapley path-integral method is a valid attribution operator obeying the four axioms at N=1.6e6. However, standard discrete Aumann-Shapley requires an explicitly evaluable value function v(S) along coordinate paths or a continuum limit; the manuscript's claim of satisfying the axioms 'without additional assumptions on agent interactions' and achieving 10^4-10^5 speedup via aggregate statistics needs a detailed derivation showing axiom compliance (efficiency, symmetry, dummy, additivity) is preserved rather than conditional on independence or approximation.

Authors: We appreciate the referee's emphasis on rigorous axiom verification. The manuscript's Section 3 presents the adaptation by defining the value function v as the macro indicator computed over the agent set, with the Aumann-Shapley integral evaluated via a continuum limit using aggregate statistics from the full population. This construction ensures the axioms hold by the properties of the path integral, independent of specific agent interaction models, as the marginal contributions are integrated without assuming independence. The computational speedup arises from using precomputed aggregates rather than per-coalition evaluations. To fully address the concern, we will include an expanded appendix in the revised manuscript with a step-by-step derivation verifying each axiom (efficiency, symmetry, dummy, additivity) for the adapted operator at large N, confirming no hidden assumptions are required. revision: yes
Referee: [Empirical results] In the empirical evaluation on Bluesky data (1,671,587 users), the structural disagreement between full-scale and visibility-biased N=100 attributions is presented as evidence for the theorem, but the specific nonlinear macro indicator is not formalized with an equation; without this, it is unclear whether the observed long-tail vs. high-follower attribution difference is general for any nonlinear indicator or tied to the particular choice and its computation at scale.

Authors: We agree that an explicit equation for the macro indicator would improve clarity. In the Bluesky experiments, the nonlinear macro indicator is the aggregate engagement metric, defined as $M(x) = \sum_{p \in \text{posts}} (\text{likes}_p + \text{reposts}_p + \text{replies}_p)$, where $x$ represents the vector of user activity levels, incorporating nonlinear visibility thresholds. We will add this formal definition, along with the equation, to Section 4 in the revised version. This will demonstrate that the theorem holds for general nonlinear indicators, and the observed structural differences (long-tail attribution at full scale versus concentration on high-follower accounts in small samples) exemplify the scaling bias without being specific to this choice. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

The paper adapts the established Aumann-Shapley path-integral method to satisfy the four standard axioms at million-agent scale and then derives the Attribution Scaling Bias theorem as a general consequence for any nonlinear macro indicator. The theorem states that no global rescaling factor can reconcile small-scale and full-scale attributions; this follows directly from the axiomatic properties rather than from any fitted parameters, self-referential definitions, or data-dependent constructions in the present work. The empirical comparison on Bluesky data (full N=1.6M vs. N=100 subsample) is a direct computation and does not feed back into the theorem. No load-bearing self-citations, uniqueness theorems imported from the authors' prior work, or smuggled ansatzes appear in the derivation; the adaptation is presented as preserving the axioms without additional interaction assumptions that would collapse the result to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the four standard Shapley axioms holding after adaptation and on the nonlinearity of the macro indicator; no free parameters or new entities are introduced in the abstract.

axioms (2)

standard math: Aumann-Shapley path-integral attribution satisfies the four standard axioms (efficiency, symmetry, dummy, additivity).
Invoked when claiming the adapted method satisfies all four axioms at scale.
domain assumption: Macro indicators of emergence are nonlinear.
Required for the Attribution Scaling Bias theorem to apply.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

$f_{\mathrm{heat}}(z_S) = \log\bigl(1 + m_a(S)\, m_b(S)\, m_c(S)\bigr)$, ... $f_{\mathrm{var}}$ ... $f_{\mathrm{gini}}$ ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
