pith. sign in

arxiv: 2606.24162 · v1 · pith:3VRZX6QLnew · submitted 2026-06-23 · 💻 cs.CL · cs.LG

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

Pith reviewed 2026-06-26 00:36 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords BehaviorBenchfoundation modelsbehavioral sciencedistributional alignmentBe.FMbehavior predictionsimulationpsychology
0
0 comments X

The pith

Fine-tuned behavioral models achieve stronger population-level alignment than general foundation models across behavioral science tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BehaviorBench evaluates foundation models on four capabilities: behavior prediction and simulation, strategic decision-making, subject-trait inference, and behavioral knowledge application. It measures performance at both individual accuracy and population distributional alignment, showing proprietary general-purpose models lead on individual predictions and knowledge tasks while behavioral models fine-tuned on behavioral data excel at matching group-level patterns. The paper introduces Be.FM-1.5, which leads distributional metrics and stays competitive individually. This matters because behavioral validity in psychology, sociology, and economics requires models to reproduce not just single responses but how entire populations behave.

Core claim

BehaviorBench demonstrates a clear performance gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models fine-tuned on behavioral data achieve substantially stronger distributional alignment. Be.FM-1.5 leads on distributional metrics while remaining competitive on individual-level metrics, indicating that targeted behavioral adaptation can close much of the gap across diverse tasks and populations.

What carries the argument

BehaviorBench benchmark that evaluates outputs at both individual and distributional levels across behavior prediction, strategic decision-making, trait inference, and knowledge application.

If this is right

  • Behavioral fine-tuning produces models that better simulate population responses in surveys and experiments.
  • Distributional evaluation becomes necessary alongside individual accuracy for assessing behavioral models.
  • Be.FM-1.5 provides a competitive base model for multiple behavioral science applications.
  • Adaptation on behavioral data can reduce reliance on proprietary general models for group-level studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether these distributional gains transfer to unbenchmarked domains like policy impact forecasting.
  • The dual evaluation approach might apply to other domains requiring both individual and collective accuracy, such as opinion dynamics modeling.
  • If distributional alignment proves predictive of real validity, it could guide data collection priorities for behavioral AI training.
  • Models optimized this way might enable more reliable agent-based simulations of economic or social systems.

Load-bearing premise

The selected tasks and distributional alignment metrics represent the core requirements for behavioral validity in science.

What would settle it

A new behavioral task or population where models scoring high on BehaviorBench distributional metrics fail to match observed real-world group behaviors.

Figures

Figures reproduced from arXiv: 2606.24162 by Jin Huang, Matthew O. Jackson, Qiaozhu Mei, Walter Yuan, Wanli Song, Xingjian Zhang, Yutong Xie.

Figure 1
Figure 1. Figure 1: Aggregated evaluation results of foundation models on [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Multi-round behavior prediction accu￾racy on the Push/Pull game, which is an unseen context during Be.FM-1.5’s training. Generalizing to unseen subjects. BehaviorBench contains held-out sub￾jects in the training of Be.FM-1.5, and we can examine how fine-tuning enables generalization to these unseen subjects. Both Be.FM-1.5 variants improve over their respective backbone models across all four behavioral ca… view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of model outputs in single-round game behavior simulation (Part 1). [PITH_FULL_IMAGE:figures/full_fig_p029_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of model outputs in single-round game behavior simulation (Part 2). [PITH_FULL_IMAGE:figures/full_fig_p030_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of model outputs in multi-round game behavior prediction (Part 1). [PITH_FULL_IMAGE:figures/full_fig_p031_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of model outputs in multi-round game behavior prediction (Part 2). [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of model outputs in single-round game behavior prediction given observations [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of model outputs in single-round game behavior prediction given observations [PITH_FULL_IMAGE:figures/full_fig_p034_8.png] view at source ↗
read the original abstract

Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations. We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per-subject accuracy but also population-level alignment, an essential requirement for behavioral validity. Leveraging the tasks in BehaviorBench, we further develop Be.FM-1.5, extending the Be.FM family of behavioral foundation models fine-tuned on behavioral data. Our results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models, fine-tuned on behavioral data, achieve substantially stronger distributional alignment. Notably, Be.FM-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be.FM-1.5's potential for a broad range of behavioral science studies. Our BehaviorBench and Be.FM-1.5 models can be accessed via https://umich-foreseer.github.io/behaviorbench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BehaviorBench, a benchmark for foundation models on behavioral science tasks across four capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. It evaluates at both individual and distributional levels, develops Be.FM-1.5 (a behavioral foundation model fine-tuned on behavioral data), and reports that proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks while Be.FM-1.5 leads on distributional alignment and remains competitive individually.

Significance. If the tasks prove representative and the metrics valid, the work would be significant for establishing a standardized benchmark in behavioral science applications of AI and for demonstrating that behavioral fine-tuning can improve distributional alignment. The open release of BehaviorBench and the Be.FM-1.5 models is a clear strength supporting reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim that Be.FM-1.5 leads on distributional metrics (and that this constitutes an essential requirement for behavioral validity) cannot be assessed because the manuscript supplies no concrete task definitions, population sampling details, metric formulas, or statistical tests.
  2. [Methods (absent)] The manuscript provides no details on task construction or validation of distributional metrics (full text placeholder contains only the abstract), which is load-bearing for the reported performance gaps between proprietary models and Be.FM-1.5.
minor comments (1)
  1. [Abstract] Abstract: consider adding a sentence on the total number of tasks or models evaluated to convey scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback highlighting the need for explicit methodological transparency. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Be.FM-1.5 leads on distributional metrics (and that this constitutes an essential requirement for behavioral validity) cannot be assessed because the manuscript supplies no concrete task definitions, population sampling details, metric formulas, or statistical tests.

    Authors: We agree the abstract alone cannot support independent assessment of the claims. The full manuscript contains dedicated sections: task definitions and examples in Section 3, population sampling procedures in Section 3.1, metric formulas (including individual accuracy, distributional alignment via Wasserstein distance and KL divergence) in Section 4.2, and statistical tests (bootstrap confidence intervals and significance testing) in Section 5.3. We will revise the abstract to include concise summaries of the four task categories, the dual evaluation levels, and the key metrics. This addresses the concern while preserving the abstract's brevity. revision: partial

  2. Referee: [Methods (absent)] The manuscript provides no details on task construction or validation of distributional metrics (full text placeholder contains only the abstract), which is load-bearing for the reported performance gaps between proprietary models and Be.FM-1.5.

    Authors: The submitted manuscript includes a full Methods section (Section 3) describing task construction: tasks were drawn from established behavioral science datasets and experiments (e.g., survey items from psychology studies, game-theoretic scenarios from economics), with population sampling details (demographic stratification and sample sizes) and expert validation for behavioral fidelity. Distributional metric validation appears in Section 4.3, including checks against human population statistics and sensitivity analyses. We will expand these sections with additional pseudocode, explicit formulas, and a new appendix table summarizing each task's source, sampling, and metric computation to ensure the performance gaps are fully reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with no derivations or self-referential reductions

full rationale

The paper introduces BehaviorBench as an empirical evaluation framework across four capability categories and reports comparative results for proprietary models versus Be.FM-1.5 on individual-level versus distributional metrics. No equations, parameter-fitting procedures, uniqueness theorems, or derivation chains appear in the abstract or described content. The central claims rest on task definitions and observed performance gaps rather than any step that reduces by construction to the paper's own inputs or prior self-citations. This is a standard empirical benchmark study whose validity can be assessed externally via the released tasks and models; no load-bearing circularity is present.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that the four listed capabilities and the individual-plus-distributional evaluation together constitute a valid test of behavioral alignment; no free parameters or invented entities are described.

axioms (2)
  • domain assumption The four core capabilities (behavior prediction, strategic decision-making, subject-trait inference, behavioral knowledge application) cover the essential requirements for behavioral science tasks.
    Stated directly in the abstract as the basis for constructing BehaviorBench.
  • domain assumption Distributional alignment is an essential requirement for behavioral validity.
    Explicitly called out in the abstract as a crucial evaluation dimension.

pith-pipeline@v0.9.1-grok · 5839 in / 1404 out tokens · 21884 ms · 2026-06-26T00:36:06.723345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 18 canonical work pages · 6 internal anchors

  1. [1]

    Hudson and Ehsan Adeli and Russ B

    Rishi Bommasani and Drew A. Hudson and Ehsan Adeli and Russ B. Altman and Simran Arora and Sydney von Arx and Michael S. Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and Erik Brynjolfsson and Shyamal Buch and Dallas Card and Rodrigo Castellon and Niladri S. Chatterji and Annie S. Chen and Kathleen Creel and Jared Quincy Davis and D...

  2. [2]

    , title =

    Joon Sung Park and Joseph C. O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S. Bernstein , editor =. Generative Agents: Interactive Simulacra of Human Behavior , booktitle =. 2023 , url =. doi:10.1145/3586183.3606763 , timestamp =

  3. [3]

    Nature , volume=

    Scientific discovery in the age of artificial intelligence , author=. Nature , volume=. 2023 , publisher=

  4. [4]

    2014 , publisher=

    The bounds of reason: game theory and the unification of the behavioral sciences-revised edition , author=. 2014 , publisher=

  5. [5]

    AI Behavioral Science

    Matthew O. Jackson and Qiaozhu Mei and Stephanie W. Wang and Yutong Xie and Walter Yuan and Seth Benzell and Erik Brynjolfsson and Colin F. Camerer and James Evans and Brian Jabarian and Jon M. Kleinberg and Juanjuan Meng and Sendhil Mullainathan and Asuman Ozdaglar and Thomas Pfeiffer and Moshe Tennenholtz and Robb Willer and Diyi Yang and Teng Ye , titl...

  6. [6]

    Nature Reviews Psychology , volume=

    Using large language models in psychology , author=. Nature Reviews Psychology , volume=. 2023 , publisher=

  7. [7]

    Proceedings of the National Academy of Sciences , volume=

    AI emerges as the frontier in behavioral science , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  8. [8]

    Proceedings of the National Academy of Sciences , volume=

    Can generative AI improve social science? , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  9. [9]

    Political Analysis , volume=

    Out of one, many: Using language models to simulate human samples , author=. Political Analysis , volume=. 2023 , publisher=

  10. [10]

    Science , volume=

    AI and the transformation of social science research , author=. Science , volume=. 2023 , publisher=

  11. [11]

    preprint , year=

    Large language models can be used to scale the ideologies of politicians in a zero-shot learning setting , author=. preprint , year=

  12. [12]

    Large language models can rate news outlet credibility , journal =

    Kai. Large language models can rate news outlet credibility , journal =. 2023 , url =. doi:10.48550/ARXIV.2304.00228 , eprinttype =. 2304.00228 , timestamp =

  13. [13]

    Proceedings of the National Academy of Sciences , volume=

    ChatGPT outperforms crowd workers for text-annotation tasks , author=. Proceedings of the National Academy of Sciences , volume=. 2023 , publisher=

  14. [14]

    LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

    Joon Sung Park and Carolyn Q. Zou and Aaron Shaw and Benjamin Mako Hill and Carrie J. Cai and Meredith Ringel Morris and Robb Willer and Percy Liang and Michael S. Bernstein , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2411.10109 , eprinttype =. 2411.10109 , timestamp =

  15. [15]

    Preprint , year=

    Predicting results of social science experiments using large language models , author=. Preprint , year=

  16. [16]

    Royal Society Open Science , volume=

    Can large language models help predict results from a complex behavioural science study? , author=. Royal Society Open Science , volume=. 2024 , publisher=

  17. [17]

    Be.FM: Open Foundation Models for Human Behavior , journal =

    Yutong Xie and Zhuoheng Li and Xiyuan Wang and Yijun Pan and Qijia Liu and Xingzhi Cui and Kuang. Be.FM: Open Foundation Models for Human Behavior , journal =. 2025 , url =. doi:10.48550/ARXIV.2505.23058 , eprinttype =. 2505.23058 , timestamp =

  18. [18]

    arXiv preprint arXiv:2410.20268 , year=

    Centaur: a foundation model of human cognition , author=. arXiv preprint arXiv:2410.20268 , year=

  19. [19]

    Bernstein

    Akaash Kolluri and Shengguang Wu and Joon Sung Park and Michael S. Bernstein , editor =. Finetuning LLMs for Human Behavior Prediction in Social Science Experiments , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.1530 , timestamp =

  20. [20]

    Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions , booktitle =

    Joseph Suh and Erfan Jahanparast and Suhong Moon and Minwoo Kang and Serina Chang , editor =. Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions , booktitle =. 2025 , url =

  21. [21]

    SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models , booktitle =

    Jia Wang and Ziyu Zhao and Tingjuntao Ni and Zhongyu Wei , editor =. SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.1335 , timestamp =

  22. [22]

    CoRR , volume =

    Eilam Shapira and Omer Madmon and Itamar Reinman and Samuel Joseph Amouyal and Roi Reichart and Moshe Tennenholtz , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.05254 , eprinttype =. 2410.05254 , timestamp =

  23. [23]

    GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations , journal =

    Jinhao Duan and Renming Zhang and James Diffenderfer and Bhavya Kailkhura and Lichao Sun and Elias Stengel. GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations , journal =. 2024 , url =. doi:10.48550/ARXIV.2402.12348 , eprinttype =. 2402.12348 , timestamp =

  24. [24]

    Competing Large Language Models in Multi-Agent Gaming Environments , booktitle =

    Jen. Competing Large Language Models in Multi-Agent Gaming Environments , booktitle =. 2025 , url =

  25. [25]

    First Workshop on Social Simulation with LLMs , year=

    Distributional Alignment for Social Simulation with LLMs: A Prompt Mixture Modeling Approach , author=. First Workshop on Social Simulation with LLMs , year=

  26. [26]

    Proceedings of the National Academy of Sciences , volume=

    A Turing test of whether AI chatbots are behaviorally similar to humans , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  27. [27]

    MASSW: A new dataset and benchmark tasks for AI-assisted scientific workflows

    Xingjian Zhang and Yutong Xie and Jin Huang and Jinge Ma and Zhaoying Pan and Qijia Liu and Ziyang Xiong and Tolga Ergen and Dongsub Shim and Honglak Lee and Qiaozhu Mei , editor =. Findings of the Association for Computational Linguistics:. 2025 , url =. doi:10.18653/V1/2025.FINDINGS-NAACL.127 , timestamp =

  28. [28]

    Whose Opinions Do Language Models Reflect? , booktitle =

    Shibani Santurkar and Esin Durmus and Faisal Ladhak and Cinoo Lee and Percy Liang and Tatsunori Hashimoto , editor =. Whose Opinions Do Language Models Reflect? , booktitle =. 2023 , url =

  29. [29]

    Statistical methods in medical research , volume=

    Handling missing data in survey research , author=. Statistical methods in medical research , volume=. 1996 , publisher=

  30. [30]

    2019 , publisher=

    Statistical analysis with missing data , author=. 2019 , publisher=

  31. [31]

    CoRR , volume =

    Shangmin Guo and Haoran Bu and Haochuan Wang and Yi Ren and Dianbo Sui and Yuming Shang and Siting Lu , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2401.01735 , eprinttype =. 2401.01735 , timestamp =

  32. [32]

    The American economic review , volume=

    Unraveling in guessing games: An experimental study , author=. The American economic review , volume=. 1995 , publisher=

  33. [33]

    p-beauty contests

    Iterated dominance and iterated best response in experimental" p-beauty contests" , author=. The American Economic Review , volume=. 1998 , publisher=

  34. [34]

    Information Processing & Management , volume=

    Click-through rate prediction in online advertising: A literature review , author=. Information Processing & Management , volume=. 2022 , publisher=

  35. [35]

    Synerise Monad:

    Barbara Rychalska and Szymon Lukasik and Jacek Dabrowski , editor =. Synerise Monad:. Proceedings of the 46th International. 2023 , url =. doi:10.1145/3539618.3591851 , timestamp =

  36. [36]

    , author=

    The policy relevance of personality traits. , author=. American psychologist , volume=. 2019 , publisher=

  37. [37]

    BLEURT : Learning Robust Metrics for Text Generation

    Thibault Sellam and Dipanjan Das and Ankur P. Parikh , editor =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,. 2020 , url =. doi:10.18653/V1/2020.ACL-MAIN.704 , timestamp =

  38. [38]

    Qwen3 Technical Report

    Qwen Team , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.09388 , eprinttype =. 2505.09388 , timestamp =

  39. [39]

    The Llama 3 Herd of Models

    Llama Team , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2407.21783 , eprinttype =. 2407.21783 , timestamp =

  40. [40]

    Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen

    Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen. LoRA: Low-Rank Adaptation of Large Language Models , booktitle =. 2022 , url =

  41. [41]

    2024 , eprint=

    SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning , author=. 2024 , eprint=

  42. [42]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , journal =. 2025 , url =. doi:10.48550/ARXIV.2512.02556 , eprinttype =. 2512.02556 , timestamp =

  43. [43]

    2025 , url =

    Anthropic , title =. 2025 , url =

  44. [44]

    2026 , url =

    Anthropic , title =. 2026 , url =

  45. [45]

    2026 , url =

    OpenAI , title =. 2026 , url =

  46. [46]

    2025 , url =

    OpenAI , title =. 2025 , url =

  47. [47]

    2026 , url =

    Google , title =. 2026 , url =

  48. [48]

    Holistic Evaluation of Language Models

    Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Cosgrove and Christopher D. Manning and Christopher R. Holistic Evaluation of Language Models , journal =....

  49. [49]

    Proceedings of the 40th International Conference on Machine Learning , series=

    Using large language models to simulate multiple humans and replicate human subject studies , author=. Proceedings of the 40th International Conference on Machine Learning , series=. 2023 , organization=

  50. [50]

    Proceedings of the National Academy of Sciences , volume=

    Using large language models to categorize strategic situations and decipher motivations behind human behaviors , author=. Proceedings of the National Academy of Sciences , volume=. 2025 , publisher=

  51. [51]

    Political Analysis , volume=

    Synthetic replacements for human survey data? The perils of large language models , author=. Political Analysis , volume=. 2024 , publisher=

  52. [52]

    Proceedings of the National Academy of Sciences , volume=

    Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias , author=. Proceedings of the National Academy of Sciences , volume=. 2023 , publisher=

  53. [53]

    CoRR , volume =

    Hongtao Liu and Zhicheng Du and Zihe Wang and Weiran Shen , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2508.11944 , eprinttype =. 2508.11944 , timestamp =

  54. [54]

    Experimental economics , volume=

    Dictator games: A meta study , author=. Experimental economics , volume=. 2011 , publisher=

  55. [55]

    International journal of game theory , volume=

    Dictator game giving: Rules of fairness versus acts of kindness , author=. International journal of game theory , volume=. 1998 , publisher=

  56. [56]

    Econometrica , volume=

    Social image and the 50--50 norm: A theoretical and experimental analysis of audience effects , author=. Econometrica , volume=. 2009 , publisher=

  57. [57]

    Journal of Economic Psychology , volume=

    Minimal social cues in the dictator game , author=. Journal of Economic Psychology , volume=. 2009 , publisher=

  58. [58]

    Economic Theory , volume=

    Exploiting moral wiggle room: experiments demonstrating an illusory preference for fairness , author=. Economic Theory , volume=. 2007 , publisher=

  59. [59]

    Games and Economic behavior , volume=

    Preferences, property rights, and anonymity in bargaining games , author=. Games and Economic behavior , volume=. 1994 , publisher=

  60. [60]

    Journal of Economic Psychology , volume=

    Promoting helping behavior with framing in dictator games , author=. Journal of Economic Psychology , volume=. 2007 , publisher=

  61. [61]

    The Quarterly Journal of Economics , volume=

    Directed altruism and enforced reciprocity in social networks , author=. The Quarterly Journal of Economics , volume=. 2009 , publisher=

  62. [62]

    American Economic Journal: Microeconomics , volume=

    The 1/d law of giving , author=. American Economic Journal: Microeconomics , volume=. 2010 , publisher=

  63. [63]

    The economic journal , volume=

    Are women less selfish than men?: Evidence from dictator experiments , author=. The economic journal , volume=. 1998 , publisher=

  64. [64]

    Economic man

    “Economic man” in cross-cultural perspective: Behavioral experiments in 15 small-scale societies , author=. Behavioral and brain sciences , volume=. 2005 , publisher=