BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
Pith reviewed 2026-06-26 00:36 UTC · model grok-4.3
The pith
Fine-tuned behavioral models achieve stronger population-level alignment than general foundation models across behavioral science tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BehaviorBench demonstrates a clear performance gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models fine-tuned on behavioral data achieve substantially stronger distributional alignment. Be.FM-1.5 leads on distributional metrics while remaining competitive on individual-level metrics, indicating that targeted behavioral adaptation can close much of the gap across diverse tasks and populations.
What carries the argument
BehaviorBench benchmark that evaluates outputs at both individual and distributional levels across behavior prediction, strategic decision-making, trait inference, and knowledge application.
If this is right
- Behavioral fine-tuning produces models that better simulate population responses in surveys and experiments.
- Distributional evaluation becomes necessary alongside individual accuracy for assessing behavioral models.
- Be.FM-1.5 provides a competitive base model for multiple behavioral science applications.
- Adaptation on behavioral data can reduce reliance on proprietary general models for group-level studies.
Where Pith is reading between the lines
- Future work could test whether these distributional gains transfer to unbenchmarked domains like policy impact forecasting.
- The dual evaluation approach might apply to other domains requiring both individual and collective accuracy, such as opinion dynamics modeling.
- If distributional alignment proves predictive of real validity, it could guide data collection priorities for behavioral AI training.
- Models optimized this way might enable more reliable agent-based simulations of economic or social systems.
Load-bearing premise
The selected tasks and distributional alignment metrics represent the core requirements for behavioral validity in science.
What would settle it
A new behavioral task or population where models scoring high on BehaviorBench distributional metrics fail to match observed real-world group behaviors.
Figures
read the original abstract
Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations. We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per-subject accuracy but also population-level alignment, an essential requirement for behavioral validity. Leveraging the tasks in BehaviorBench, we further develop Be.FM-1.5, extending the Be.FM family of behavioral foundation models fine-tuned on behavioral data. Our results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models, fine-tuned on behavioral data, achieve substantially stronger distributional alignment. Notably, Be.FM-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be.FM-1.5's potential for a broad range of behavioral science studies. Our BehaviorBench and Be.FM-1.5 models can be accessed via https://umich-foreseer.github.io/behaviorbench/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BehaviorBench, a benchmark for foundation models on behavioral science tasks across four capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. It evaluates at both individual and distributional levels, develops Be.FM-1.5 (a behavioral foundation model fine-tuned on behavioral data), and reports that proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks while Be.FM-1.5 leads on distributional alignment and remains competitive individually.
Significance. If the tasks prove representative and the metrics valid, the work would be significant for establishing a standardized benchmark in behavioral science applications of AI and for demonstrating that behavioral fine-tuning can improve distributional alignment. The open release of BehaviorBench and the Be.FM-1.5 models is a clear strength supporting reproducibility.
major comments (2)
- [Abstract] Abstract: the central claim that Be.FM-1.5 leads on distributional metrics (and that this constitutes an essential requirement for behavioral validity) cannot be assessed because the manuscript supplies no concrete task definitions, population sampling details, metric formulas, or statistical tests.
- [Methods (absent)] The manuscript provides no details on task construction or validation of distributional metrics (full text placeholder contains only the abstract), which is load-bearing for the reported performance gaps between proprietary models and Be.FM-1.5.
minor comments (1)
- [Abstract] Abstract: consider adding a sentence on the total number of tasks or models evaluated to convey scale.
Simulated Author's Rebuttal
We thank the referee for their detailed feedback highlighting the need for explicit methodological transparency. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that Be.FM-1.5 leads on distributional metrics (and that this constitutes an essential requirement for behavioral validity) cannot be assessed because the manuscript supplies no concrete task definitions, population sampling details, metric formulas, or statistical tests.
Authors: We agree the abstract alone cannot support independent assessment of the claims. The full manuscript contains dedicated sections: task definitions and examples in Section 3, population sampling procedures in Section 3.1, metric formulas (including individual accuracy, distributional alignment via Wasserstein distance and KL divergence) in Section 4.2, and statistical tests (bootstrap confidence intervals and significance testing) in Section 5.3. We will revise the abstract to include concise summaries of the four task categories, the dual evaluation levels, and the key metrics. This addresses the concern while preserving the abstract's brevity. revision: partial
-
Referee: [Methods (absent)] The manuscript provides no details on task construction or validation of distributional metrics (full text placeholder contains only the abstract), which is load-bearing for the reported performance gaps between proprietary models and Be.FM-1.5.
Authors: The submitted manuscript includes a full Methods section (Section 3) describing task construction: tasks were drawn from established behavioral science datasets and experiments (e.g., survey items from psychology studies, game-theoretic scenarios from economics), with population sampling details (demographic stratification and sample sizes) and expert validation for behavioral fidelity. Distributional metric validation appears in Section 4.3, including checks against human population statistics and sensitivity analyses. We will expand these sections with additional pseudocode, explicit formulas, and a new appendix table summarizing each task's source, sampling, and metric computation to ensure the performance gaps are fully reproducible. revision: yes
Circularity Check
No circularity: empirical benchmark paper with no derivations or self-referential reductions
full rationale
The paper introduces BehaviorBench as an empirical evaluation framework across four capability categories and reports comparative results for proprietary models versus Be.FM-1.5 on individual-level versus distributional metrics. No equations, parameter-fitting procedures, uniqueness theorems, or derivation chains appear in the abstract or described content. The central claims rest on task definitions and observed performance gaps rather than any step that reduces by construction to the paper's own inputs or prior self-citations. This is a standard empirical benchmark study whose validity can be assessed externally via the released tasks and models; no load-bearing circularity is present.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The four core capabilities (behavior prediction, strategic decision-making, subject-trait inference, behavioral knowledge application) cover the essential requirements for behavioral science tasks.
- domain assumption Distributional alignment is an essential requirement for behavioral validity.
Reference graph
Works this paper leans on
-
[1]
Hudson and Ehsan Adeli and Russ B
Rishi Bommasani and Drew A. Hudson and Ehsan Adeli and Russ B. Altman and Simran Arora and Sydney von Arx and Michael S. Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and Erik Brynjolfsson and Shyamal Buch and Dallas Card and Rodrigo Castellon and Niladri S. Chatterji and Annie S. Chen and Kathleen Creel and Jared Quincy Davis and D...
Pith/arXiv arXiv 2021
-
[2]
Joon Sung Park and Joseph C. O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S. Bernstein , editor =. Generative Agents: Interactive Simulacra of Human Behavior , booktitle =. 2023 , url =. doi:10.1145/3586183.3606763 , timestamp =
-
[3]
Nature , volume=
Scientific discovery in the age of artificial intelligence , author=. Nature , volume=. 2023 , publisher=
2023
-
[4]
2014 , publisher=
The bounds of reason: game theory and the unification of the behavioral sciences-revised edition , author=. 2014 , publisher=
2014
-
[5]
Matthew O. Jackson and Qiaozhu Mei and Stephanie W. Wang and Yutong Xie and Walter Yuan and Seth Benzell and Erik Brynjolfsson and Colin F. Camerer and James Evans and Brian Jabarian and Jon M. Kleinberg and Juanjuan Meng and Sendhil Mullainathan and Asuman Ozdaglar and Thomas Pfeiffer and Moshe Tennenholtz and Robb Willer and Diyi Yang and Teng Ye , titl...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.13323 2025
-
[6]
Nature Reviews Psychology , volume=
Using large language models in psychology , author=. Nature Reviews Psychology , volume=. 2023 , publisher=
2023
-
[7]
Proceedings of the National Academy of Sciences , volume=
AI emerges as the frontier in behavioral science , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=
2024
-
[8]
Proceedings of the National Academy of Sciences , volume=
Can generative AI improve social science? , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=
2024
-
[9]
Political Analysis , volume=
Out of one, many: Using language models to simulate human samples , author=. Political Analysis , volume=. 2023 , publisher=
2023
-
[10]
Science , volume=
AI and the transformation of social science research , author=. Science , volume=. 2023 , publisher=
2023
-
[11]
preprint , year=
Large language models can be used to scale the ideologies of politicians in a zero-shot learning setting , author=. preprint , year=
-
[12]
Large language models can rate news outlet credibility , journal =
Kai. Large language models can rate news outlet credibility , journal =. 2023 , url =. doi:10.48550/ARXIV.2304.00228 , eprinttype =. 2304.00228 , timestamp =
-
[13]
Proceedings of the National Academy of Sciences , volume=
ChatGPT outperforms crowd workers for text-annotation tasks , author=. Proceedings of the National Academy of Sciences , volume=. 2023 , publisher=
2023
-
[14]
LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals
Joon Sung Park and Carolyn Q. Zou and Aaron Shaw and Benjamin Mako Hill and Carrie J. Cai and Meredith Ringel Morris and Robb Willer and Percy Liang and Michael S. Bernstein , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2411.10109 , eprinttype =. 2411.10109 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2411.10109 2024
-
[15]
Preprint , year=
Predicting results of social science experiments using large language models , author=. Preprint , year=
-
[16]
Royal Society Open Science , volume=
Can large language models help predict results from a complex behavioural science study? , author=. Royal Society Open Science , volume=. 2024 , publisher=
2024
-
[17]
Be.FM: Open Foundation Models for Human Behavior , journal =
Yutong Xie and Zhuoheng Li and Xiyuan Wang and Yijun Pan and Qijia Liu and Xingzhi Cui and Kuang. Be.FM: Open Foundation Models for Human Behavior , journal =. 2025 , url =. doi:10.48550/ARXIV.2505.23058 , eprinttype =. 2505.23058 , timestamp =
-
[18]
arXiv preprint arXiv:2410.20268 , year=
Centaur: a foundation model of human cognition , author=. arXiv preprint arXiv:2410.20268 , year=
-
[19]
Akaash Kolluri and Shengguang Wu and Joon Sung Park and Michael S. Bernstein , editor =. Finetuning LLMs for Human Behavior Prediction in Social Science Experiments , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.1530 , timestamp =
-
[20]
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions , booktitle =
Joseph Suh and Erfan Jahanparast and Suhong Moon and Minwoo Kang and Serina Chang , editor =. Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions , booktitle =. 2025 , url =
2025
-
[21]
SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models , booktitle =
Jia Wang and Ziyu Zhao and Tingjuntao Ni and Zhongyu Wei , editor =. SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.1335 , timestamp =
-
[22]
Eilam Shapira and Omer Madmon and Itamar Reinman and Samuel Joseph Amouyal and Roi Reichart and Moshe Tennenholtz , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.05254 , eprinttype =. 2410.05254 , timestamp =
-
[23]
Jinhao Duan and Renming Zhang and James Diffenderfer and Bhavya Kailkhura and Lichao Sun and Elias Stengel. GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations , journal =. 2024 , url =. doi:10.48550/ARXIV.2402.12348 , eprinttype =. 2402.12348 , timestamp =
-
[24]
Competing Large Language Models in Multi-Agent Gaming Environments , booktitle =
Jen. Competing Large Language Models in Multi-Agent Gaming Environments , booktitle =. 2025 , url =
2025
-
[25]
First Workshop on Social Simulation with LLMs , year=
Distributional Alignment for Social Simulation with LLMs: A Prompt Mixture Modeling Approach , author=. First Workshop on Social Simulation with LLMs , year=
-
[26]
Proceedings of the National Academy of Sciences , volume=
A Turing test of whether AI chatbots are behaviorally similar to humans , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=
2024
-
[27]
MASSW: A new dataset and benchmark tasks for AI-assisted scientific workflows
Xingjian Zhang and Yutong Xie and Jin Huang and Jinge Ma and Zhaoying Pan and Qijia Liu and Ziyang Xiong and Tolga Ergen and Dongsub Shim and Honglak Lee and Qiaozhu Mei , editor =. Findings of the Association for Computational Linguistics:. 2025 , url =. doi:10.18653/V1/2025.FINDINGS-NAACL.127 , timestamp =
-
[28]
Whose Opinions Do Language Models Reflect? , booktitle =
Shibani Santurkar and Esin Durmus and Faisal Ladhak and Cinoo Lee and Percy Liang and Tatsunori Hashimoto , editor =. Whose Opinions Do Language Models Reflect? , booktitle =. 2023 , url =
2023
-
[29]
Statistical methods in medical research , volume=
Handling missing data in survey research , author=. Statistical methods in medical research , volume=. 1996 , publisher=
1996
-
[30]
2019 , publisher=
Statistical analysis with missing data , author=. 2019 , publisher=
2019
-
[31]
Shangmin Guo and Haoran Bu and Haochuan Wang and Yi Ren and Dianbo Sui and Yuming Shang and Siting Lu , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2401.01735 , eprinttype =. 2401.01735 , timestamp =
-
[32]
The American economic review , volume=
Unraveling in guessing games: An experimental study , author=. The American economic review , volume=. 1995 , publisher=
1995
-
[33]
p-beauty contests
Iterated dominance and iterated best response in experimental" p-beauty contests" , author=. The American Economic Review , volume=. 1998 , publisher=
1998
-
[34]
Information Processing & Management , volume=
Click-through rate prediction in online advertising: A literature review , author=. Information Processing & Management , volume=. 2022 , publisher=
2022
-
[35]
Barbara Rychalska and Szymon Lukasik and Jacek Dabrowski , editor =. Synerise Monad:. Proceedings of the 46th International. 2023 , url =. doi:10.1145/3539618.3591851 , timestamp =
-
[36]
, author=
The policy relevance of personality traits. , author=. American psychologist , volume=. 2019 , publisher=
2019
-
[37]
BLEURT : Learning Robust Metrics for Text Generation
Thibault Sellam and Dipanjan Das and Ankur P. Parikh , editor =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,. 2020 , url =. doi:10.18653/V1/2020.ACL-MAIN.704 , timestamp =
-
[38]
Qwen Team , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.09388 , eprinttype =. 2505.09388 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
-
[39]
Llama Team , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2407.21783 , eprinttype =. 2407.21783 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
-
[40]
Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen
Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen. LoRA: Low-Rank Adaptation of Large Language Models , booktitle =. 2022 , url =
2022
-
[41]
2024 , eprint=
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning , author=. 2024 , eprint=
2024
-
[42]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , journal =. 2025 , url =. doi:10.48550/ARXIV.2512.02556 , eprinttype =. 2512.02556 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556 2025
-
[43]
2025 , url =
Anthropic , title =. 2025 , url =
2025
-
[44]
2026 , url =
Anthropic , title =. 2026 , url =
2026
-
[45]
2026 , url =
OpenAI , title =. 2026 , url =
2026
-
[46]
2025 , url =
OpenAI , title =. 2025 , url =
2025
-
[47]
2026 , url =
Google , title =. 2026 , url =
2026
-
[48]
Holistic Evaluation of Language Models
Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Cosgrove and Christopher D. Manning and Christopher R. Holistic Evaluation of Language Models , journal =....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09110 2022
-
[49]
Proceedings of the 40th International Conference on Machine Learning , series=
Using large language models to simulate multiple humans and replicate human subject studies , author=. Proceedings of the 40th International Conference on Machine Learning , series=. 2023 , organization=
2023
-
[50]
Proceedings of the National Academy of Sciences , volume=
Using large language models to categorize strategic situations and decipher motivations behind human behaviors , author=. Proceedings of the National Academy of Sciences , volume=. 2025 , publisher=
2025
-
[51]
Political Analysis , volume=
Synthetic replacements for human survey data? The perils of large language models , author=. Political Analysis , volume=. 2024 , publisher=
2024
-
[52]
Proceedings of the National Academy of Sciences , volume=
Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias , author=. Proceedings of the National Academy of Sciences , volume=. 2023 , publisher=
2023
-
[53]
Hongtao Liu and Zhicheng Du and Zihe Wang and Weiran Shen , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2508.11944 , eprinttype =. 2508.11944 , timestamp =
-
[54]
Experimental economics , volume=
Dictator games: A meta study , author=. Experimental economics , volume=. 2011 , publisher=
2011
-
[55]
International journal of game theory , volume=
Dictator game giving: Rules of fairness versus acts of kindness , author=. International journal of game theory , volume=. 1998 , publisher=
1998
-
[56]
Econometrica , volume=
Social image and the 50--50 norm: A theoretical and experimental analysis of audience effects , author=. Econometrica , volume=. 2009 , publisher=
2009
-
[57]
Journal of Economic Psychology , volume=
Minimal social cues in the dictator game , author=. Journal of Economic Psychology , volume=. 2009 , publisher=
2009
-
[58]
Economic Theory , volume=
Exploiting moral wiggle room: experiments demonstrating an illusory preference for fairness , author=. Economic Theory , volume=. 2007 , publisher=
2007
-
[59]
Games and Economic behavior , volume=
Preferences, property rights, and anonymity in bargaining games , author=. Games and Economic behavior , volume=. 1994 , publisher=
1994
-
[60]
Journal of Economic Psychology , volume=
Promoting helping behavior with framing in dictator games , author=. Journal of Economic Psychology , volume=. 2007 , publisher=
2007
-
[61]
The Quarterly Journal of Economics , volume=
Directed altruism and enforced reciprocity in social networks , author=. The Quarterly Journal of Economics , volume=. 2009 , publisher=
2009
-
[62]
American Economic Journal: Microeconomics , volume=
The 1/d law of giving , author=. American Economic Journal: Microeconomics , volume=. 2010 , publisher=
2010
-
[63]
The economic journal , volume=
Are women less selfish than men?: Evidence from dictator experiments , author=. The economic journal , volume=. 1998 , publisher=
1998
-
[64]
Economic man
“Economic man” in cross-cultural perspective: Behavioral experiments in 15 small-scale societies , author=. Behavioral and brain sciences , volume=. 2005 , publisher=
2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.