
arxiv: 2607.02464 · v1 · pith:PK3OFZXEnew · submitted 2026-07-02 · 💻 cs.CL

Will Scaling Improve Social Simulation with LLMs?

Pith reviewed 2026-07-03 14:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · social simulation · scaling laws · opinion modeling · behavioral simulation · longitudinal forecasting

The pith

Scaling laws show LLM social simulations improve for most populations but stall on biases and low-resource groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the current scaling paradigm in language modeling will close the fidelity gap in LLM-based social simulations or if simulation quality is largely independent of general capabilities. It tests this by measuring simulation accuracy against human data across opinion modeling, behavioral simulation, and longitudinal forecasting, using controlled pre-training runs at different compute scales plus evaluations of larger open models. The results indicate that accuracy rises reliably with scale when populations and tasks align with English web text, yet gains slow or vanish for longitudinal forecasts, minority opinions, and calibration to human cognitive biases such as risk aversion. A sympathetic reader would care because this distinguishes which simulation applications can ride the existing scaling wave and which will require separate research attention.

Core claim

Strong compute scaling appears across the three sub-domains when tasks involve populations well-represented in training data, enabling downstream accuracy to be predicted from pre-training loss; however, scaling does not improve calibration to human cognitive biases or heuristics, and both longitudinal forecasting and underrepresented opinions improve more slowly and correlate less strongly with benchmarks such as MMLU.
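To make the idea of predicting downstream accuracy from loss concrete, here is a minimal sketch, assuming the sigmoidal loss-to-accuracy relationship that the figure captions describe. It fits a sigmoid through a logit transform so that the problem reduces to ordinary least squares. The function names, the chance-level floor of 0.25, and all data points are invented for illustration; the paper's own fitting procedure is not specified here and may differ.

```python
import math

def fit_loss_to_accuracy(losses, accuracies, floor=0.0, ceil=1.0):
    """Fit accuracy = floor + (ceil - floor) * sigmoid(c0 + c1 * loss).

    A logit transform turns the fit into ordinary least squares.
    Every accuracy must lie strictly between floor and ceil.
    """
    logits = []
    for acc in accuracies:
        p = (acc - floor) / (ceil - floor)  # rescale accuracy to (0, 1)
        logits.append(math.log(p / (1.0 - p)))
    n = len(losses)
    mean_x = sum(losses) / n
    mean_z = sum(logits) / n
    sxx = sum((x - mean_x) ** 2 for x in losses)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(losses, logits))
    c1 = sxz / sxx
    c0 = mean_z - c1 * mean_x
    return c0, c1

def predict_accuracy(c0, c1, loss, floor=0.0, ceil=1.0):
    """Map a task loss to a predicted accuracy with the fitted sigmoid."""
    return floor + (ceil - floor) / (1.0 + math.exp(-(c0 + c1 * loss)))

# Hypothetical (task loss, accuracy) pairs; not results from the paper.
losses = [2.0, 1.6, 1.2, 0.8]
accuracies = [0.30, 0.38, 0.52, 0.70]

c0, c1 = fit_loss_to_accuracy(losses, accuracies, floor=0.25)
forecast = predict_accuracy(c0, c1, 0.6, floor=0.25)  # extrapolate to a lower loss
```

The logit trick keeps the example dependency-free; a nonlinear least-squares fit on the raw accuracies would weight the points differently and could be preferable when accuracies sit close to the floor or ceiling.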

What carries the argument

Scaling laws that map pre-training compute budgets (10^18 to 10^20 FLOPs) and model sizes up to 70B parameters to measured fidelity of social simulations against human ground truth in the three sub-domains.
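As a rough illustration of what a log-linear compute scaling law looks like in practice, the sketch below fits task loss as a linear function of log10 compute by ordinary least squares and then extrapolates one order of magnitude beyond the fitted range. The function names and the three data points are invented for illustration and are not taken from the paper.

```python
import math

def fit_log_linear(compute_flops, task_loss):
    """Least-squares fit of loss = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute_flops]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(task_loss) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, task_loss))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, task_loss))
    ss_tot = sum((y - mean_y) ** 2 for y in task_loss)
    r2 = 1.0 - ss_res / ss_tot
    return intercept, slope, r2

def predict_loss(intercept, slope, flops):
    """Predicted task loss at a given compute budget in FLOPs."""
    return intercept + slope * math.log10(flops)

# Hypothetical compute budgets and task losses; not results from the paper.
compute = [1e18, 1e19, 1e20]
loss = [1.4, 1.3, 1.2]

a, b, r2 = fit_log_linear(compute, loss)
extrapolated = predict_loss(a, b, 1e21)  # one order of magnitude past the data
```

A negative slope with a high r² corresponds to the reliable scaling the page describes; a slope that is statistically indistinguishable from zero corresponds to the stalled tasks.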

If this is right

  • The majority of behavioral and opinion simulation tasks will rapidly improve with scale, especially for populations well-represented in English web corpora.
  • Longitudinal forecasting and underrepresented opinions will scale more slowly and show weaker correlation with general knowledge benchmarks.
  • Model calibration with human cognitive biases such as risk aversion and with certain learning heuristics will not improve noticeably with scale, even after fine-tuning.
  • Improvements from scaling will be less reliable in low-resource domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • As models scale, simulations may increasingly over-represent majority views captured in web text while lagging on minority perspectives.
  • General capability gains measured by MMLU may not reliably transfer to modeling of human decision heuristics or biases.
  • Targeted data curation or architectural changes may be required to close the gaps that pure scaling leaves open.
  • Longer-horizon forecasting tasks may need explicit temporal modeling rather than relying on continued pre-training scale.

Load-bearing premise

The three chosen sub-domains and the specific tasks inside them are representative enough of real-world social simulation needs to support general conclusions about scaling.

What would settle it

Accuracy on behavioral simulation tasks that require matching human risk aversion stays flat or declines when model size increases from 8B to 70B on the same evaluation sets.

Figures

Figures reproduced from arXiv: 2607.02464 by Caleb Ziems, David Grusky, Diyi Yang, Su Doga Karaca, Tatsunori Hashimoto, William Held.

Figure 1
Compute Scaling Laws. We observe log-linear improvements in task loss on all three social science tasks after scaling compute alone with models trained on DCLM (Li et al., 2024) from $10^{18}$ to $10^{20}$ FLOPs. Specifically, we evaluate a suite of 85 decoder-only Transformers with the Qwen3 architecture (Yang et al., 2025) trained on the DCLM pre-training corpus (Li et al., 2024) with fixed compute budgets Cm ran…
Figure 2
Compute-optimal scaling and downstream forecasting. Left: Compute scaling laws for Hilbig & Moshagen (2014), a subtask of Psych-101. Right: We show this loss correlates with accuracy sigmoidally, so loss can serve as a proxy for downstream progress. … of the model variation on general tasks (Ilić & Gignac, 2024; Burnell et al., 2023). This common capability space allows direct comparison across families in …
Figure 3
Observational Scaling Laws for WVS. Top: The five countries with the strongest correlation to general performance, $r^2 > 0.4$. Bottom: The five countries with the weakest correlation, $r^2 < 0.2$. 5.1 Results on WVS: Opinion Simulation. Correlating General and Social Performance. Opinion simulation is not uniformly aligned with general capabilities …
Figure 4
Pre-training biases predict observational scaling laws. We find a Spearman correlation of $\rho = 0.8$ and a Pearson correlation of $r = 0.7$ between observational fit and pre-training term frequency ($p < 0.05$), which supports our conclusion that distributional imbalances in LLM pre-training data explain observational scaling discrepancies. The scaling discrepancies above are consistent with distributional imbalance…
Figure 5
Observational Scaling Laws for Psych-101 on representative subtasks. We cluster tasks by their experimental hypothesis domain, including associative learning and cognitive biases, but we do not observe any strong relationship between domain and scaling behavior.
Figure 6
Observational Scaling and Downstream Forecasting for ACL. Left: Observational scaling law with task loss. Right: We show that loss linearly correlates with accuracy. … r = −0.34), and least strongly correlated with multi-step reasoning (MuSR r = −0.04) and programming abilities (HumanEval r = −0.10). Predicting Downstream Utility At Scale. The plot on the right side reveals a linear relationship between tas…
Figure 7
Parameter Scaling after Finetuning on Psych-101. We finetune Qwen2.5 and Llama3 at different scales on each task. Experiments highlighted in green are those in which the largest model's advantage over the smallest model is statistically significant (p < 0.05). Left: For tasks with weak compute scaling, we observe little or no evidence of parameter scaling. Right: For tasks with strong compute scaling, …
Figure 8
The Model's Negative Log Likelihood (NLL) on the Majority Answer Generally Predicts Accuracy Sigmoidally for WVS tasks ($r^2 > 0.4$ for 16 out of 17 tasks). A. Prompts, Task Design, and Evaluation Protocols. Scaling laws can be highly sensitive to the manner in which tasks are operationalized, prompts are formatted, and metrics are defined. With respect to metrics, raw log-probabilities generally scale smoothly…
Figure 9
The Model's Negative Log Likelihood (NLL) on the Correct Answer Generally Predicts Accuracy Sigmoidally for the majority of Psych-101 tasks ($r^2 > 0.4$ for 15 out of 24 tasks).
Figure 10
Compute Scaling Laws for WVS. We observe log-linear improvements in task loss on all WVS subtasks by scaling compute alone with models trained on DCLM (Li et al., 2024) from $10^{18}$ to $10^{20}$ FLOPs. B. Additional Compute Scaling Laws. We provide detailed plots of compute scaling laws for each subtask of WVS in …
Figure 11
Compute Scaling Laws for Psych-101. We observe log-linear improvements in task loss on all Psych-101 subtasks by scaling compute alone with models trained on DCLM (Li et al., 2024) from $10^{18}$ to $10^{20}$ FLOPs. However, the slope is not significant at p < 0.05 for Plonsky et al. (2018), somerville2017charting, bahrami2020four, Sadeghiyeh et al. (2020), Waltz et al. (2020), and Schulz et al. (2020).
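The Figure 4 caption reports both a Spearman and a Pearson correlation between observational fit and pre-training term frequency. The sketch below shows, under invented data, how the two coefficients can be computed and why they may differ: Pearson measures linear association on the raw values, while Spearman applies the same formula to ranks. The variable names and the five data points are hypothetical and do not come from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical per-population values; not data from the paper.
log_term_frequency = [3.1, 4.0, 4.6, 5.2, 6.0]
scaling_fit_r2 = [0.15, 0.22, 0.41, 0.38, 0.75]

rho = spearman(log_term_frequency, scaling_fit_r2)
r = pearson(log_term_frequency, scaling_fit_r2)
```

Reporting both coefficients, as the caption does, guards against a relationship that is monotone but not linear, which would raise Spearman's value relative to Pearson's.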
Original abstract

Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates whether LLM scaling improves fidelity in social simulations across three sub-domains (opinion modeling, behavioral simulation, longitudinal forecasting). Using 85 Qwen3 transformers pretrained on the DCLM corpus under fixed compute budgets ($10^{18}$ to $10^{20}$ FLOPs) plus 35 larger open-weight models up to 70B parameters, it reports strong compute scaling for most tasks, especially for populations well represented in English web text and for tasks correlated with MMLU; slower scaling for longitudinal forecasting and underrepresented opinions; and no improvement in calibration to cognitive biases or heuristics, even after fine-tuning.

Significance. If the results hold, the work supplies controlled empirical evidence that general scaling laws extend to many social-simulation tasks while identifying concrete outliers (bias calibration, low-resource domains). The explicit compute-budget pretraining of 85 models and downstream loss-to-accuracy prediction constitute a reproducible strength that can guide future simulation research.

major comments (3)
  1. [Discussion / Conclusion] The central claim that scaling improves fidelity 'in most settings' rests on the representativeness of the three chosen sub-domains and their concrete tasks; the manuscript provides no quantitative sampling argument or coverage analysis showing these tasks capture the broader distribution of social-simulation needs (e.g., multi-agent coordination, cultural transmission, non-English contexts).
  2. [§3 / Results] §3 (Experimental Setup) and the results sections omit full task definitions, exact metrics, statistical procedures, and error analysis; without these the reported 'rapid improvement' and 'strong compute scaling' cannot be independently verified.
  3. [Behavioral Simulation Results] The claim that scaling fails for bias-calibration and heuristic tasks is supported only up to 8B parameters (fine-tuned); it is unclear whether the same pattern persists in the 35 larger models evaluated up to 70B or whether the loss-to-accuracy regression was applied to these outlier tasks.
minor comments (2)
  1. [§3] Notation for the loss-to-accuracy mapping and the precise definition of 'fidelity' should be stated explicitly in the methods.
  2. [Figures] Figure captions and axis labels for the scaling plots should include the exact number of models and compute range for each curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive report and the recognition of the paper's controlled experimental design. We address each major comment below with specific plans for revision where appropriate.

  1. Referee: [Discussion / Conclusion] The central claim that scaling improves fidelity 'in most settings' rests on the representativeness of the three chosen sub-domains and their concrete tasks; the manuscript provides no quantitative sampling argument or coverage analysis showing these tasks capture the broader distribution of social-simulation needs (e.g., multi-agent coordination, cultural transmission, non-English contexts).

    Authors: We selected the three sub-domains because they represent the most common categories in existing LLM social simulation literature (opinion polling, individual decision tasks, and multi-step forecasting). We agree that the manuscript lacks a quantitative coverage analysis or sampling argument. In revision we will add a dedicated paragraph in the Discussion section that explicitly lists the scope limitations, including the absence of multi-agent coordination and non-English settings, and note that the 'most settings' claim is conditioned on the evaluated task distribution rather than a formal statistical sample of all possible social simulation needs. revision: partial

  2. Referee: [§3 / Results] §3 (Experimental Setup) and the results sections omit full task definitions, exact metrics, statistical procedures, and error analysis; without these the reported 'rapid improvement' and 'strong compute scaling' cannot be independently verified.

    Authors: All task definitions, exact metrics (including accuracy, calibration error, and correlation coefficients), statistical procedures (bootstrap confidence intervals and regression details), and error analysis appear in the appendix. We acknowledge that the main text does not sufficiently cross-reference these details. In the revised manuscript we will expand §3 with concise definitions and metric formulas for the primary tasks and add explicit pointers to the appendix for full specifications and statistical methods. revision: yes

  3. Referee: [Behavioral Simulation Results] The claim that scaling fails for bias-calibration and heuristic tasks is supported only up to 8B parameters (fine-tuned); it is unclear whether the same pattern persists in the 35 larger models evaluated up to 70B or whether the loss-to-accuracy regression was applied to these outlier tasks.

    Authors: The fine-tuning experiments on the bias-calibration and heuristic tasks used Qwen2.5 and Llama3 models from 0.5B to 8B parameters; the 35 larger open-weight models were used only for the loss-to-accuracy regressions on the main opinion and behavioral tasks. We will revise the Behavioral Simulation Results section to state this scope explicitly and note that the loss-to-accuracy analysis was not applied to the bias-calibration outliers because those tasks were not run on the larger models. We will also add a sentence indicating that extrapolation from the observed flat scaling up to 8B suggests limited improvement at larger scales, while acknowledging the absence of direct measurements beyond 8B. revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical measurements against human data


The paper derives its conclusions from explicit evaluations of 85 Qwen3 models and 35 larger open-weight models on opinion modeling, behavioral simulation, and longitudinal forecasting tasks, measuring fidelity directly against human data and observing compute scaling trends. No equation or claim reduces a prediction to a fitted parameter by construction, no self-citation bears the load of the central result, and no ansatz or uniqueness theorem is imported to force the outcome. The scaling relationships and outlier identification are data-driven observations, not self-referential definitions or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical scaling study. No theoretical axioms, free parameters fitted to the target result, or new invented entities are introduced in the abstract.



Reference graph

Works this paper leans on

55 extracted references · 36 canonical work pages · 15 internal anchors

  1. [1]

    LLM social simulations are a promising research method

    Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. LLM social simulations are a promising research method. arXiv preprint arXiv:2504.02234.

  2. [2]

    Revealing the structure of language model capabilities

    Ryan Burnell, Han Hao, Andrew RA Conway, and Jose Hernandez Orallo. Revealing the structure of language model capabilities. arXiv preprint arXiv:2306.10062.

  3. [3]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

  4. [4]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  6. [6]

    LLM-in-the-loop: Leveraging large language models for thematic analysis

    Shih-Chieh Dai, Aohan Xiong, and Lun-Wei Ku. LLM-in-the-loop: Leveraging large language models for thematic analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

  7. [7]

    Can generative AI agents behave like humans? Evidence from laboratory market experiments

    R Maria del Rio-Chanona, Marco Pangallo, and Cars Hommes. Can generative AI agents behave like humans? Evidence from laboratory market experiments. arXiv preprint arXiv:2505.07457.

  8. [8]

    Take caution in using LLMs as human surrogates: Scylla ex machina

    Yuan Gao, Dokyun Lee, Gordon Burtch, and Sina Fazelpour. Take caution in using LLMs as human surrogates: Scylla ex machina. arXiv preprint arXiv:2410.19599.

  9. [9]

    LLMs model non-WEIRD populations: Experiments with synthetic cultural agents

    Augusto Gonzalez-Bonorino, Monica Capra, and Emilio Pantoja. LLMs model non-WEIRD populations: Experiments with synthetic cultural agents. arXiv preprint arXiv:2501.06834.

  10. [10]

    The Llama 3 Herd of Models

    URL https://arxiv.org/abs/2407.21783.

    Kobi Hackenburg, Ben M Tappin, Paul Röttger, Scott A Hale, Jonathan Bright, and Helen Margetts. Scaling language model size yields diminishing returns for single-message political persuasion. Proceedings of the National Academy of Sciences, 122(10):e2413443122.

  11. [11]

    Evaluating large language models in generating synthetic HCI research data: a case study

    Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. Evaluating large language models in generating synthetic HCI research data: a case study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–19.

  12. [12]

    Relative scaling laws for LLMs

    Open Athena Blog.

    William Held, David Hall, Percy Liang, and Diyi Yang. Relative scaling laws for LLMs. arXiv preprint arXiv:2510.24626.

  13. [13]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

  14. [14]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  15. [15]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  16. [16]

    Americans' changing lives: Waves I–VI, 1986–2021

    James S. House, Sarah A. Burgard, Margaret T. Hicken, and Paula M. Lantz. Americans' changing lives: Waves I–VI, 1986–2021.

  17. [17]

    SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

    Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, and Paul Röttger. SimBench: Benchmarking the ability of large language models to simulate human behaviors. arXiv preprint arXiv:2510.17516.

  18. [18]

    Donald Trumps in the virtual polls: Simulating and predicting public opinions in surveys using large language models

    Shapeng Jiang, Lijia Wei, and Chen Zhang. Donald Trumps in the virtual polls: Simulating and predicting public opinions in surveys using large language models. arXiv preprint arXiv:2411.01582.

  19. [19]

    AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

    Junsol Kim and Byungkyu Lee. AI-augmented surveys: Leveraging large language models and surveys for opinion prediction. arXiv preprint arXiv:2305.09620.

  20. [20]

    Finetuning LLMs for human behavior prediction in social science experiments

    Akaash Kolluri, Shengguang Wu, Joon Sung Park, and Michael S Bernstein. Finetuning LLMs for human behavior prediction in social science experiments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 30084–30099.

  21. [21]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.

  22. [22]

    Large language models assume people are more rational than we really are

    Ryan Liu, Jiayi Geng, Joshua C Peterson, Ilia Sucholutsky, and Thomas L Griffiths. Large language models assume people are more rational than we really are. arXiv preprint arXiv:2406.17055.

  23. [23]

    Human multi-task learning: The why and what

    T. Ludwig, M. Siegel, and E. Schulz. Human multi-task learning: The why and what. http://dx.doi.org/10.32470/CCN.2023.1528-0.

  24. [24]

    Scaling laws for economic productivity: Experimental evidence in LLM-assisted translation

    Ali Merali. Scaling laws for economic productivity: Experimental evidence in LLM-assisted translation. arXiv preprint arXiv:2409.02391.

  25. [25]

    From individual to society: A survey on social simulation driven by large language model-based agents

    Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, et al. From individual to society: A survey on social simulation driven by large language model-based agents. arXiv preprint arXiv:2412.03563.

  26. [26]

    LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

    Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 2024a.

    Peter S Park, Philipp Schoenegger, and Chongyang Zhu. Diminished diversity-of-thought in a standard large language model...

  27. [27]

    Limited ability of LLMs to simulate human psychological behaviours: a psychometric analysis

    Nikolay B Petrov, Gregory Serapio-García, and Jason Rentfrow. Limited ability of LLMs to simulate human psychological behaviours: a psychometric analysis. arXiv preprint arXiv:2405.07248.

  28. [28]

    Qwen2.5 Technical Report

    Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

  29. [29]

    LLM economicus? Mapping the behavioral biases of LLMs via utility theory

    Jillian Ross, Yoon Kim, and Andrew W Lo. LLM economicus? Mapping the behavioral biases of LLMs via utility theory. arXiv preprint arXiv:2408.02784.

  30. [30]

    Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

    URL https://arxiv.org/abs/2304.15004.

    Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, and Sanmi Koyejo. Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

  31. [31]

    Large language models in qualitative research: Uses, tensions, and intentions

    URL https://arxiv.org/abs/2406.04391.

    Hannah Schroeder, Marine Aubin Le Quéré, Cristina Randazzo, David Mimno, and Sarita Schoenebeck. Large language models in qualitative research: Uses, tensions, and intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–17.

  32. [32]

    Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In International Conference on Learning Representations, volume 2024, pp. 25055–25083.

  33. [33]

    Towards execution-grounded automated AI research

    URL https://huggingface.co/blog/codelion/optimal-dataset-mixing/.

    Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, and Tatsunori Hashimoto. Towards execution-grounded automated AI research. arXiv preprint arXiv:2601.14525.

  34. [34]

    Spectrum tuning: Post-training for distributional coverage and in-context steerability

    Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, and Yejin Choi. Spectrum tuning: Post-training for distributional coverage and in-context steerability. arXiv preprint arXiv:2510.06084.

  35. [35]

    MuSR: Testing the limits of chain-of-thought with multistep soft reasoning

    Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. arXiv preprint arXiv:2310.16049.

  36. [36]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html. Blog post.

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

  37. [37]

    Systematic biases in LLM simulations of debates

    Amir Taubenfeld, Yaniv Dover, Roi Reichart, and Ariel Goldstein. Systematic biases in LLM simulations of debates. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 251–267.

  38. [38]

    Not yet: Large language models cannot replace human respondents for psychometric research

    Pengda Wang, Huiqi Zou, Zihan Yan, Feng Guo, Tianjun Sun, Ziang Xiao, and Bo Zhang. Not yet: Large language models cannot replace human respondents for psychometric research. OSF Preprint: https://doi.org/10.31219/osf.io/rwy9b, 2024a.

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Z...

  39. [39]

    Humans use directed and random exploration to solve the explore–exploit dilemma

    Robert C Wilson, Andra Geana, John M White, Elliot A Ludvig, and Jonathan D Cohen. Humans use directed and random exploration to solve the explore–exploit dilemma. Journal of Experimental Psychology: General, 143(6):2074.

  40. [40]

    World Values Survey Wave 7 (2017–2022)

    World Values Survey Association. World Values Survey Wave 7 (2017–2022).

  41. [41]

    LLM-based social simulations require a boundary

    Zengqing Wu, Run Peng, Takayuki Ito, Makoto Onizuka, and Chuan Xiao. LLM-based social simulations require a boundary. arXiv preprint arXiv:2506.19806.

  42. [42]

    Qwen3 Technical Report

    URL https://arxiv.org/abs/2505.09388.

    Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, et al. OASIS: Open agent social interaction simulations with one million agents. arXiv preprint arXiv:2411.11581.

  43. [43]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

  44. [44]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

  45. [45]

    Mind the Sim2Real gap in user simulation for agentic tasks

    Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, et al. Mind the Sim2Real gap in user simulation for agentic tasks. arXiv preprint arXiv:2603.11245, 2026.
