Calibrating Behavioral Parameters with Large Language Models
Pith reviewed 2026-05-16 08:54 UTC · model grok-4.3
The pith
Large language models can be calibrated with behavioral profiles to measure loss aversion, herding, and extrapolation at or above human benchmark levels for asset pricing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Profile-based calibration of LLMs induces large, stable, and theoretically coherent shifts in behavioral parameters, with calibrated loss aversion, herding, extrapolation, and anchoring reaching or exceeding benchmark magnitudes, and calibrated extrapolation in an agent-based asset pricing model generates short-horizon momentum and long-horizon reversal patterns consistent with empirical evidence.
What carries the argument
Profile-based prompting that treats LLMs as calibrated measurement instruments for eight canonical behavioral biases.
If this is right
- Calibrated parameters reach or exceed human benchmark magnitudes for loss aversion, herding, extrapolation, and anchoring.
- Calibrated extrapolation in an agent-based asset pricing model produces short-horizon momentum and long-horizon reversal consistent with empirical evidence.
- The framework supplies explicit measurement ranges and boundaries for eight canonical behavioral biases.
- Baseline LLM behavior exhibits systematic rationality bias including attenuated loss aversion and near-zero disposition effects.
Where Pith is reading between the lines
- The approach could allow researchers to generate large populations of heterogeneous agents with controlled bias profiles without new surveys or experiments.
- If the calibration functions prove stable, they could be reused across different market models to study interactions among multiple biases simultaneously.
- The method might extend to calibrating behavioral parameters in macroeconomic or policy simulation models where direct measurement is equally difficult.
Load-bearing premise
That prompting LLMs with behavioral profiles produces parameters that remain stable across models, scenarios, and time and that inserting those parameters into agent-based models yields dynamics that reflect human behavior rather than artifacts of the prompting process.
What would settle it
Running the same profile prompts on multiple LLMs at different times and finding that the extracted parameters for loss aversion or extrapolation vary by more than the reported stability margin, or finding that the agent-based model with calibrated extrapolation fails to produce momentum and reversal patterns when tested against new market data.
Behavioral parameters such as loss aversion, herding, and extrapolation are central to asset pricing models but remain difficult to measure reliably. We develop a framework that treats large language models (LLMs) as calibrated measurement instruments for behavioral parameters. Using four models and 24,000 agent-scenario pairs, we document systematic rationality bias in baseline LLM behavior, including attenuated loss aversion, weak herding, and near-zero disposition effects relative to human benchmarks. Profile-based calibration induces large, stable, and theoretically coherent shifts in several parameters, with calibrated loss aversion, herding, extrapolation, and anchoring reaching or exceeding benchmark magnitudes. To assess external validity, we embed calibrated parameters in an agent-based asset pricing model, where calibrated extrapolation generates short-horizon momentum and long-horizon reversal patterns consistent with empirical evidence. Our results establish measurement ranges, calibration functions, and explicit boundaries for eight canonical behavioral biases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a framework treating LLMs as calibrated instruments for measuring behavioral parameters (loss aversion, herding, extrapolation, anchoring) in asset pricing. Using four models and 24,000 agent-scenario pairs, it reports baseline LLM rationality biases relative to human benchmarks, shows profile-based calibration produces large, stable, theoretically coherent parameter shifts reaching or exceeding benchmarks, and validates by embedding calibrated parameters in an agent-based asset pricing model where extrapolation generates short-horizon momentum and long-horizon reversal consistent with empirical evidence. The work establishes measurement ranges and explicit boundaries for eight biases.
Significance. If the calibration functions prove robust, the approach could supply a scalable method for quantifying parameters that are otherwise difficult to measure directly, improving the micro-foundations of agent-based models in finance. The multi-model design and ABM embedding step are constructive elements that ground the claims in both measurement and dynamic implications.
major comments (3)
- [Abstract] Abstract: the central claim that profile-based calibration induces 'large, stable, and theoretically coherent shifts' rests on summarized outcomes; the abstract supplies no prompting protocols, statistical significance tests, robustness checks to model choice, or exclusion criteria, leaving the reliability of the calibration functions unevaluated.
- [Results] Results section: results are reported from four LLMs and 24,000 pairs but no cross-model variance statistics or consistency metrics for the calibration mappings are provided, so the invariance assumption required for treating LLMs as stable measurement instruments remains untested.
- [ABM validation] ABM validation section: calibrated extrapolation is shown to generate momentum and reversal patterns 'consistent with empirical evidence,' yet both the calibration targets (human benchmarks) and the validation targets (market patterns) derive from observed human behavior, creating a moderate circularity risk that weakens the external-validity interpretation.
minor comments (2)
- [Abstract] The abstract states '24,000' without a comma; adopt consistent numeric formatting throughout.
- [Introduction] Define the eight canonical behavioral biases with explicit functional forms or references in the main text before presenting calibration results.
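To illustrate what an explicit functional form could look like for one of the eight biases, the block below writes out the prospect-theory value function of Tversky and Kahneman (1992), in which λ is the loss-aversion coefficient. Whether the paper adopts this exact form is an assumption; the parameter values shown are the median estimates from that 1992 study, not results from the paper under review.

```latex
v(x) =
\begin{cases}
x^{\alpha} & \text{if } x \ge 0, \\
-\lambda\,(-x)^{\beta} & \text{if } x < 0,
\end{cases}
\qquad \alpha = \beta = 0.88,\quad \lambda = 2.25 .
```

With λ > 1, a loss weighs more than a gain of the same size. Under this form, the "attenuated loss aversion" reported for baseline LLM behavior would correspond to an estimated λ below the human benchmark.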
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below. Revisions have been made to strengthen the presentation of methods, add cross-model statistics, and clarify the validation logic. We believe these changes improve the manuscript without altering its core claims.
- Referee: [Abstract] Abstract: the central claim that profile-based calibration induces 'large, stable, and theoretically coherent shifts' rests on summarized outcomes; the abstract supplies no prompting protocols, statistical significance tests, robustness checks to model choice, or exclusion criteria, leaving the reliability of the calibration functions unevaluated.
  Authors: We agree the abstract is highly condensed. Due to length limits, it summarizes rather than details protocols. Prompting templates, exact statistical tests (t-tests and Wilcoxon on parameter shifts), model-robustness tables, and exclusion rules (e.g., responses with <80% coherence) are fully reported in Sections 2.2, 3.1, and 4.1. We have revised the abstract to add one sentence noting the four-model design, 24,000-pair sample, and robustness across LLMs.
  revision: yes
- Referee: [Results] Results section: results are reported from four LLMs and 24,000 pairs but no cross-model variance statistics or consistency metrics for the calibration mappings are provided, so the invariance assumption required for treating LLMs as stable measurement instruments remains untested.
  Authors: We accept this point. The original draft reported only pooled results. We have added a new subsection (3.3) that computes (i) the standard deviation of each calibrated parameter across the four models, (ii) pairwise correlations of the calibration functions, and (iii) a consistency index (the fraction of parameters whose sign and significance agree across models). These metrics indicate low cross-model variance for loss aversion, herding, and extrapolation, supporting the invariance assumption. The revised tables are now included.
  revision: yes
- Referee: [ABM validation] ABM validation section: calibrated extrapolation is shown to generate momentum and reversal patterns 'consistent with empirical evidence,' yet both the calibration targets (human benchmarks) and the validation targets (market patterns) derive from observed human behavior, creating a moderate circularity risk that weakens the external-validity interpretation.
  Authors: We disagree that this constitutes circularity. Calibration targets are micro-level parameters recovered from controlled laboratory experiments (Kahneman & Tversky 1979; Barberis et al. 2016). Validation targets are macro-level return patterns documented in asset-pricing studies (Jegadeesh & Titman 1993; De Bondt & Thaler 1985). The ABM tests whether parameters fitted to individual experimental data can reproduce aggregate market regularities: an explicit micro-to-macro mapping that is not tautological. We have added a clarifying paragraph in Section 5.2 distinguishing the two data sources and noting that market patterns were never used in calibration.
  revision: partial
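The three cross-model metrics named in the authors' second response (per-parameter standard deviation, pairwise correlation of calibration functions, and a sign-and-significance consistency index) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the dictionary layouts, and the example numbers are all hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cross_model_sd(calibrated):
    """Sample SD of each parameter across models.

    calibrated: {model_name: {param_name: value}} (hypothetical layout).
    """
    params = list(next(iter(calibrated.values())))
    return {p: stdev([vals[p] for vals in calibrated.values()]) for p in params}


def pairwise_correlations(curves):
    """Correlation between every pair of models' calibration curves.

    curves: {model_name: [parameter value at each profile intensity]}.
    """
    return {
        (a, b): pearson(curves[a], curves[b])
        for a, b in combinations(sorted(curves), 2)
    }


def consistency_index(shifts, significant):
    """Fraction of parameters whose shift sign and significance flag
    are identical across all models."""
    models = list(shifts)
    params = list(shifts[models[0]])
    agree = 0
    for p in params:
        signs = {shifts[m][p] > 0 for m in models}
        flags = {significant[m][p] for m in models}
        if len(signs) == 1 and len(flags) == 1:
            agree += 1
    return agree / len(params)


if __name__ == "__main__":
    # Made-up numbers for two hypothetical models, for illustration only.
    shifts = {
        "model_a": {"loss_aversion": 0.9, "herding": 0.4, "anchoring": 0.2},
        "model_b": {"loss_aversion": 1.1, "herding": 0.3, "anchoring": -0.1},
    }
    significant = {
        "model_a": {"loss_aversion": True, "herding": True, "anchoring": False},
        "model_b": {"loss_aversion": True, "herding": True, "anchoring": False},
    }
    print(cross_model_sd(shifts))
    print(consistency_index(shifts, significant))
```

A low standard deviation and a high consistency index would support the invariance assumption; the thresholds for "low" and "high" are a judgment the sketch does not settle.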
Circularity Check
No significant circularity in derivation chain
The paper calibrates LLM outputs to match external human behavioral benchmarks for parameters such as loss aversion and extrapolation, then embeds the resulting values in a standard agent-based asset pricing model to check whether they reproduce known aggregate market patterns (short-horizon momentum, long-horizon reversal). These steps are independent: the calibration targets are micro-level individual biases drawn from separate human-subject studies, while the validation targets are macro-level price dynamics from market data. No equations, definitions, or self-citations reduce any claimed result to its own inputs by construction. The framework therefore remains self-contained against external benchmarks.
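The micro-to-macro step described above can be made concrete with a toy simulation. The sketch below is not the paper's agent-based model: it is a hypothetical linear market in which extrapolators trade on an exponentially weighted signal of past returns and fundamentalists trade against the gap between price and a random-walk fundamental. All names and parameter values are assumptions chosen for illustration. The autocorrelation helper is the kind of diagnostic one would inspect for positive short-lag values (momentum) and negative long-lag values (reversal); whether those signs appear depends on the parameters and should be checked, not assumed.

```python
import random


def simulate_returns(n_steps=2000, extrap_weight=0.5, memory=0.9,
                     reversion=0.05, noise_sd=1.0, seed=0):
    """Simulate per-period price changes in a toy two-type market.

    Hypothetical dynamics (an illustration, not the paper's model):
      r_t      = extrap_weight * signal + reversion * (fundamental - price)
      signal   = memory * signal + (1 - memory) * r_t
    """
    rng = random.Random(seed)
    price, fundamental, signal = 0.0, 0.0, 0.0
    returns = []
    for _ in range(n_steps):
        # Fundamental value follows a random walk driven by news.
        fundamental += rng.gauss(0.0, noise_sd)
        # Extrapolators chase the recent-return signal; fundamentalists
        # push the price back toward the fundamental value.
        r = extrap_weight * signal + reversion * (fundamental - price)
        price += r
        # Update the extrapolators' exponentially weighted return signal.
        signal = memory * signal + (1.0 - memory) * r
        returns.append(r)
    return returns


def autocorr(x, lag):
    """Sample autocorrelation of x at the given non-negative lag."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[t] - m) * (x[t + lag] - m) for t in range(n - lag))
    return cov / var


if __name__ == "__main__":
    rets = simulate_returns()
    for lag in (1, 5, 20, 60):
        print(lag, round(autocorr(rets, lag), 3))
```

In a calibrated version, `extrap_weight` would be replaced by the extrapolation parameter recovered from the LLM calibration step, which is the only link between the micro-level and macro-level data in this design.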
Axiom & Free-Parameter Ledger
free parameters (1)
- profile descriptors for calibration
axioms (1)
- domain assumption LLMs can simulate human-like decision biases when appropriately prompted and calibrated
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: relation between the paper passage and the cited Recognition theorem.
  "Profile-based calibration induces large, stable, and theoretically coherent shifts in several parameters... embed calibrated parameters in an agent-based asset pricing model"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  unclear: relation between the paper passage and the cited Recognition theorem.
  "We develop a framework that treats large language models (LLMs) as calibrated measurement instruments for behavioral parameters"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Gati Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. International Conference on Machine Learning, pages 337–371, 2023.
- [2] Lisa R Anderson and Charles A Holt. Information cascades in the laboratory. American Economic Review, 87(5):847–862, 1997.
- [3] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
- [4] Nicholas Barberis and Ming Huang. Stocks as lotteries: The implications of probability weighting for security prices. American Economic Review, 98(5):2066–2100, 2008.
- [5] Nicholas Barberis, Andrei Shleifer, and Robert Vishny. A model of investor sentiment. Journal of Financial Economics, 49(3):307–343, 1998.
- [6] Nicholas Barberis, Robin Greenwood, Lawrence Jin, and Andrei Shleifer. Extrapolation and bubbles. Journal of Financial Economics, 129(2):203–227, 2018.
- [7] Nicholas C Barberis. Thirty years of prospect theory in economics: A review and assessment. Journal of Economic Perspectives, 27(1):173–196, 2013.
- [8] Victor L Bernard and Jacob K Thomas. Post-earnings-announcement drift: Delayed price response or risk premium? Journal of Accounting Research, 27:1–36, 1989.
- [9] Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of Political Economy, 100(5):992–1026, 1992.
- [10] Robert Bloomfield and Jeffrey Hales. Predicting the next step of a random walk: Experimental evidence of regime-shifting beliefs. Journal of Financial Economics, 65(3):397–414, 2002.
- [11] James Brand, Ayelet Israeli, and Donald Ngwe. Using GPT for market research. Marketing Unit Working Paper 23-062, Harvard Business School, 2023.
- [12] William A Brock and Cars H Hommes. Heterogeneous beliefs and routes to chaos in a simple asset pricing model. Journal of Economic Dynamics and Control, 22(8-9):1235–1274, 1998.
- [13] Colin F Camerer. The promise and success of lab-field generalizability in experimental economics: A critical reply to Levitt and List. Available at SSRN 1977749, 2011.
- [14] Bogachan Celen and Shachar Kariv. Distinguishing informational cascades from herd behavior in the laboratory. American Economic Review, 94(3):484–498, 2004.
- [15] Kent Daniel, David Hirshleifer, and Avanidhar Subrahmanyam. Investor psychology and security market under- and overreactions. Journal of Finance, 53(6):1839–1885, 1998.
- [16] Thomas Dohmen, Armin Falk, David Huffman, Uwe Sunde, Jürgen Schupp, and Gert G Wagner. Individual risk attitudes: Measurement, determinants, and behavioral consequences. Journal of the European Economic Association, 9(3):522–550, 2011.
- [17] Robin Greenwood and Andrei Shleifer. Expectations of returns and expected returns. Review of Financial Studies, 27(3):714–746, 2014.
- [18] Charles A Holt and Susan K Laury. Risk aversion and incentive effects. American Economic Review, 92(5):1644–1655, 2002.
- [19] John J Horton. Large language models as simulated economic agents: What can we learn from homo silicus? National Bureau of Economic Research Working Paper, (31122), 2023.
- [20] John J Horton. Can large language models simulate human behavior in economic experiments? Working Paper, 2024.
- [21] Narasimhan Jegadeesh and Sheridan Titman. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance, 48(1):65–91, 1993.
- [22] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.
- [23] Daniel Kahneman, Jack L Knetsch, and Richard H Thaler. Experimental tests of the endowment effect and the Coase theorem. Journal of Political Economy, 98(6):1325–1348, 1990.
- [24] Michael P Keane. Structural vs. atheoretic approaches to econometrics. Journal of Econometrics, 156(1):3–20, 2011.
Large Language Models as Calibrated Measurement Instruments for Behavioral Parameters. A PREPRINT
- [25] Josef Lakonishok, Andrei Shleifer, and Robert W Vishny. Contrarian investment, extrapolation, and risk. Journal of Finance, 49(5):1541–1578, 1994.
- [26] Blake LeBaron. Empirical regularities from interacting long- and short-memory investors in an agent-based stock market. IEEE Transactions on Evolutionary Computation, 5(5):442–455, 2001.
- [27] Thomas Lux and Michele Marchesi. Scaling and criticality in a stochastic multi-agent model of a financial market. Nature, 397(6719):498–500, 1999.
- [28] Qing Mei et al. Quantifying and mitigating memorization in large language models. arXiv preprint, 2024.
- [29] Don A Moore and Paul J Healy. The trouble with overconfidence. Psychological Review, 115(2):502–517, 2008.
- [30] Thomas Mussweiler, Fritz Strack, and Tim Pfeiffer. Overcoming the inevitable anchoring effect: Considering the opposite compensates for selective accessibility. Personality and Social Psychology Bulletin, 26(9):1142–1150, 2000.
- [31] Gregory B Northcraft and Margaret A Neale. Experts, amateurs, and real estate: An anchoring-and-adjustment perspective on property pricing decisions. Organizational Behavior and Human Decision Processes, 39(1):84–97, 1987.
- [32] Nathan Novemsky and Daniel Kahneman. The boundaries of loss aversion. Journal of Marketing Research, 42(2):119–128, 2005.
- [33] Terrance Odean. Are investors reluctant to realize their losses? Journal of Finance, 53(5):1775–1798, 1998.
- [34] Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In UIST, 2023.
- [35] Hersh Shefrin and Meir Statman. The disposition to sell winners too early and ride losers too long: Theory and evidence. Journal of Finance, 40(3):777–790, 1985.
- [36] Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
- [37] Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.
- [38] Martin Weber and Colin F Camerer. The disposition effect in securities trading: An experimental analysis. Journal of Economic Behavior & Organization, 33(2):167–184, 1998.
Appendix A Human Benchmark Justification
This appendix provides detailed justification ...
[39]
Search each asset identifier on Google (exact phrase match)
-
[40]
Search on Bing, Yahoo Finance, Bloomberg Terminal
-
[41]
Search SEC EDGAR filings
-
[42]
Procedure documented and replicable
Search financial news archives (WSJ, FT, Bloomberg News) Zero exact matches confirm non-existence in accessible training data. Procedure documented and replicable. E.2 Power Analysis Details For each experiment, we compute power using simulation-based approach: Disposition Effect: • Null: DR = 1.0 (no bias) • Alternative: DR = 1.6 (human benchmark) • Samp...
work page 2000