pith. sign in

arxiv: 2606.04978 · v1 · pith:CNUNK65Bnew · submitted 2026-06-03 · 💻 cs.CL · cs.CY· econ.GN· q-fin.EC

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Pith reviewed 2026-06-28 06:03 UTC · model grok-4.3

classification 💻 cs.CL cs.CYecon.GNq-fin.EC
keywords LLM alignmentrisk decision-makingSt. Petersburg paradoxbehavioral alignmentmechanism alignmentprompt variantsinstruction tuning
0
0 comments X

The pith

LLMs produce human-like finite bids in the St. Petersburg game but switch to rational calculations under prompt variants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests LLMs on the St. Petersburg game, where humans typically offer low finite amounts despite infinite expected value. Most of 28 models give similar finite bids in the base version, creating an appearance of aligned risk behavior. When prompts alter truncation, repetition, endowment, or identity, the models often adopt conditionally rational strategies instead of keeping human-like patterns. Human-perspective prompts and instruction tuning lower some bids and reduce odd outputs, yet the patterns across variants stay largely the same. This demonstrates that outcome resemblance can occur without mechanism consistency, so evaluations need to probe deeper than final answers.

Core claim

In the original St. Petersburg game most of the 28 LLMs generate finite bids that resemble human reports, yet the controlled variants reveal that models shift toward conditionally and computationally rational behavior rather than preserving human-like response patterns. Human-cue prompting and instruction tuning lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms.

What carries the argument

Structured prompt suite with controlled variants that perturb truncation, repeated play, numeric endowment, and occupational identity, together with a human-perspective prompt, used to separate outcome-level resemblance from mechanism-level alignment.

If this is right

  • Outcome similarity in risk tasks does not imply mechanism consistency across decision variants.
  • Instruction tuning and human-perspective prompts change some output values without aligning core response patterns to game changes.
  • Evaluations of LLM decision-making must include mechanism probes in addition to final choice similarity.
  • The St. Petersburg game with systematic perturbations serves as a testbed for detecting surface-only behavioral alignment.
  • High-stakes applications should examine whether alignment rests on consistent internal rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment methods that match outputs alone may leave gaps in other uncertainty-heavy tasks such as planning or ethics.
  • Training objectives could add consistency checks across prompt variants to target mechanism alignment directly.
  • The variant-probe approach could extend to multi-step or interactive decision settings to test deeper consistency.
  • Release evaluations might incorporate mechanism metrics alongside standard benchmark scores.

Load-bearing premise

The structured prompt variants isolate genuine differences in the models' underlying decision mechanisms rather than simply changing how the models phrase their answers.

What would settle it

If models produced the same finite bid patterns and did not shift toward rational calculations when truncation is removed or the game is repeated many times, the claim of mechanism-level misalignment would not hold.

Figures

Figures reproduced from arXiv: 2606.04978 by Changyu Chen, Chensong Huang, Chenwei Lin, Hanjia Lyu, Jiebo Luo, Xian Xu.

Figure 1
Figure 1. Figure 1: Conceptual overview of the study. The St. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The prompt for RQ1. The prompt describes the original St. Petersburg game, including the repeated coin-flipping mechanism and exponentially increasing payout structure, and asks the model to report the max￾imum amount it is willing to pay to participate. Re￾sponses are constrained to a single numerical dollar value for quantitative analysis. ent professional identities associated with different levels of r… view at source ↗
Figure 3
Figure 3. Figure 3: Maximum willingness-to-pay distributions in the original game. Each density summarizes the model-level [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mechanism probe response distributions. Each row reports the number of models out of 28 classified [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Human-cue prompt state transitions at τ = 0. The Sankey diagram summarizes mechanism state transitions from plain prompts to the minimal human￾cue across 112 mechanism probe profiles. Human-cue prompt improves 23 of 112 mechanism transitions but leaves 73 of 112 unchanged. Please zoom in for details [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Instruction tuning state transitions at τ = 0 for matched base/instruct model pairs. The Sankey diagram summarizes mechanism state transitions from base to instruct across 42 mechanism probe profiles. Instruction tuning lowers willingness to pay in 25 of 48 comparisons, but only 10 of 42 mechanism-state transitions improve toward the human-like region, and 30 of 42 remain unchanged. Please zoom in for deta… view at source ↗
Figure 7
Figure 7. Figure 7: Experimental framework. The upper panel defines the benchmark, the [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: The prompt for RQ3. The prompt prepends a minimal human-identity cue before the original St. Pe￾tersburg game description to encourage human-oriented reasoning. The model is then asked to report the max￾imum amount it is willing to pay to participate, with responses constrained to a single numerical dollar value for quantitative analysis. the main-text conclusion: instruction tuning often reshapes visible … view at source ↗
Figure 10
Figure 10. Figure 10: EV-first prompt template [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Human-cue state transitions at τ = 0.7 under the plain prompt setting. The Sankey diagram tracks mechanism-state transitions from plain prompts to the minimal human cue. The upper summary bar reports state-transition categories over the four mechanism probes, while the lower summary bar reports willingness-to-pay changes over the original game and mechanism profiles [PITH_FULL_IMAGE:figures/full_fig_p016… view at source ↗
Figure 12
Figure 12. Figure 12: Instruction-tuning state transitions at τ = 0.7 under the plain prompt setting. The Sankey diagram compares matched base and instruction-tuned models. The upper summary bar reports mechanism-state transitions over the four mechanism probes, while the lower summary bar reports willingness-to-pay changes including the original game [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Original-game willingness-to-pay distributions under EV-first prompt. The figure mirrors the RQ1 [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Mechanism-probe distributions under EV-first prompt. Each row reports the number of models classified [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Human-cue state transitions under EV-first prompt at [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Human-cue state transitions under EV-first prompt at [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Instruction-tuning state transitions under EV-first prompt at [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Instruction-tuning state transitions under EV-first prompt at [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
read the original abstract

LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates 28 LLMs on the St. Petersburg paradox using a structured prompt suite that includes the base game, variants perturbing truncation/repeated play/numeric endowment/occupational identity, a human-perspective instruction, and base vs. instruction-tuned model pairs. It reports that most models produce finite bids resembling humans in the original game, but shift toward conditionally rational responses under variants; human-cue prompting and instruction tuning lower bids without substantially altering most mechanism-level patterns. The central claim is that this demonstrates surface-level behavioral alignment without human-consistent mechanisms.

Significance. If the central claim holds after addressing verification gaps, the work would be significant for LLM alignment research by providing a controlled testbed (the St. Petersburg game) to separate outcome resemblance from mechanism consistency. The use of multiple controlled variants and paired base/instruction-tuned comparisons is a methodological strength that enables falsifiable distinctions. This could usefully inform high-stakes decision evaluations by shifting focus beyond surface outputs.

major comments (3)
  1. [Methods, prompt suite] Methods section (prompt suite description): The claim that variants and the human-perspective prompt successfully isolate mechanism-level differences (rather than inducing surface output shifts via in-context imitation) is load-bearing for the surface-vs-mechanism conclusion, yet the manuscript provides no independent verification such as logit inspection, attention analysis, or consistency metrics across prompt phrasings.
  2. [Results] Results section: The assertion that 'most mechanism-level response patterns remain largely unchanged' under human-cue prompting and instruction tuning lacks quantitative support (e.g., percentage of models exhibiting change, statistical tests, or effect sizes), making it difficult to evaluate whether the patterns are truly stable or merely described qualitatively.
  3. [Results and discussion] Results and discussion: No human participant data are collected on the variant prompts (truncation, repeated play, endowment, identity), so the reference point for 'human-consistent mechanisms' is limited to the original game; this weakens the interpretation that model shifts indicate divergence from human mechanisms under the same conditions.
minor comments (2)
  1. [Methods] The abstract states that 28 models were evaluated but the methods should explicitly list the models (open vs. closed source) and any exclusion criteria to support reproducibility.
  2. [Results] Figure or table presenting per-model bids or pattern classifications would benefit from error bars or confidence intervals given the stochastic nature of LLM outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential significance of the work for LLM alignment research. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [Methods, prompt suite] Methods section (prompt suite description): The claim that variants and the human-perspective prompt successfully isolate mechanism-level differences (rather than inducing surface output shifts via in-context imitation) is load-bearing for the surface-vs-mechanism conclusion, yet the manuscript provides no independent verification such as logit inspection, attention analysis, or consistency metrics across prompt phrasings.

    Authors: We agree that internal analyses such as logit inspection or attention maps would provide stronger direct evidence for mechanism isolation. Our design instead relies on observable behavioral shifts under controlled prompt perturbations to distinguish surface resemblance from deeper alignment. We will revise the methods and discussion to explicitly acknowledge the absence of internal verification as a limitation and to elaborate on how the multi-variant design reduces the likelihood of simple in-context imitation. This constitutes a partial revision, as we cannot add new internal-model experiments without substantial additional work. revision: partial

  2. Referee: [Results] Results section: The assertion that 'most mechanism-level response patterns remain largely unchanged' under human-cue prompting and instruction tuning lacks quantitative support (e.g., percentage of models exhibiting change, statistical tests, or effect sizes), making it difficult to evaluate whether the patterns are truly stable or merely described qualitatively.

    Authors: The referee correctly identifies that the stability claim is currently qualitative. In the revised manuscript we will add quantitative support by reporting, for each mechanism-level pattern, the percentage of models that exhibit change versus stability under the human-cue and instruction-tuning conditions, together with any applicable statistical comparisons or effect-size measures permitted by the data. revision: yes

  3. Referee: [Results and discussion] Results and discussion: No human participant data are collected on the variant prompts (truncation, repeated play, endowment, identity), so the reference point for 'human-consistent mechanisms' is limited to the original game; this weakens the interpretation that model shifts indicate divergence from human mechanisms under the same conditions.

    Authors: This is a fair observation. Human benchmarks for the original game rest on a large existing literature, while the variants are motivated by theoretical predictions from decision theory. We will revise the results and discussion sections to clarify that mechanism-level comparisons for the variants are grounded in theoretical human expectations rather than newly collected empirical data, and we will note this boundary condition explicitly. New human data collection on the variants would strengthen the claims but lies beyond the scope of the present study. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical prompt study with direct observations

full rationale

The paper is a purely empirical investigation that evaluates 28 LLMs on St. Petersburg game variants via structured prompts and reports observed output patterns. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any claimed result to its own inputs by construction. All findings are presented as direct measurements from prompt responses, with no load-bearing theoretical steps that invoke uniqueness theorems, ansatzes, or renamed known results. The central claim rests on the experimental design itself rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical behavioral study; no mathematical free parameters, axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.1-grok · 5800 in / 1099 out tokens · 27877 ms · 2026-06-28T06:03:46.683752+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Econometrica , volume =

    Bernoulli, Daniel , title =. Econometrica , volume =. 1954 , doi =

  2. [2]

    Econometrica , volume =

    Tversky, Amos and Kahneman, Daniel , title =. Econometrica , volume =. 1979 , doi =

  3. [3]

    and Johnson, Eric J

    Thaler, Richard H. and Johnson, Eric J. , title =. Management Science , volume =. 1990 , doi =

  4. [4]

    Journal of Risk and Uncertainty , volume =

    Seidl, Christian , title =. Journal of Risk and Uncertainty , volume =. 2013 , doi =

  5. [5]

    and Kroll, Eike B

    Cox, James C. and Kroll, Eike B. and Lichters, Marcel and Sadiraj, Vjollca and Vogt, Bodo , title =. Business Research , volume =. 2019 , doi =

  6. [6]

    Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume =

    Peters, Ole , title =. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume =. 2011 , doi =

  7. [7]

    Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F. and Leike, Jan and L...

  8. [8]

    Nature Computational Science , volume =

    Hagendorff, Thilo and Fabi, Sarah and Kosinski, Michal , title =. Nature Computational Science , volume =. 2023 , doi =

  9. [9]

    Cognitive Bias in Decision-Making with

    Echterhoff, Jessica Maria and Liu, Yao and Alessa, Abeer and McAuley, Julian and He, Zexue , booktitle =. Cognitive Bias in Decision-Making with. 2024 , month = nov, address =. doi:10.18653/v1/2024.findings-emnlp.739 , pages =

  10. [10]

    How Personality Traits Shape

    Hartley, John and Hamill, Conor Brian and Seddon, Dale and Batra, Devesh and Okhrati, Ramin and Khraishi, Raad , editor =. How Personality Traits Shape. Findings of the Association for Computational Linguistics: ACL 2025 , month = jul, year =. doi:10.18653/v1/2025.findings-acl.1085 , pages =

  11. [11]

    May 2024 State Occupational Employment and Wage Estimates , year =

  12. [12]

    and Kroll, Eike B

    Cox, James C. and Kroll, Eike B. and Lichters, Marcel and Sadiraj, Vjollca and Vogt, Bodo , journal =. The. 2019 , doi =

  13. [13]

    Econometrica , volume =

    Prospect Theory: An Analysis of Decision under Risk , author =. Econometrica , volume =. 1979 , doi =

  14. [14]

    Science , volume =

    The Framing of Decisions and the Psychology of Choice , author =. Science , volume =. 1981 , doi =

  15. [15]

    Management Science , volume =

    Gambling with the House Money and Trying to Break Even: The Effects of Prior Outcomes on Risky Choice , author =. Management Science , volume =. 1990 , doi =

  16. [16]

    2022 , eprint =

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author =. 2022 , eprint =

  17. [17]

    and Hatfield-Dodds, Zac and Mann, Ben and Amodei, Dario and Joseph, Nicholas and McCandlish, Sam and Brown, Tom and Kaplan, Jared , year =

    Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan an...

  18. [18]

    Minds and Machines , volume =

    Artificial Intelligence, Values, and Alignment , author =. Minds and Machines , volume =. 2020 , doi =

  19. [19]

    2024 , eprint =

    A Roadmap to Pluralistic Alignment , author =. 2024 , eprint =

  20. [20]

    V alue B ench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models

    Ren, Yuanyi and Ye, Haoran and Fang, Hanjun and Zhang, Xin and Song, Guojie. V alue B ench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.111

  21. [21]

    Nature Machine Intelligence , volume=

    The benefits, risks and bounds of personalizing the alignment of large language models to individuals , author=. Nature Machine Intelligence , volume=. 2024 , publisher=

  22. [22]

    Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages =

    Lyu, Hanjia and Luo, Jiebo and Kang, Jian and Koenecke, Allison , title =. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages =. 2025 , isbn =. doi:10.1145/3715275.3732182 , abstract =

  23. [23]

    Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'

    Allais, Maurice , journal=. Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'. 1953 , publisher=

  24. [24]

    The quarterly journal of economics , volume=

    Risk, ambiguity, and the Savage axioms , author=. The quarterly journal of economics , volume=. 1961 , publisher=

  25. [25]

    arXiv preprint arXiv:2404.09690 , year=

    Harnessing gpt-4v (ision) for insurance: A preliminary exploration , author=. arXiv preprint arXiv:2404.09690 , year=

  26. [26]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  27. [27]

    Available at SSRN 5574261 , year=

    Fairness, Value Alignment, and Governance in LLMs: The Case of Insurance Pricing , author=. Available at SSRN 5574261 , year=

  28. [28]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Investorbench: A benchmark for financial decision-making tasks with llm-based agent , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  29. [29]

    arXiv preprint arXiv:2510.02209 , year=

    Stockbench: Can llm agents trade stocks profitably in real-world markets? , author=. arXiv preprint arXiv:2510.02209 , year=

  30. [30]

    Nejm Ai , volume=

    MedAgentBench: a virtual EHR environment to benchmark medical LLM agents , author=. Nejm Ai , volume=. 2025 , publisher=

  31. [31]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    CliMedBench: a large-scale Chinese benchmark for evaluating medical large language models in clinical scenarios , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  32. [32]

    and Platt, Michael L

    Hayden, Benjamin Y. and Platt, Michael L. , journal =. The Mean, the Median, and the. 2009 , doi =

  33. [33]

    LLM -Rec: Personalized Recommendation via Prompting Large Language Models

    Lyu, Hanjia and Jiang, Song and Zeng, Hanqing and Xia, Yinglong and Wang, Qifan and Zhang, Si and Chen, Ren and Leung, Chris and Tang, Jiajie and Luo, Jiebo. LLM -Rec: Personalized Recommendation via Prompting Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.39

  34. [34]

    Journal of Business and Psychology , volume =

    Financial Risk Tolerance and Additional Factors That Affect Risk Taking in Everyday Money Matters , author =. Journal of Business and Psychology , volume =. 2000 , doi =

  35. [35]

    Companion Proceedings of the ACM on Web Conference 2025 , pages =

    Qi, Weihong and Lyu, Hanjia and Luo, Jiebo , title =. Companion Proceedings of the ACM on Web Conference 2025 , pages =. 2025 , isbn =. doi:10.1145/3701716.3715591 , abstract =