Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment
Pith reviewed 2026-06-28 17:10 UTC · model grok-4.3
The pith
Large language models over-reveal private information in advisor roles by 1.8 to 4.2 times the level predicted by cheap-talk theory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the sender's ideal point is shifted from the receiver's by bias b, the four tested models produce messages whose normalized mutual information with the hidden state remains between 0.78 and 0.94 for every b in the tested grid, while the most-informative equilibrium requires values between 0.18 and 0.53; the models therefore transmit between 1.8 and 4.2 times more information than the strategic optimum allows, manifesting as near-full revelation accompanied by linear exaggeration rather than the coarse partitions theory requires.
What carries the argument
The Crawford-Sobel cheap-talk sender-receiver game, in which a sender observes a state omega in [0,1] and sends one costless message to a receiver whose action is chosen to match omega while the sender prefers action omega plus bias b.
If this is right
- Informativeness declines with larger bias but plateaus well above the equilibrium partition size.
- Switching the prompt frame from payoff maximization to explicit honesty instructions produces no measurable change in revelation level.
- The over-revelation pattern is recoverable only when the receiver is given the model's stated number; embedding-only decoding collapses the signal.
- Linear exaggeration of the state, rather than interval partitioning, is the dominant observed strategy across all models and bias levels.
Where Pith is reading between the lines
- Alignment evaluations that rely on stated preferences alone may miss the systematic upward bias these models introduce when transmitting numbers.
- The same benchmark design can be reused with different state spaces or multi-round interactions to test whether repeated play induces the coarse partitions theory expects.
- If downstream users treat the model's stated number as ground truth, the documented exaggeration will systematically shift their actions away from their own optimum by a fixed fraction of the bias.
Load-bearing premise
That the single numerical value a model writes in its message can be read by a rational receiver as the literal signal whose information content is being measured.
What would settle it
An experiment in which the same messages are fed to a receiver that sees only vector embeddings of the text and never the explicit number the model wrote; the ablation already shows this decoder recovers only near-babbling performance.
read the original abstract
Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript adapts the Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Using 5 bias levels (b=0.01 to 0.12), 3 prompt frames, fixed low temperature, and 200 states per cell (12,000 total trials), it compares observed normalized mutual information (NMI) from four instruction-tuned models against independently computed oracle values for the most-informative equilibria (partition sizes 7/4/3/2, oracle NMI 0.5294/0.3268/0.2205/0.1829). The central finding is that all models over-reveal (NMI 0.78-0.94) with a linear-exaggeration pattern rather than coarse partitions, and that framing has negligible effect; a decoder ablation confirms the result depends on the stated numerical message.
Significance. If the results hold, the work supplies a theory-grounded, pre-specified benchmark with independently derived oracle baselines and explicit controls (multiple frames, bias levels, decoder ablation). This is a concrete strength for alignment evaluation, as the comparison rests on no parameters fitted to LLM outputs and yields a falsifiable prediction (over-revelation with constant offset). The design allows direct testing of whether future models approach the strategic optimum.
major comments (2)
- [§3 (Oracle computation)] §3 (Oracle computation): The manuscript states the exact partition sizes and NMI values for each b but provides no derivation, formula reference, or verification step showing these match the Crawford-Sobel equilibrium conditions (e.g., the indifference condition at partition boundaries for the given bias). This verification is load-bearing for the quantitative over-revelation claim (1.8-4.2x).
- [Methods (Prompt elicitation)] Methods (Prompt elicitation): The central comparison treats the model's stated numerical value as the literal cheap-talk signal. While the decoder ablation supports recoverability from the number alone, the manuscript does not supply the exact prompt templates used across the three frames, which is required to confirm the elicitation produces a single usable scalar rather than hedging.
minor comments (2)
- The manuscript should include a link to the full prompt templates and analysis code to enable independent reproduction of the 12,000-trial design and oracle NMI calculations.
- Notation for normalized mutual information (NMI) should be defined on first use in the main text, with a brief reminder of its range and interpretation relative to the oracle partitions.
Simulated Author's Rebuttal
We thank the referee for the constructive report and the recommendation for minor revision. The two major comments identify points where additional transparency will strengthen the manuscript; we address each below and will revise accordingly.
read point-by-point responses
-
Referee: [§3 (Oracle computation)] The manuscript states the exact partition sizes and NMI values for each b but provides no derivation, formula reference, or verification step showing these match the Crawford-Sobel equilibrium conditions (e.g., the indifference condition at partition boundaries for the given bias). This verification is load-bearing for the quantitative over-revelation claim (1.8-4.2x).
Authors: We agree that explicit verification of the oracle equilibria is required. In the revised manuscript we will expand §3 with a dedicated derivation subsection that (i) recalls the Crawford-Sobel indifference condition at each partition boundary, (ii) states the closed-form expressions for the most-informative partition sizes and the resulting normalized mutual information for the four bias values, and (iii) supplies the numerical verification steps that produce the reported oracle NMI values 0.5294/0.3268/0.2205/0.1829. A reference to Crawford & Sobel (1982) will be added. revision: yes
-
Referee: [Methods (Prompt elicitation)] The central comparison treats the model's stated numerical value as the literal cheap-talk signal. While the decoder ablation supports recoverability from the number alone, the manuscript does not supply the exact prompt templates used across the three frames, which is required to confirm the elicitation produces a single usable scalar rather than hedging.
Authors: We accept that the exact prompt wording must be provided for reproducibility. The revised Methods section (or a new appendix) will contain the verbatim templates for all three frames. Each template is written to elicit a single numerical message; the decoder ablation already shows that only the numerical token sequence carries the information used by the receiver. revision: yes
Circularity Check
No significant circularity
full rationale
The paper's central comparison compares LLM-generated messages (via NMI with true states) against independently pre-computed oracle values from the standard Crawford-Sobel cheap-talk equilibrium for given bias levels b. These oracle partition sizes (7,4,3,2) and NMI figures (0.5294 etc.) are derived from the external theoretical model and stated as fixed inputs to the experimental design; no parameters are estimated from the LLM outputs themselves. The decoder ablation further isolates that the measured signal is the explicit numerical message. No self-citation chain, fitted-input-as-prediction, or self-definitional step appears in the derivation; the empirical result is therefore falsifiable against the external benchmark and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Crawford-Sobel cheap-talk setup with costless messages and quadratic payoffs accurately models the LLM advisory interaction when bias b is injected via prompt.
Forward citations
Cited by 1 Pith paper
-
Do Large Language Model Voters Strategize? An Oracle-Based Benchmark for Manipulation under Voting Rules
Introduces an oracle benchmark supplying exact ground truth on LLM strategic manipulation rates across five voting rules using 600 election instances.
Reference graph
Works this paper leans on
-
[1]
J., Bethge, M., and Schulz, E
Akata, E., Schulz, L., Coda-Forno, J., Oh, S. J., Bethge, M., and Schulz, E. (2025). Playing repeated games with large language models.Nature Human Behaviour, 9:1380–1390
2025
-
[2]
Amiri-Margavi, A., Gharagozlou, A., Gholami Davodi, A., Mousavi Davoudi, S. P., and Hasani Balyani, H. (2026). Equal access, unequal interaction: A counterfactual audit of LLM fairness. arXiv preprint arXiv:2602.02932
- [3]
-
[4]
Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[5]
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
K., and Lim, W
Blume, A., Lai, E. K., and Lim, W. (2020). Strategic information transmission: a survey of experiments and theoretical foundations. InHandbook of Experimental Game Theory, pp. 311–347
2020
-
[7]
and Wang, J
Cai, H. and Wang, J. T.-Y. (2006). Overcommunication in strategic information transmission games.Games and Economic Behavior, 56(1):7–36. 18
2006
-
[8]
Chen, Y., Kartik, N., and Sobel, J. (2008). Selecting cheap-talk equilibria.Econometrica, 76(1):117–136
2008
-
[9]
Condorelli, D. and Furlan, M. (2024). Cheap talking algorithms.arXiv preprint arXiv:2310.07867
-
[10]
Gholami Davoodi, A., Mousavi Davoudi, S. P., and Pezeshkpour, P. (2025). LLMs are not intelligent thinkers: Introducing mathematical topic tree benchmark for comprehensive evaluation of LLMs.Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). arXiv:2406.05194
-
[11]
Gholami Davoodi, A., Rezazadeh, N., Mousavi Davoudi, S. P., and Pezeshkpour, P. (2026). Geometry-aware decoding with Wasserstein-regularized truncation and mass penalties for large language models.arXiv preprint arXiv:2602.10346
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
Amiri-Margavi, A., Jebellat, I., Jebellat, E., and Mousavi Davoudi, S. P. (2025). Enhancing answer reliability through inter-model consensus of large language models.Artificial Intelligence Applications and Innovations (AIAI 2025), IFIP Advances in Information and Communication Technology, vol. 758. arXiv:2411.16797
-
[13]
P., Gholami Davodi, A., Amiri-Margavi, A., and Jafari, M
Mousavi Davoudi, S. P., Gholami Davodi, A., Amiri-Margavi, A., and Jafari, M. (2025). Collective reasoning among LLMs: A framework for answer validation without ground truth. arXiv preprint arXiv:2502.20758
-
[14]
Crawford, V. P. and Sobel, J. (1982). Strategic information transmission.Econometrica, 50(6):1431–1451
1982
-
[15]
Crawford, V. P. (1998). A survey of experiments on communication via cheap talk.Journal of Economic Theory, 78(2):286–298
1998
-
[16]
Fan, C., Chen, J., Jin, Y., and He, H. (2024). Can large language models serve as rational players in game theory? A systematic analysis.Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17960–17967
2024
-
[17]
and Rabin, M
Farrell, J. and Rabin, M. (1996). Cheap talk.Journal of Economic Perspectives, 10(3):103–118
1996
-
[18]
J., Filippas, A., and Manning, B
Horton, J. J., Filippas, A., and Manning, B. S. (2023, revised 2026). Large language models as simulated economic agents: What can we learn from Homo Silicus?NBER Working Paper No. 31122
2023
-
[19]
Lin, S., Hilton, J., and Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods.Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 3214–3252
2022
-
[20]
Sharma, M., Tong, M., Korbak, T., et al. (2024). Towards understanding sycophancy in language models.International Conference on Learning Representations (ICLR)
2024
-
[21]
and Takizawa, H
Kawagoe, T. and Takizawa, H. (2009). Equilibrium refinement vs. level-k analysis: An experimental study of cheap-talk games with private information.Games and Economic Behavior, 66(1):238–255
2009
-
[22]
and Heydari, B
Lorè, N. and Heydari, B. (2024). Strategic behavior of large language models and the role of game structure versus contextual framing.Scientific Reports, 14:18490. 19
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.