Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Alireza Amiri-Margavi; Amin Gholami Davodi; Arshia Gharagozlou; Hamidreza Hasani Balyani; Seyed Pouyan Mousavi Davoudi

arxiv: 2606.01456 · v1 · pith:6EFIWB6Lnew · submitted 2026-05-31 · 💻 cs.LG · cs.CL · cs.GT

Pith reviewed 2026-06-28 17:10 UTC · model grok-4.3

keywords LLM honesty · cheap talk · preference misalignment · information revelation · AI alignment benchmark · sender-receiver game

The pith

In advisor roles, large language models reveal 1.8 to 4.2 times as much private information as cheap-talk theory predicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the Crawford-Sobel sender-receiver game into a fixed benchmark that measures how much information an LLM advisor transmits when its preferred action differs from the user's by a known bias. Theory predicts that a rational sender will use coarse monotone partitions whose size shrinks as bias grows, achieving only modest normalized mutual information between message and state. The design runs four models across five bias values, three prompt frames, and two hundred states each, producing twelve thousand messages. All four models instead transmit near-continuous information with a constant upward offset that tracks their bias, never settling into the predicted partitions. The gap persists under both payoff-maximizing and honesty prompts and disappears only when the receiver is denied the explicit numerical claim in the message.
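The size of the design can be checked by enumerating its cells. The sketch below is illustrative only: the model names and the positive bias values come from the abstract, while the fifth bias level (taken here as b = 0), the third frame label, and the evenly spaced state grid are assumptions that the source does not confirm.

```python
from itertools import product

# Values taken from the abstract.
MODELS = ["GPT-4o", "Claude Sonnet 4.5", "Gemini 2.5 Flash-Lite", "Llama-3.3-70B"]
POSITIVE_BIASES = [0.01, 0.04, 0.08, 0.12]
STATES_PER_CELL = 200

# Assumptions (not stated in the source): the fifth bias level is b = 0,
# the third frame label is a placeholder, and states sit on an even grid.
BIASES = [0.0] + POSITIVE_BIASES
FRAMES = ["payoff_maximizing", "honesty", "third_frame_placeholder"]
STATES = [(i + 0.5) / STATES_PER_CELL for i in range(STATES_PER_CELL)]


def design_cells():
    """Return an iterator with one (model, bias, frame, state) tuple per sender call."""
    return product(MODELS, BIASES, FRAMES, STATES)


calls = list(design_cells())
per_model = len(calls) // len(MODELS)
print(len(calls), per_model)
```

Under these assumptions the grid has 5 x 3 x 200 = 3,000 sender calls per model, which matches the reported total of 12,000 across the four models.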

Core claim

When the sender's ideal point is shifted from the receiver's by bias b, the four tested models produce messages whose normalized mutual information (NMI) with the hidden state stays between 0.78 and 0.94 for every b in the tested grid, whereas the most-informative equilibrium yields values between 0.18 and 0.53. The models therefore transmit 1.8 to 4.2 times as much information as the strategic optimum allows. The observed pattern is near-full revelation with linear exaggeration rather than the coarse partitions theory requires.

What carries the argument

The Crawford-Sobel cheap-talk sender-receiver game, in which a sender observes a state omega in [0,1] and sends one costless message to a receiver whose action is chosen to match omega while the sender prefers action omega plus bias b.

If this is right

Informativeness declines as bias grows but stays well above the level of the most-informative equilibrium partition.
Switching the prompt frame from payoff maximization to explicit honesty instructions produces no measurable change in revelation level.
The over-revelation pattern is recoverable only when the receiver is given the model's stated number; embedding-only decoding collapses the signal.
Linear exaggeration of the state, rather than interval partitioning, is the dominant observed strategy across all models and bias levels.
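The contrast between linear exaggeration and interval partitioning can be made concrete with two synthetic senders. This is a minimal sketch, not the paper's pipeline: it assumes a uniform state grid, the uniform-quadratic partition boundaries, 20 equal-width bins, and NMI defined as I(S;M)/H(S). None of these choices is confirmed by the source, and the resulting numbers are not comparable to the reported values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (nats) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(states, messages):
    """I(S;M) / H(S) for discrete labels (assumed normalization)."""
    h_s = entropy(states)
    h_joint = entropy(list(zip(states, messages)))
    return (h_s + entropy(messages) - h_joint) / h_s


def bin_index(x, k):
    """Map x in [0, 1] to one of k equal-width bins."""
    return min(int(x * k), k - 1)


def partition_boundaries(b, n):
    """Assumed uniform-quadratic boundaries a_i = i/n + 2*b*i*(i - n)."""
    return [i / n + 2 * b * i * (i - n) for i in range(n + 1)]


def partition_message(omega, bounds):
    """Index of the partition interval that contains omega."""
    for i in range(1, len(bounds)):
        if omega <= bounds[i]:
            return i - 1
    return len(bounds) - 2


K = 20          # hypothetical bin count for states and stated numbers
b, n = 0.08, 3  # bias and partition size as listed in the abstract
omegas = [(i + 0.5) / 200 for i in range(200)]
state_bins = [bin_index(w, K) for w in omegas]

# Sender 1: states the true value plus a constant offset b (capped at 1).
linear_msgs = [bin_index(min(w + b, 1.0), K) for w in omegas]
# Sender 2: reports only which equilibrium interval the state falls in.
bounds = partition_boundaries(b, n)
coarse_msgs = [partition_message(w, bounds) for w in omegas]

nmi_linear = nmi(state_bins, linear_msgs)
nmi_coarse = nmi(state_bins, coarse_msgs)
print(round(nmi_linear, 3), round(nmi_coarse, 3))
```

The offset sender keeps most of the state information because a constant shift is nearly invertible, while the three-interval sender cannot exceed the entropy of a three-symbol message. The absolute values depend on the assumed bin count.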

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment evaluations that rely on stated preferences alone may miss the systematic upward bias these models introduce when transmitting numbers.
The same benchmark design can be reused with different state spaces or multi-round interactions to test whether repeated play induces the coarse partitions theory expects.
If downstream users treat the model's stated number as ground truth, the documented exaggeration will systematically shift their actions away from their own optimum by a fixed fraction of the bias.
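The last point can be written out as a short calculation. The following is a sketch under assumptions not taken from the paper: the message is modeled as the state plus a fraction $\kappa$ of the bias ($\kappa$ is a hypothetical symbol), the receiver has quadratic loss, and clipping at the top of the state range is ignored.

```latex
% Assumed message rule: state plus a fixed fraction \kappa of the bias b.
m(\omega) = \omega + \kappa b, \qquad 0 \le \kappa \le 1
% A naive receiver acts on the stated number directly:
a_{\text{naive}} = m(\omega)
\;\Longrightarrow\;
a_{\text{naive}} - \omega = \kappa b,
\qquad
\bigl(a_{\text{naive}} - \omega\bigr)^2 = \kappa^2 b^2
% A receiver who knows the offset can undo it:
a_{\text{debiased}} = m(\omega) - \kappa b = \omega
```

Under this reading the naive receiver's loss is the same for every state, and it vanishes once the constant offset is subtracted.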

Load-bearing premise

That the single numerical value a model writes in its message can be read by a rational receiver as the literal signal whose information content is being measured.
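One way to make this premise checkable is to extract the stated number mechanically and reject messages that hedge. The function below is a hypothetical sketch, not the paper's parser: `stated_number` and its rule (accept a message only if it names exactly one value in [0, 1]) are assumptions introduced for illustration.

```python
import re

# Matches decimals such as "0.62" or ".5" and bare integers such as "1".
NUMBER = re.compile(r"\d*\.\d+|\d+")


def stated_number(message):
    """Return the single value in [0, 1] named in the message, else None.

    Messages naming zero or several in-range values are treated as
    hedging and rejected (an assumed rule, not the paper's).
    """
    values = [float(tok) for tok in NUMBER.findall(message)]
    in_range = [v for v in values if 0.0 <= v <= 1.0]
    return in_range[0] if len(in_range) == 1 else None


clear = stated_number("I recommend acting as if the state is 0.62.")
hedged = stated_number("It is somewhere between 0.4 and 0.6.")
empty = stated_number("I would rather not say.")
print(clear, hedged, empty)
```

The share of messages for which such a parser returns `None` would be a direct measure of how often the elicitation fails to produce a usable scalar.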

What would settle it

An experiment in which the same messages are fed to a receiver that sees only vector embeddings of the text and never the explicit number the model wrote; the ablation already shows this decoder recovers only near-babbling performance.

Original abstract

Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.
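The partition sizes quoted in the abstract can be checked against the textbook uniform-quadratic case of the Crawford-Sobel model. The sketch below assumes a uniform prior on [0,1] and quadratic losses, under which an N-interval equilibrium exists exactly when 2N(N-1)b < 1. Whether the paper uses this specification is an assumption, and the function names are hypothetical.

```python
def max_partition_size(b):
    """Largest N with 2*N*(N-1)*b < 1 (uniform prior, quadratic loss assumed)."""
    if b <= 0:
        raise ValueError("bias must be positive; b = 0 permits full revelation")
    n = 1
    # Grow N while an equilibrium with N + 1 intervals is still feasible.
    while 2 * (n + 1) * n * b < 1:
        n += 1
    return n


def first_interval_width(b, n):
    """Width of the lowest interval, a_1 = 1/n - 2*(n - 1)*b."""
    return 1 / n - 2 * (n - 1) * b


biases = (0.01, 0.04, 0.08, 0.12)
sizes = [max_partition_size(b) for b in biases]
widths = [first_interval_width(b, n) for b, n in zip(biases, sizes)]
print(sizes)
```

Under these assumptions the computed sizes agree with the 7, 4, 3, 2 stated in the abstract. The oracle NMI values are not reproduced here because the normalization used in the paper is not given in this text.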

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper converts the Crawford-Sobel model into a pre-specified LLM benchmark and shows consistent over-revelation across four models.

The main thing to know is that this work takes the classic cheap-talk setup with bias b and turns it into a concrete test: five bias levels, exact oracle partition sizes and NMI values computed from the model, then 12,000 trials on GPT-4o, Claude, Gemini, and Llama. All four models produce NMI of 0.78-0.94 where the most-informative equilibrium predicts 0.18-0.53, with a linear exaggeration pattern instead of the expected coarse partitions.

What is actually new is the explicit oracle construction for the chosen bias grid and the controlled comparison that stays parameter-free. The design is pre-specified, uses three prompt frames, fixed temperature, and includes a decoder ablation that shows the result depends on the stated number rather than embeddings. That ablation is useful evidence.

The soft spots are limited. The assumption that the numerical output functions as the cheap-talk signal is tested by the ablation, though it still requires the prompt to elicit a single usable number rather than hedging. Exact prompt text and code are not in the abstract, which makes immediate reproduction harder, but the reported numbers line up internally and the direction of the bias effect matches theory.

This is for people building or evaluating honesty benchmarks for misaligned advisors. Readers who want a game-theoretic baseline with reproducible oracles will get direct value; the quantitative mapping is the useful part.

It deserves a serious referee. The central claim rests on an independent equilibrium calculation rather than post-hoc fitting, and the controls are reasonable. I would send it for review.

Referee Report

2 major / 2 minor

Summary. The manuscript adapts the Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Using 5 bias levels (positive grid b = 0.01 to 0.12), 3 prompt frames, fixed low temperature, and 200 states per cell (12,000 total trials), it compares observed normalized mutual information (NMI) from four instruction-tuned models against independently computed oracle values for the most-informative equilibria (partition sizes 7/4/3/2, oracle NMI 0.5294/0.3268/0.2205/0.1829). The central finding is that all models over-reveal (NMI 0.78-0.94) with a linear-exaggeration pattern rather than coarse partitions, and that framing has negligible effect; a decoder ablation confirms the result depends on the stated numerical message.

Significance. If the results hold, the work supplies a theory-grounded, pre-specified benchmark with independently derived oracle baselines and explicit controls (multiple frames, bias levels, decoder ablation). This is a concrete strength for alignment evaluation, as the comparison rests on no parameters fitted to LLM outputs and yields a falsifiable prediction (over-revelation with constant offset). The design allows direct testing of whether future models approach the strategic optimum.

major comments (2)

[§3 (Oracle computation)] The manuscript states the exact partition sizes and NMI values for each b but provides no derivation, formula reference, or verification step showing these match the Crawford-Sobel equilibrium conditions (e.g., the indifference condition at partition boundaries for the given bias). This verification is load-bearing for the quantitative over-revelation claim (1.8-4.2x).
[Methods (Prompt elicitation)] The central comparison treats the model's stated numerical value as the literal cheap-talk signal. While the decoder ablation supports recoverability from the number alone, the manuscript does not supply the exact prompt templates used across the three frames, which is required to confirm the elicitation produces a single usable scalar rather than hedging.
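For orientation, the indifference condition mentioned in the first comment takes the following form in the standard uniform-quadratic case. This is the textbook derivation, given as a sketch under the assumption of a uniform prior on [0,1] and quadratic losses; it is not taken from the manuscript.

```latex
% Assumed payoffs: receiver -(a-\omega)^2, sender -(a-\omega-b)^2, \omega \sim U[0,1].
% The receiver's best reply to the interval [a_{i-1}, a_i] is its midpoint.
% The boundary type a_i is indifferent between the two adjacent midpoints:
(a_i + b) - \frac{a_{i-1} + a_i}{2} = \frac{a_i + a_{i+1}}{2} - (a_i + b)
\;\Longrightarrow\;
a_{i+1} - a_i = (a_i - a_{i-1}) + 4b
% With a_0 = 0 and a_N = 1 the difference equation solves to
a_i = \frac{i}{N} + 2b\, i\,(i - N), \qquad i = 0, \dots, N
% Feasibility requires a_1 > 0:
2N(N-1)\,b < 1
% For b = 0.01, 0.04, 0.08, 0.12 this gives N(N-1) < 50,\ 12.5,\ 6.25,\ 4.17,
% hence N = 7,\ 4,\ 3,\ 2.
```

Under these assumptions the bound reproduces the partition sizes stated in the manuscript; the oracle NMI values additionally depend on the normalization, which this text does not specify.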

minor comments (2)

The manuscript should include a link to the full prompt templates and analysis code to enable independent reproduction of the 12,000-trial design and oracle NMI calculations.
Notation for normalized mutual information (NMI) should be defined on first use in the main text, with a brief reminder of its range and interpretation relative to the oracle partitions.
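As an illustration of what such a definition could look like, one common normalization divides the mutual information by the entropy of the (discretized) state. This is an assumed form; the manuscript's own normalization is not stated in this text, and the symbols $S$ and $M$ are introduced here for illustration.

```latex
% S: state discretized into bins; M: message (assumed notation).
I(S;M) = H(S) + H(M) - H(S,M)
\qquad
\mathrm{NMI}(S;M) = \frac{I(S;M)}{H(S)} \in [0, 1]
% NMI = 0: babbling (message independent of state).
% NMI = 1: full revelation at the resolution of the bins.
% For a deterministic partition sender, I(S;M) = H(M), so a partition with
% N intervals can carry at most \log N of information.
```

Other conventions normalize by $\sqrt{H(S)H(M)}$ or by the mean of the two entropies, which changes the numerical values but not the endpoints of the range.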

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and the recommendation for minor revision. The two major comments identify points where additional transparency will strengthen the manuscript; we address each below and will revise accordingly.

Referee: [§3 (Oracle computation)] The manuscript states the exact partition sizes and NMI values for each b but provides no derivation, formula reference, or verification step showing these match the Crawford-Sobel equilibrium conditions (e.g., the indifference condition at partition boundaries for the given bias). This verification is load-bearing for the quantitative over-revelation claim (1.8-4.2x).

Authors: We agree that explicit verification of the oracle equilibria is required. In the revised manuscript we will expand §3 with a dedicated derivation subsection that (i) recalls the Crawford-Sobel indifference condition at each partition boundary, (ii) states the closed-form expressions for the most-informative partition sizes and the resulting normalized mutual information for the four bias values, and (iii) supplies the numerical verification steps that produce the reported oracle NMI values 0.5294/0.3268/0.2205/0.1829. A reference to Crawford & Sobel (1982) will be added.
revision: yes

Referee: [Methods (Prompt elicitation)] The central comparison treats the model's stated numerical value as the literal cheap-talk signal. While the decoder ablation supports recoverability from the number alone, the manuscript does not supply the exact prompt templates used across the three frames, which is required to confirm the elicitation produces a single usable scalar rather than hedging.

Authors: We accept that the exact prompt wording must be provided for reproducibility. The revised Methods section (or a new appendix) will contain the verbatim templates for all three frames. Each template is written to elicit a single numerical message; the decoder ablation already shows that only the numerical token sequence carries the information used by the receiver.
revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper's central comparison sets LLM-generated messages (via their NMI with the true states) against independently pre-computed oracle values from the standard Crawford-Sobel cheap-talk equilibrium for the given bias levels b. These oracle partition sizes (7, 4, 3, 2) and NMI figures (0.5294 etc.) are derived from the external theoretical model and stated as fixed inputs to the experimental design; no parameters are estimated from the LLM outputs themselves. The decoder ablation further isolates the explicit numerical message as the measured signal. No self-citation chain, fitted-input-as-prediction, or self-definitional step appears in the derivation; the empirical result is therefore falsifiable against the external benchmark and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the validity of mapping LLM text outputs to messages in the Crawford-Sobel sender-receiver game and on the correctness of the pre-computed most-informative partitions for each bias level.

axioms (1)

domain assumption: The Crawford-Sobel cheap-talk setup with costless messages and quadratic payoffs accurately models the LLM advisory interaction when bias b is injected via prompt.
Invoked throughout the design to justify the oracle partition sizes and NMI targets.

pith-pipeline@v0.9.1-grok · 5962 in / 1375 out tokens · 34852 ms · 2026-06-28T17:10:26.986189+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Large Language Model Voters Strategize? An Oracle-Based Benchmark for Manipulation under Voting Rules
cs.GT 2026-06 unverdicted novelty 7.0

Introduces an oracle benchmark supplying exact ground truth on LLM strategic manipulation rates across five voting rules using 600 election instances.

Reference graph

Works this paper leans on

22 extracted references · 9 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] Akata, E., Schulz, L., Coda-Forno, J., Oh, S. J., Bethge, M., and Schulz, E. (2025). Playing repeated games with large language models. Nature Human Behaviour, 9:1380–1390.
[2] Amiri-Margavi, A., Gharagozlou, A., Gholami Davodi, A., Mousavi Davoudi, S. P., and Hasani Balyani, H. (2026). Equal access, unequal interaction: A counterfactual audit of LLM fairness. arXiv preprint arXiv:2602.02932.
[3] Babichenko, Y., Talgam-Cohen, I., Xu, H., and Zabarnyi, K. (2024). Algorithmic cheap talk. Proceedings of the 25th ACM Conference on Economics and Computation (EC ’24). arXiv:2311.09011.
[4] Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
[5] Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
[6] Blume, A., Lai, E. K., and Lim, W. (2020). Strategic information transmission: a survey of experiments and theoretical foundations. In Handbook of Experimental Game Theory, pp. 311–347.
[7] Cai, H. and Wang, J. T.-Y. (2006). Overcommunication in strategic information transmission games. Games and Economic Behavior, 56(1):7–36.
[8] Chen, Y., Kartik, N., and Sobel, J. (2008). Selecting cheap-talk equilibria. Econometrica, 76(1):117–136.
[9] Condorelli, D. and Furlan, M. (2024). Cheap talking algorithms. arXiv preprint arXiv:2310.07867.
[10] Gholami Davoodi, A., Mousavi Davoudi, S. P., and Pezeshkpour, P. (2025). LLMs are not intelligent thinkers: Introducing mathematical topic tree benchmark for comprehensive evaluation of LLMs. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). arXiv:2406.05194.
[11] Gholami Davoodi, A., Rezazadeh, N., Mousavi Davoudi, S. P., and Pezeshkpour, P. (2026). Geometry-aware decoding with Wasserstein-regularized truncation and mass penalties for large language models. arXiv preprint arXiv:2602.10346.
[12] Amiri-Margavi, A., Jebellat, I., Jebellat, E., and Mousavi Davoudi, S. P. (2025). Enhancing answer reliability through inter-model consensus of large language models. Artificial Intelligence Applications and Innovations (AIAI 2025), IFIP Advances in Information and Communication Technology, vol. 758. arXiv:2411.16797.
[13] Mousavi Davoudi, S. P., Gholami Davodi, A., Amiri-Margavi, A., and Jafari, M. (2025). Collective reasoning among LLMs: A framework for answer validation without ground truth. arXiv preprint arXiv:2502.20758.
[14] Crawford, V. P. and Sobel, J. (1982). Strategic information transmission. Econometrica, 50(6):1431–1451.
[15] Crawford, V. P. (1998). A survey of experiments on communication via cheap talk. Journal of Economic Theory, 78(2):286–298.
[16] Fan, C., Chen, J., Jin, Y., and He, H. (2024). Can large language models serve as rational players in game theory? A systematic analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17960–17967.
[17] Farrell, J. and Rabin, M. (1996). Cheap talk. Journal of Economic Perspectives, 10(3):103–118.
[18] Horton, J. J., Filippas, A., and Manning, B. S. (2023, revised 2026). Large language models as simulated economic agents: What can we learn from Homo Silicus? NBER Working Paper No. 31122.
[19] Lin, S., Hilton, J., and Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 3214–3252.
[20] Sharma, M., Tong, M., Korbak, T., et al. (2024). Towards understanding sycophancy in language models. International Conference on Learning Representations (ICLR).
[21] Kawagoe, T. and Takizawa, H. (2009). Equilibrium refinement vs. level-k analysis: An experimental study of cheap-talk games with private information. Games and Economic Behavior, 66(1):238–255.
[22] Lorè, N. and Heydari, B. (2024). Strategic behavior of large language models and the role of game structure versus contextual framing. Scientific Reports, 14:18490.
