Persuadability and LLMs as Legal Decision Tools
Pith reviewed 2026-05-07 13:31 UTC · model grok-4.3
The pith
Frontier LLMs agree more often with legal positions when those positions are advanced by higher-quality advocates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier open- and closed-weight LLMs respond to legal arguments in ways that depend on the quality of the advocate presenting them, with higher-quality advocacy increasing the likelihood that the model agrees with the advocated position. This pattern holds across the tested models and raises direct questions about whether LLMs can serve as neutral or merit-based decision-makers in contested legal settings.
What carries the argument
Controlled prompt experiments that vary advocate quality while holding the legal claim constant, then measuring the resulting change in model agreement rates.
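That design can be sketched in a few lines. Everything below (the argument text, the descriptor wording, the YES/NO protocol) is an illustrative assumption, not the paper's actual materials:

```python
# Illustrative sketch of the controlled-prompt design: the legal argument is
# held fixed across conditions and only the advocate descriptor varies.

ARGUMENT = "The contract is void because no consideration was exchanged."

DESCRIPTORS = {
    "high": "a leading appellate advocate with thirty years of experience",
    "low": "a first-year law student",
}

def build_prompt(quality: str) -> str:
    # Identical argument in every condition; only the descriptor changes.
    return (
        f"The following position is argued by {DESCRIPTORS[quality]}:\n"
        f'"{ARGUMENT}"\n'
        "Do you agree with this position? Answer YES or NO."
    )

def agreement_rate(responses: list[str]) -> float:
    # Fraction of YES answers over repeated trials of one condition.
    return sum(r.strip().upper().startswith("YES") for r in responses) / len(responses)
```

Each prompt would be sent to a model many times per condition and the two `agreement_rate` values compared; because the argument string is identical across conditions, any gap is attributable to the descriptor.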
If this is right
- LLM outputs in legal contexts may track the skill of the advocate more than the underlying legal merits.
- Using LLMs as first-instance decision-makers could embed presentation bias into administrative or judicial results.
- Safeguards or calibration steps would be needed before LLMs could be trusted for contested legal questions.
- The same models might produce inconsistent outcomes across equivalent cases simply because one side presented its case more effectively.
Where Pith is reading between the lines
- The same experimental design could be applied to other high-stakes domains where AI is asked to weigh competing arguments, such as regulatory or policy decisions.
- Models could be tested for whether fine-tuning or prompt engineering can reduce sensitivity to advocate framing without losing reasoning capability.
- Direct comparisons between LLM persuadability scores and human judge or administrator scores on matched cases would show whether the effect is unique to current AI systems.
Load-bearing premise
That differences in how the experimental prompts describe or style the advocate accurately reflect real-world differences in legal advocacy quality.
What would settle it
If the same set of legal questions were run across many more models and prompt variations and no reliable difference in agreement rates appeared between high- and low-quality advocate versions, the observed persuadability effect would be undermined.
read the original abstract
As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and administrative contexts, it becomes essential to explore how they answer legal questions, and in particular the factors that lead them to decide difficult questions in one way or another. A specific feature of legal decisions is the need to respond to arguments advanced by contending parties. A legal decision-maker must be able to engage with, and respond to, including through being potentially persuaded by, arguments advanced by the parties. Conversely, they should not be unduly persuadable, influenced by a particularly compelling advocate to decide cases based on the skills of the advocates, rather than the merits of the case. We explore how frontier open- and closed-weights LLMs respond to legal arguments, reporting original experimental results examining how the quality of the advocate making those arguments affects the likelihood that a model will agree with a particular legal point of view, and exploring the factors driving these results. Our results have implications for the feasibility of adopting LLMs across legal and administrative settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that frontier open- and closed-weight LLMs exhibit persuadability when responding to legal arguments, with experimental results showing that the quality of the advocate presenting those arguments affects the likelihood a model will agree with a given legal position. It explores factors driving these outcomes and discusses implications for deploying LLMs as legal decision assistants or first-instance decision-makers.
Significance. If the results are robust, the work is significant for highlighting risks in using LLMs for legal and administrative decisions, particularly the potential for models to be swayed by advocate presentation rather than case merits. The inclusion of both open- and closed-weight frontier models provides a useful comparative lens. The experimental focus on a core legal feature (responsiveness to contending arguments) is a clear strength.
major comments (2)
- [Abstract and §3] (Methods): The abstract states that original experiments examine how advocate quality affects model agreement, yet provides no description of how quality was operationalized, what legal questions were used, sample sizes, or statistical controls. This information is load-bearing for evaluating whether observed agreement shifts reflect genuine persuadability.
- [§3] (Experimental Design): The prompt-based manipulation of advocate quality does not appear to include explicit controls ensuring argument substance remains fixed while only altering presentation style. Without such controls, shifts in agreement rates risk reflecting LLM sensitivity to phrasing patterns or detectable prompt features rather than engagement with legal merits, undermining the central claim.
minor comments (2)
- [§4] (Results): Clarify whether agreement rates are reported with confidence intervals or effect sizes; raw percentages alone make it difficult to judge practical significance.
- [§5] (Discussion): The implications section could more explicitly address how the findings generalize beyond the specific legal questions tested.
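One standard way to report the uncertainty the first minor comment asks for is a pooled two-proportion z-test on the high- versus low-quality agreement rates. A minimal sketch, using invented numbers rather than the paper's data:

```python
from math import sqrt

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    # z statistic for the difference between two agreement rates,
    # using the pooled standard error under the null of no difference.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

For example, agreement of 0.70 versus 0.50 over 100 trials per condition gives z ≈ 2.89, exceeding the 1.96 threshold for significance at the 5% level.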
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment below and indicate where revisions will be incorporated to strengthen the paper.
read point-by-point responses
- Referee: [Abstract and §3] (Methods): The abstract states that original experiments examine how advocate quality affects model agreement, yet provides no description of how quality was operationalized, what legal questions were used, sample sizes, or statistical controls. This information is load-bearing for evaluating whether observed agreement shifts reflect genuine persuadability.
  Authors: We agree that the abstract would be strengthened by a concise summary of key experimental parameters to allow readers to more readily evaluate the results. Section 3 of the manuscript already details the operationalization of advocate quality via prompt-based descriptors of expertise and experience, the standardized legal hypotheticals employed, the trial counts per condition, and the use of regression-based statistical controls. To directly address the referee's concern, we will revise the abstract to include a high-level description of these elements (e.g., prompt manipulation of advocate descriptors on legal hypotheticals with appropriate sample sizes and controls). This will improve accessibility without altering the manuscript's core claims. revision: yes
- Referee: [§3] (Experimental Design): The prompt-based manipulation of advocate quality does not appear to include explicit controls ensuring argument substance remains fixed while only altering presentation style. Without such controls, shifts in agreement rates risk reflecting LLM sensitivity to phrasing patterns or detectable prompt features rather than engagement with legal merits, undermining the central claim.
  Authors: We appreciate the referee's emphasis on isolating the effect of perceived advocate quality. In the reported experiments, the substantive legal arguments were held constant across conditions, with variation limited to the advocate quality descriptor appended to the prompt; the argument text itself remained identical. This design choice was made to test persuadability based on advocate presentation rather than argument content. To further mitigate concerns about prompt sensitivity or detectable features, we will revise §3 to provide a more explicit statement of these controls and include additional robustness checks (such as style-only ablations) in the revised manuscript to confirm that agreement shifts track the quality manipulation. revision: partial
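In the single-manipulation case, a regression-based control of the kind the rebuttal describes reduces to the log-odds-ratio of a 2×2 agreement table. The sketch below (Woolf's standard-error formula, purely hypothetical counts) shows the quantity such a regression would estimate; it is an illustration of the general technique, not the authors' analysis code:

```python
from math import exp, log, sqrt

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    # a = agree | high-quality advocate, b = disagree | high-quality,
    # c = agree | low-quality,           d = disagree | low-quality.
    # log(OR) is the coefficient a logistic regression with a single
    # condition indicator would estimate.
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR), Woolf's method
    return or_, (exp(log(or_) - z * se), exp(log(or_) + z * se))
```

An odds ratio whose confidence interval excludes 1 would indicate that the quality descriptor reliably shifts agreement even after such controls.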
Circularity Check
No circularity: purely descriptive experimental reporting
full rationale
The paper reports original experiments on how LLMs respond to legal arguments varying by advocate quality, with no equations, parameter fitting, predictions derived from inputs, or self-citations that bear the central claim. Claims rest on empirical prompt-based observations rather than any derivation that reduces to its own definitions or fitted values by construction. This matches the default expectation for non-circular experimental work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be used as proxies for legal decision-makers in experimental settings
Reference graph
Works this paper leans on
- [1] Guilherme F.C.F. Almeida et al. 2024. Exploring the Psychology of LLMs' Moral and Legal Reasoning. Artificial Intelligence 333 (Aug. 2024), 104145. doi:10.1016/j.artint.2024.104145
- [2] Amalia Amaya. 2025. Reasoning in Character: Virtue, Legal Argumentation, and Judicial Ethics. Ethic Theory Moral Prac 28, 3 (July 2025), 359–378. doi:10.1007/s10677-023-10414-z
- [3] Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. Can GPT-3 Perform Statutory Reasoning? In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law (ICAIL '23). Association for Computing Machinery, New York, NY, USA, 22–31. doi:10.1145/3594536.3595163
- [4] Andrew Blair-Stanek and Benjamin Van Durme. 2026. LLMs Provide Unstable Answers to Legal Questions. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL '25). Association for Computing Machinery, New York, NY, USA, 425–429. doi:10.1145/3769126.3769245
- [5] Simon Martin Breum et al. 2024. The Persuasive Power of Large Language Models. ICWSM 18 (May 2024), 152–163. doi:10.1609/icwsm.v18i1.31304
- [6]
- [7] Andrew Coan and Harry Surden. 2025. Artificial Intelligence and Constitutional Interpretation. U. Colo. L. Rev. 96, 2 (2025), 413–498
- [8]
- [9] Matthew Dahl et al. 2024. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis 16, 1 (Jan. 2024), 64–93. doi:10.1093/jla/laae003
- [10] Esin Durmus et al. 2024. Measuring the Persuasiveness of Language Models. https://www.anthropic.com/news/measuring-model-persuasiveness
- [11] John Gardner. 2001. Legal Positivism: 5 1/2 Myths. Am. J. Juris. 46 (2001), 199
- [12] Kobi Hackenburg and Helen Margetts. 2024. Evaluating the Persuasive Influence of Political Microtargeting with Large Language Models. Proc. Natl. Acad. Sci. U.S.A. 121, 24 (June 2024), e2403116121. doi:10.1073/pnas.2403116121
- [13] Mateusz Idziejczak et al. 2025. Among Them: A Game-Based Framework for Assessing Persuasion Capabilities of LLMs. In Advances in Knowledge Discovery and Data Mining, Xintao Wu, Myra Spiliopoulou, Can Wang, Vipin Kumar, Longbing Cao, Yanqiu Wu, Yu Yao, and Zhangkai Wu (Eds.). Vol. 15874. Springer Nature Singapore, Singapore, 183–195. doi:10.1007/978-981-9...
- [14] Tianjie Ju et al. 2025. On the Adaptive Psychological Persuasion of Large Language Models. doi:10.48550/arXiv.2506.06800
- [15] Shirish Karande, Santhosh V, and Yash Bhatia. 2024. Persuasion Games with Large Language Models. In Proceedings of the 21st International Conference on Natural Language Processing (ICON). NLP Association of India (NLPAI), Chennai, India, 576–582. https://aclanthology.org/2024.icon-1.67/
- [16] John M. Kelly. 1964. Audi Alteram Partem; Note. Natural Law Forum (1964)
- [17] Jinqi Lai et al. 2024. Large Language Models in Law: A Survey. AI Open 5 (2024), 181–196. doi:10.1016/j.aiopen.2024.09.002
- [18] José Luiz Nunes, Guilherme Almeida, and Brian Flanagan. 2025. Evidence of Conceptual Mastery in the Application of Rules by Large Language Models. doi:10.2139/ssrn.5161877
- [19] OpenAI. 2024. OpenAI o1 System Card. doi:10.48550/arXiv.2412.16720
- [20] Paulina Jo Pesch. 2025. Potentials and Challenges of Large Language Models (LLMs) in the Context of Administrative Decision-Making. Eur. J. Risk Regul. 16, 1 (March 2025), 76–95. doi:10.1017/err.2024.99
- [21] Eric A. Posner and Shivam Saran. 2025. Judge AI: Assessing Large Language Models in Judicial Decision-Making. SSRN 5098708. doi:10.2139/ssrn.5098708
- [22] Alexander Rogiers et al. 2024. Persuasion with Large Language Models: A Survey. arXiv:2411.06837 [cs]
- [23]
- [24] Lawrence B. Solum. 2003. Virtue Jurisprudence: A Virtue-Centred Theory of Judging. Metaphilosophy 34, 1-2 (Jan. 2003), 178–213. doi:10.1111/1467-9973.00268
- [25] Bryan Chen Zhengyu Tan et al. 2025. Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 1550–1575. doi:10.18653/v1/2025.emnlp-main.81
- [26] Cassandra Teigen et al. 2024. Persuasiveness of Arguments with AI-source Labels. Proceedings of the Annual Meeting of the Cognitive Science Society 46 (2024). https://escholarship.org/uc/item/6t82g70v
- [27] Elizaveta Tennant, Stephen Hailes, and Mirco Musolesi. 2025. Moral Alignment for LLM Agents. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=MeGDmZjUXy
- [28] Jeremy Waldron. 2023. The Rule of Law. In The Stanford Encyclopedia of Philosophy (Fall 2023 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2023/entries/rule-of-law/
- [29] Yi Zeng et al. 2024. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 14322–14350. doi:10.18653/v1/2024.acl-long.773
- [30] Kepu Zhang et al. 2025. SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. ACM, Seoul, Republic of Korea, 4117–4127. doi:10.1145/3746252.3761120
- [31]
- [32] Xiaochen Zhu et al. 2025. Conformity in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, 3854–. doi:10.18653/v1/2025.acl-long.195
discussion (0)