Persuadability and LLMs as Legal Decision Tools
Pith reviewed 2026-05-07 13:31 UTC · model grok-4.3
The pith
Frontier LLMs agree more often with legal positions when those positions are advanced by higher-quality advocates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier open- and closed-weight LLMs respond to legal arguments in ways that depend on the quality of the advocate presenting them, with higher-quality advocacy increasing the likelihood that the model agrees with the advocated position. This pattern holds across the tested models and raises direct questions about whether LLMs can serve as neutral or merit-based decision-makers in contested legal settings.
What carries the argument
Controlled prompt experiments that vary advocate quality while holding the legal claim constant, then measuring the resulting change in model agreement rates.
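That design can be sketched in a few lines. Everything below (the argument text, the descriptor wording, the YES/NO protocol) is an illustrative assumption, not the paper's actual materials:

```python
# Illustrative sketch of the controlled-prompt design: the legal argument is
# held fixed across conditions and only the advocate descriptor varies.

ARGUMENT = "The contract is void because no consideration was exchanged."

DESCRIPTORS = {
    "high": "a leading appellate advocate with thirty years of experience",
    "low": "a first-year law student",
}

def build_prompt(quality: str) -> str:
    # Identical argument in every condition; only the descriptor changes.
    return (
        f"The following position is argued by {DESCRIPTORS[quality]}:\n"
        f'"{ARGUMENT}"\n'
        "Do you agree with this position? Answer YES or NO."
    )

def agreement_rate(responses: list[str]) -> float:
    # Fraction of YES answers over repeated trials of one condition.
    return sum(r.strip().upper().startswith("YES") for r in responses) / len(responses)
```

Each prompt would be sent to a model many times per condition and the two `agreement_rate` values compared; because the argument string is identical across conditions, any gap is attributable to the descriptor.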
If this is right
- LLM outputs in legal contexts may track the skill of the advocate more than the underlying legal merits.
- Using LLMs as first-instance decision-makers could embed presentation bias into administrative or judicial results.
- Safeguards or calibration steps would be needed before LLMs could be trusted for contested legal questions.
- The same models might produce inconsistent outcomes across equivalent cases simply because one side presented its case more effectively.
Where Pith is reading between the lines
- The same experimental design could be applied to other high-stakes domains where AI is asked to weigh competing arguments, such as regulatory or policy decisions.
- Models could be tested for whether fine-tuning or prompt engineering can reduce sensitivity to advocate framing without losing reasoning capability.
- Direct comparisons between LLM persuadability scores and human judge or administrator scores on matched cases would show whether the effect is unique to current AI systems.
Load-bearing premise
That differences in how the experimental prompts describe or style the advocate accurately reflect real-world differences in legal advocacy quality.
What would settle it
If the same set of legal questions were run across many more models and prompt variations and no reliable difference in agreement rates appeared between high- and low-quality advocate versions, the observed persuadability effect would be undermined.
read the original abstract
As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and administrative contexts, it becomes essential to explore how they answer legal questions, and in particular the factors that lead them to decide difficult questions in one way or another. A specific feature of legal decisions is the need to respond to arguments advanced by contending parties. A legal decision-maker must be able to engage with, and respond to, including through being potentially persuaded by, arguments advanced by the parties. Conversely, they should not be unduly persuadable, influenced by a particularly compelling advocate to decide cases based on the skills of the advocates, rather than the merits of the case. We explore how frontier open- and closed-weights LLMs respond to legal arguments, reporting original experimental results examining how the quality of the advocate making those arguments affects the likelihood that a model will agree with a particular legal point of view, and exploring the factors driving these results. Our results have implications for the feasibility of adopting LLMs across legal and administrative settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that frontier open- and closed-weight LLMs exhibit persuadability when responding to legal arguments, with experimental results showing that the quality of the advocate presenting those arguments affects the likelihood a model will agree with a given legal position. It explores factors driving these outcomes and discusses implications for deploying LLMs as legal decision assistants or first-instance decision-makers.
Significance. If the results are robust, the work is significant for highlighting risks in using LLMs for legal and administrative decisions, particularly the potential for models to be swayed by advocate presentation rather than case merits. The inclusion of both open- and closed-weight frontier models provides a useful comparative lens. The experimental focus on a core legal feature (responsiveness to contending arguments) is a clear strength.
major comments (2)
- [Abstract and §3] (Methods): The abstract states that original experiments examine how advocate quality affects model agreement, yet provides no description of how quality was operationalized, what legal questions were used, sample sizes, or statistical controls. This information is load-bearing for evaluating whether observed agreement shifts reflect genuine persuadability.
- [§3] (Experimental Design): The prompt-based manipulation of advocate quality does not appear to include explicit controls ensuring argument substance remains fixed while only altering presentation style. Without such controls, shifts in agreement rates risk reflecting LLM sensitivity to phrasing patterns or detectable prompt features rather than engagement with legal merits, undermining the central claim.
minor comments (2)
- [§4] (Results): Clarify whether agreement rates are reported with confidence intervals or effect sizes; raw percentages alone make it difficult to judge practical significance.
- [§5] (Discussion): The implications section could more explicitly address how the findings generalize beyond the specific legal questions tested.
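One standard way to report the uncertainty the first minor comment asks for is a pooled two-proportion z-test on the high- versus low-quality agreement rates. A minimal sketch, using invented numbers rather than the paper's data:

```python
from math import sqrt

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    # z statistic for the difference between two agreement rates,
    # using the pooled standard error under the null of no difference.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

For example, agreement of 0.70 versus 0.50 over 100 trials per condition gives z ≈ 2.89, exceeding the 1.96 threshold for significance at the 5% level.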
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment below and indicate where revisions will be incorporated to strengthen the paper.
read point-by-point responses
- Referee: [Abstract and §3] (Methods): The abstract states that original experiments examine how advocate quality affects model agreement, yet provides no description of how quality was operationalized, what legal questions were used, sample sizes, or statistical controls. This information is load-bearing for evaluating whether observed agreement shifts reflect genuine persuadability.
  Authors: We agree that the abstract would be strengthened by a concise summary of key experimental parameters to allow readers to more readily evaluate the results. Section 3 of the manuscript already details the operationalization of advocate quality via prompt-based descriptors of expertise and experience, the standardized legal hypotheticals employed, the trial counts per condition, and the use of regression-based statistical controls. To directly address the referee's concern, we will revise the abstract to include a high-level description of these elements (e.g., prompt manipulation of advocate descriptors on legal hypotheticals with appropriate sample sizes and controls). This will improve accessibility without altering the manuscript's core claims. revision: yes
- Referee: [§3] (Experimental Design): The prompt-based manipulation of advocate quality does not appear to include explicit controls ensuring argument substance remains fixed while only altering presentation style. Without such controls, shifts in agreement rates risk reflecting LLM sensitivity to phrasing patterns or detectable prompt features rather than engagement with legal merits, undermining the central claim.
  Authors: We appreciate the referee's emphasis on isolating the effect of perceived advocate quality. In the reported experiments, the substantive legal arguments were held constant across conditions, with variation limited to the advocate quality descriptor appended to the prompt; the argument text itself remained identical. This design choice was made to test persuadability based on advocate presentation rather than argument content. To further mitigate concerns about prompt sensitivity or detectable features, we will revise §3 to provide a more explicit statement of these controls and include additional robustness checks (such as style-only ablations) in the revised manuscript to confirm that agreement shifts track the quality manipulation. revision: partial
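In the single-manipulation case, a regression-based control of the kind the rebuttal describes reduces to the log-odds-ratio of a 2×2 agreement table. The sketch below (Woolf's standard-error formula, purely hypothetical counts) shows the quantity such a regression would estimate; it is an illustration of the general technique, not the authors' analysis code:

```python
from math import exp, log, sqrt

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    # a = agree | high-quality advocate, b = disagree | high-quality,
    # c = agree | low-quality,           d = disagree | low-quality.
    # log(OR) is the coefficient a logistic regression with a single
    # condition indicator would estimate.
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR), Woolf's method
    return or_, (exp(log(or_) - z * se), exp(log(or_) + z * se))
```

An odds ratio whose confidence interval excludes 1 would indicate that the quality descriptor reliably shifts agreement even after such controls.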
Circularity Check
No circularity: purely descriptive experimental reporting
full rationale
The paper reports original experiments on how LLMs respond to legal arguments varying by advocate quality, with no equations, parameter fitting, predictions derived from inputs, or self-citations that bear the central claim. Claims rest on empirical prompt-based observations rather than any derivation that reduces to its own definitions or fitted values by construction. This matches the default expectation for non-circular experimental work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be used as proxies for legal decision-makers in experimental settings
Reference graph
Works this paper leans on
- [1] Guilherme F.C.F. Almeida et al. 2024. Exploring the Psychology of LLMs' Moral and Legal Reasoning. Artificial Intelligence 333 (Aug. 2024), 104145. doi:10.1016/j.artint.2024.104145
- [2] Amalia Amaya. 2025. Reasoning in Character: Virtue, Legal Argumentation, and Judicial Ethics. Ethic Theory Moral Prac 28, 3 (July 2025), 359–378. doi:10.1007/s10677-023-10414-z
- [3] Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. Can GPT-3 Perform Statutory Reasoning? In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law (ICAIL '23). Association for Computing Machinery, New York, NY, USA, 22–31. doi:10.1145/3594536.3595163
- [4] Andrew Blair-Stanek and Benjamin Van Durme. 2026. LLMs Provide Unstable Answers to Legal Questions. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL '25). Association for Computing Machinery, New York, NY, USA, 425–429. doi:10.1145/3769126.3769245
- [5] Simon Martin Breum et al. 2024. The Persuasive Power of Large Language Models. ICWSM 18 (May 2024), 152–163. doi:10.1609/icwsm.v18i1.31304
- [6]
- [7] Andrew Coan and Harry Surden. 2025. Artificial Intelligence and Constitutional Interpretation. U. Colo. L. Rev. 96, 2 (2025), 413–498
- [8]
- [9] Matthew Dahl et al. 2024. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis 16, 1 (Jan. 2024), 64–93. doi:10.1093/jla/laae003
- [10] Esin Durmus et al. 2024. Measuring the Persuasiveness of Language Models. https://www.anthropic.com/news/measuring-model-persuasiveness
- [11] John Gardner. 2001. Legal Positivism: 5 1/2 Myths. Am. J. Juris. 46 (2001), 199
- [12] Kobi Hackenburg and Helen Margetts. 2024. Evaluating the Persuasive Influence of Political Microtargeting with Large Language Models. Proc. Natl. Acad. Sci. U.S.A. 121, 24 (June 2024), e2403116121. doi:10.1073/pnas.2403116121
- [13] Mateusz Idziejczak et al. 2025. Among Them: A Game-Based Framework for Assessing Persuasion Capabilities of LLMs. In Advances in Knowledge Discovery and Data Mining, Xintao Wu, Myra Spiliopoulou, Can Wang, Vipin Kumar, Longbing Cao, Yanqiu Wu, Yu Yao, and Zhangkai Wu (Eds.). Vol. 15874. Springer Nature Singapore, Singapore, 183–195. doi:10.1007/978-981-9...
- [14] Tianjie Ju et al. 2025. On the Adaptive Psychological Persuasion of Large Language Models. doi:10.48550/arXiv.2506.06800
- [15] Shirish Karande, Santhosh V, and Yash Bhatia. 2024. Persuasion Games with Large Language Models. In Proceedings of the 21st International Conference on Natural Language Processing (ICON). NLP Association of India (NLPAI), Chennai, India, 576–582. https://aclanthology.org/2024.icon-1.67/
- [16] John M. Kelly. 1964. Audi Alteram Partem; Note. Natural Law Forum (1964)
- [17] Jinqi Lai et al. 2024. Large Language Models in Law: A Survey. AI Open 5 (2024), 181–196. doi:10.1016/j.aiopen.2024.09.002
- [18] José Luiz Nunes, Guilherme Almeida, and Brian Flanagan. 2025. Evidence of Conceptual Mastery in the Application of Rules by Large Language Models. doi:10.2139/ssrn.5161877
- [19] OpenAI. 2024. OpenAI o1 System Card. doi:10.48550/arXiv.2412.16720
- [20] Paulina Jo Pesch. 2025. Potentials and Challenges of Large Language Models (LLMs) in the Context of Administrative Decision-Making. Eur. J. Risk Regul. 16, 1 (March 2025), 76–95. doi:10.1017/err.2024.99
- [21] Eric A. Posner and Shivam Saran. 2025. Judge AI: Assessing Large Language Models in Judicial Decision-Making. SSRN 5098708. doi:10.2139/ssrn.5098708
- [22] Alexander Rogiers et al. 2024. Persuasion with Large Language Models: A Survey. arXiv:2411.06837 [cs]
- [23]
- [24] Lawrence B. Solum. 2003. Virtue Jurisprudence: A Virtue-Centred Theory of Judging. Metaphilosophy 34, 1-2 (Jan. 2003), 178–213. doi:10.1111/1467-9973.00268
- [25] Bryan Chen Zhengyu Tan et al. 2025. Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 1550–1575. doi:10.18653/v1/2025.emnlp-main.81
- [26] Cassandra Teigen et al. 2024. Persuasiveness of Arguments with AI-source Labels. Proceedings of the Annual Meeting of the Cognitive Science Society 46 (2024). https://escholarship.org/uc/item/6t82g70v
- [27] Elizaveta Tennant, Stephen Hailes, and Mirco Musolesi. 2025. Moral Alignment for LLM Agents. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=MeGDmZjUXy
- [28] Jeremy Waldron. 2023. The Rule of Law. In The Stanford Encyclopedia of Philosophy (Fall 2023 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2023/entries/rule-of-law/
- [29] Yi Zeng et al. 2024. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 14322–14350. doi:10.18653/v1/2024.acl-long.773
- [30] Kepu Zhang et al. 2025. SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. ACM, Seoul, Republic of Korea, 4117–4127. doi:10.1145/3746252.3761120
- [31]
- [32] Xiaochen Zhu et al. 2025. Conformity in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, 3854–. doi:10.18653/v1/2025.acl-long.195
discussion (0)